CS6023 GPU Programming

January 2024

Die-hard fans of GPU Programming
Photo courtesy: Isfarul

Important Links

CUDA Slides

Evaluation

Eval. Item	Marks	Student deadline	TA deadline
A1	10	Feb 11	+10 days
A2	15	Mar 3/10	+10 days
A3	15	Apr 7	+10 days
A4/Project	20	~~Apr 28 30~~ May 2	+10 days
MidSem	20	Mar 5	+10 days
EndSem	20	May 6	+10 days

Attendance
Standard institute rules apply.

Other details

Syllabus and structure

Prerequisite: CS2710 (PDS Lab) or equivalent.

TAs: Ramya, Anup, Tarun, Anurag, Barenya, Mathew, Anukul, Chinmay, Khush, Shivam

Instructor: Rupesh Nasre.

Venue: ~~SSB 134~~ CS15

Slot: A (Monday 8, Tuesday 12, Thursday 11, Friday 10)

Schedule

Month	Dates	Topic	Comments

January	18, 19	Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

	22, 23, 25	Computation	Holiday on 26
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

	29, 30	Computation
Thread Divergence
Divergence due to switch
Problem Set 1

February	1, 2	Memory
Memory Coalescing
AoS versus SoA
Barrier

	5, 6, 8, 9	Memory
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory
Dynamic Shared Memory with Multiple Arrays

CUDA GDB
Error Handling
Dangling Pointer

	12, 13, 15, 16	Memory
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

	19, 20, 22, 23	Synchronization
Convolution
Worklist Insertion
Task Donation

	26, 27, 29	Synchronization	Class at 11:00 on 26
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

March	1	Synchronization
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

	4, 5, 7	Synchronization	MidSem on 5
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

	11, 12, 14, 15	Synchronization
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

	18, 19, 21, 22	Functions
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels

	25, 26, 28	Functions	No class on 25
Dynamic Parallelism
Conditional Child Kernels
using Global Device Memory
with Non-Blocking Streams

April	1, 2, 4, 5	Functions
MultiGPU: Number of Devices
Cross-Device Synchronization

	8, 12	Topics
PTX: CUDA Code, Assembly Code
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot

	15, 16, 18, 19	Topics	Classes on DPC++ by Pradeep Ramachandran (KLA) on 18 and 22; institute holiday on 19
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation

	22, 23, 25, 26	Topics	Intel Workshop on April 25
with Shared Variable
Task Distribution
OpenMP Reduction
with HostAlloc'ed Memory
Dynamic Scheduling
OpenCL: Driver, Kernel

	29, 30	Case Study

May	2, 3	Buffer

	6	EndSem from 10:00 to 12:00

GPU Programming Crossword Puzzle

Courtesy: crosswordlabs.com