CS6023 GPU Programming

Syllabus and structure
Prerequisite: CS2710 (Programming and Data Structures Lab) or equivalent.

Evaluation pattern: 60% assignments + 20% midsem + 20% endsem
Five assignments: A1 (7) + A2 (7) + A3 (13) + A4 (13) + C1 (20)

Moodle
Slides:

All codes

TCF rating:
course = 0.93 (institute mean 0.78)
instructor = 0.95 (institute mean 0.81)

TAs: Kavya, Anju, Sumit, Gaurav, Rajesh, Janakiram
Instructor: Rupesh Nasre.
Slot: C (Monday 10, Tuesday 9, Wednesday 8, Friday 12)
Venue: CS26

Deadlines

Eval. Item	Student	TA	Instructor
A1	Feb 9	Feb 19	--
A2	Mar 1	Mar 11	--
A3	Mar 15	Mar 25	--
A4	~~Apr 5~~ Apr 12	~~Apr 15~~ Apr 22	--
C1	~~Apr 26~~ May 10	~~May 6~~	--
MidSem	Mar 3	--	Mar 13
EndSem	~~May 4~~	--	~~May 14~~

Lectures

Month Dates Topic Comments

January 13, 14, 17 Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

20, 21, 22, 24 Computation
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

27, 28, 29, 31 Computation
Thread Divergence
Divergence due to switch
Problem Set 1

February 3, 4, 5, 7 Memory
Memory Coalescing
AoS versus SoA
Barrier

10, 11, 12, 14 Memory, Support
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory

CUDA GDB
Error Handling
Dangling Pointer

17, 18, 19, 21 Memory
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

24, 25, 26, 28 Synchronization
Convolution
Worklist Insertion
Task Donation

March 2, 3, 4, 6 Synchronization MidSem on March 3 in CS24 + CS26
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

9, 11, 13 Synchronization
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

16, 17₍₃₂₎, 18₍₁₂₎, 20₍₂₅₎ Synchronization Classes on Skype (subscripts give the number of attendees)
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

23₍₃₁₎, 24₍₃₁₎, 26₍₃₇₎, 27₍₃₃₎ Functions Class at ~~08:00~~ 10:00 on 23 ~~(Wednesday timetable)~~
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

30₍₃₂₎, 31₍₃₂₎ Streams
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels

April 1₍₂₇₎, 2₍₃₂₎, 3₍₃₂₎ Topics
Dynamic Parallelism
Conditional Child Kernels
using Global Device Memory
with Non-Blocking Streams

6₍₂₇₎, 7₍₂₇₎, 8₍₂₀₎, 10₍₂₇₎ Topics
MultiGPU: Number of Devices
Cross-Device Synchronization
PTX: CUDA Code, Assembly Code
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot

13₍₃₀₎, 14₍₂₃₎, 15₍₁₈₎, 17₍₂₇₎, 18₍₂₃₎ Topics, Case Study ~~Class at 09:00 on 16 (Tuesday timetable)~~
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation
with Shared Variable
Task Distribution
OpenMP Reduction
with HostAlloc'ed Memory
Dynamic Scheduling
OpenCL: Driver, Kernel

20, 21, 22, 24 Topics

27 Doubts session

May 4 ~~EndSem (09:00 -- 11:00 in CS24 + CS26)~~