CS6023 GPU Programming

January 2023

Photo Courtesy: Anmol

Important Links

CUDA Slides

oneAPI / SYCL

TCF rating:

course = 0.9 (institute mean 0.79)

instructor = 0.95 (institute mean 0.81)

Evaluation

Eval. Item	Marks	Student deadline	TA deadline	Instructor deadline
A1	10	Feb 12	+10 days	--
A2	15	Mar 5	+10 days	--
A3	15	Apr 2	+10 days	--
A4/Project	20	Apr 23	+10 days	--
MidSem	20	Mar 2	--	+10 days
EndSem	20	May 1	--	+10 days

Attendance
Standard institute rules apply.

Other details

Syllabus and structure

Prerequisite: CS2710 (PDS Lab) or Equivalent.

TAs:

Instructor: Rupesh Nasre.

Venue: CS36

Slot: A (Monday 8, Tuesday 12, Thursday 11, Friday 10)

Schedule

Month	Dates	Topic	Comments

January	16, 17, 19, 20	Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

January	23, 24, 27	Computation	Holiday on 26
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

January	30, 31	Computation
Thread Divergence
Divergence due to switch
Problem Set 1

February	2, 3	Memory
Memory Coalescing
AoS versus SoA
Barrier

February	6, 7, 9, 10	Memory	Intel lecture on oneAPI on 10, in CS25 from 9:30 to 12:30
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory
Dynamic Shared Memory with Multiple Arrays

CUDA GDB
Error Handling
Dangling Pointer

February	13, 14, 16, 17	Memory
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

February	20, 21, 23, 24	Synchronization
Convolution
Worklist Insertion
Task Donation

February	27, 28	Synchronization
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

March	2, 3	Synchronization	MidSem on 2
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

March	6, 7, 9, 10	Synchronization
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

March	13, 14, 16, 17	Synchronization	Guest lectures by Dr. Pradeep Ramachandran on SYCL / DPC++
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

March	20, 21, 23, 24	Synchronization
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels

March	27, 28, 30, 31	Support, Functions	Class at 10:00 on 27
Dynamic Parallelism
Conditional Child Kernels
using Global Device Memory
with Non-Blocking Streams

April	1	Functions
MultiGPU: Number of Devices
Cross-Device Synchronization

April	3, 6	Topics	Holidays on 4 and 7
PTX: CUDA Code, Assembly Code
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot

April	10, 11, 13, 14	Topics
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation

April	17, 18, 20, 21	Topics
with Shared Variable
Task Distribution
OpenMP Reduction
with HostAlloc'ed Memory
Dynamic Scheduling
OpenCL: Driver, Kernel

April	24, 25, 27	Case Study

May	1	EndSem	9:00 to 12:00

GPU Programming Crossword Puzzle

Courtesy: crosswordlabs.com

Happy faces -- since the course was coming to an end (Photo Courtesy: Anmol)