CS6023 GPU Programming

January 2024

Die-hard fans of GPU Programming
Photo courtesy: Isfarul

Important Links

CUDA Slides
  1. Intro + Logistics
  2. Computation
  3. Memory
  4. Synchronization
  5. Functions
  6. Support
  7. Streams
  8. Topics
  9. Case Study -- Graphs

  Evaluation
Eval. Item    Marks   Deadline (Student)   Deadline (TA)
A1            10      Feb 11               +10 days
A2            15      Mar 3/10             +10
A3            15      Apr 7                +10
A4/Project    20      Apr 28 30 May 2      +10
MidSem        20      Mar 5                +10
EndSem        20      May 6                +10

Attendance
Standard institute rules apply.

Other details

  • Syllabus and structure
  • Prerequisite: CS2710 (PDS Lab) or Equivalent.
  • TAs: Ramya, Anup, Tarun, Anurag, Barenya, Mathew, Anukul, Chinmay, Khush, Shivam
  • Instructor: Rupesh Nasre.
  • Venue: SSB 134 CS15
  • Slot: A (Monday 8, Tuesday 12, Thursday 11, Friday 10)


Schedule
 Month   Dates  Topic  Comments
 January   18, 19  Introduction, Computation  
  • Hello World, One, Two, Three
  • Grid, Blocks, Threads
  • Kernel Launch: 1D, 1D-General, 2D
           22, 23, 25  Computation  Holiday on 26
  • CPU-GPU Communication (cudaMalloc, cudaMemcpy)
  • Global variables
  • Matrix mult.: CPU, Outer parallel, Outer+Inner parallel
           29, 30  Computation  
  • Thread Divergence
  • Divergence due to switch
  • Problem Set 1
 February  1, 2  Memory  
  • Memory Coalescing
  • AoS versus SoA
  • Barrier
           5, 6, 8, 9  Memory  
  • Linked List Copying
  • Shared Memory
  • Shared Memory with Barrier
  • String Permutation
  • Dynamic Shared Memory
  • Dynamic Shared Memory with Multiple Arrays

    CUDA GDB
  • Error Handling
  • Dangling Pointer
           12, 13, 15, 16  Memory  
  • Texture Memory (via CUDA SDK)
  • Constant Memory
  • Bank Conflicts
  • Problem Set 2

    NvProf
  • Original Code
  • Loop Fusion
  • Kernel Fusion
  • Converting Loop to Blocks
           19, 20, 22, 23  Synchronization  
  • Convolution
  • Worklist Insertion
  • Task Donation
           26, 27, 29  Synchronization  Class at 11:00 on 26
  • Reduction: i + N/2, N - i, i + 1
  • Prefix Sum / Scan
 March   1  Synchronization  
  • No Global Barrier
  • Global Barrier using Atomics
  • Hierarchical Global Barrier
           4, 5, 7  Synchronization  MidSem on 5
  • Linked List Insertion
  • CPU-GPU Shared Pinned Memory
  • Persistent Kernel
  • Problem Set 3
           11, 12, 14, 15  Synchronization  
  • Array increment: Sequential, Parallel
  • Thrust basics
  • Thrust Reduction
  • Thrust Prefix Sum
  • Thrust-like device vector implementation
           18, 19, 21, 22  Functions  
  • Basic Stream Program
  • with Asynchronous memcpy
  • with cudaHostAlloc
  • Cooperative Kernels
           25, 26, 28  Functions  No class on 25
  • Dynamic Parallelism
  • Conditional Child Kernels
  • using Global Device Memory
  • with Non-Blocking Streams
 April   1, 2, 4, 5  Functions  
  • MultiGPU: Number of Devices
  • Cross-Device Synchronization
           8, 12  Topics  
  • PTX: CUDA Code, Assembly Code
  • Basic Warp Voting
  • Converting Mask to Count (popc)
  • Use of ffs
  • Conditional Participation in ballot
           15, 16, 18, 19  Topics  Classes on DPC++ by Pradeep Ramachandran, KLA on 18 and 22. Institute holiday on 19.
  • Loop Unrolling, Unrolled Assembly
  • Heterogeneous Computation
           22, 23, 25, 26  Topics  Intel Workshop on April 25
  • with Shared Variable
  • Task Distribution
  • OpenMP Reduction
  • with HostAlloc'ed Memory
  • Dynamic Scheduling
  • OpenCL: Driver, Kernel
           29, 30  Case Study  
     May   2, 3  Buffer  
        6    EndSem from 10:00 -- 12:00


    GPU Programming Crossword Puzzle

    Courtesy: crosswordlabs.com