CS6023 GPU Programming


  • Moodle
  • Video Lectures
  • All codes
  • FAQ one and two

    Evaluation pattern:

    • 60% assignments + 20% midsem + 20% endsem
    • 5 assignments: A1(7) + A2(7) + A3(13) + A4(13) + C1(20)
    • Course project demo: a three-slide presentation (slide 1: title, names, and roll numbers; slide 2: problem statement and results; slide 3: challenges faced and how you resolved them), followed by running the code over screen sharing. You will be asked to change inputs and rerun the code.

    Slides:

    1. Intro + Logistics
    2. Computation
    3. Memory
    4. Synchronization
    5. Functions
    6. Support
    7. Streams
    8. Topics
    9. Case Study -- Graphs

    TCF rating:

  • course = 0.89 (institute mean 0.80)
  • instructor = 0.91 (institute mean 0.82)
  • Full report

    Deadlines:

    Eval. Item   Student      TA         Instructor
    A1           Feb 21       +10 days   --
    A2           Mar 7        +10 days   --
    A3           Apr 4 6 7    +10 days   --
    A4           Apr 18 20    +10 days   --
    C1           May 2 7      +10 days   --
    MidSem       Mar 19       --         +10 days
    EndSem       May 17       --         +10 days

  • Syllabus and structure
  • Prerequisite: CS2710 (PDS Lab) or equivalent.
  • TAs: Jash, Pawan, Anshul, Shanu, Saurav, Panga, Keval, Shouvick.
  • Instructor: Rupesh Nasre.
  • Slot: C (Monday 10, Tuesday 9, Wednesday 8, Friday 12)
  • Venue: CS26 Google Meet

  • Lectures
    Month and dates: Topic (Comments)

    February 1, 2, 3, 5: Introduction, Computation
  • Hello World, One, Two, Three
  • Grid, Blocks, Threads
  • Kernel Launch: 1D, 1D-General, 2D

    February 8, 9, 10, 12: Computation
  • CPU-GPU Communication (cudaMalloc, cudaMemcpy)
  • Global variables
  • Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

    February 15, 16, 17, 19: Computation
  • Thread Divergence
  • Divergence due to switch
  • Problem Set 1

    February 22, 23, 24, 26: Memory
  • Memory Coalescing
  • AoS versus SoA
  • Barrier

    March 1, 2, 3, 5: Memory, Support
  • Linked List Copying
  • Shared Memory
  • Shared Memory with Barrier
  • String Permutation
  • Dynamic Shared Memory
  • Dynamic Shared Memory with Multiple Arrays

    CUDA GDB
  • Error Handling
  • Dangling Pointer

    March 8, 9, 10, 12: Memory
  • Texture Memory (via CUDA SDK)
  • Constant Memory
  • Bank Conflicts
  • Problem Set 2

    NvProf
  • Original Code
  • Loop Fusion
  • Kernel Fusion
  • Converting Loop to Blocks

    March 15, 16, 17, 19: Synchronization
  • Convolution
  • Worklist Insertion
  • Task Donation

    March 22, 23, 24, 26: Synchronization
  • Reduction: i + N/2, N - i, i + 1
  • Prefix Sum / Scan

    March 29, 30, 31: Synchronization (No classes due to semester break.)
  • No Global Barrier
  • Global Barrier using Atomics
  • Hierarchical Global Barrier

    April 2: Synchronization (No classes due to semester break.)
  • Linked List Insertion
  • CPU-GPU Shared Pinned Memory
  • Persistent Kernel
  • Problem Set 3

    April 5, 6, 7, 9: Functions
  • Array increment: Sequential, Parallel
  • Thrust basics
  • Thrust Reduction
  • Thrust Prefix Sum
  • Thrust-like device vector implementation

    April 12, 13, 14, 16: Streams
  • Basic Stream Program
  • with Asynchronous memcpy
  • with cudaHostAlloc
  • Cooperative Kernels

    April 19, 20, 21, 23: Topics
  • Dynamic Parallelism
  • Conditional Child Kernels
  • using Global Device Memory
  • with Non-Blocking Streams

    April 26, 27, 28, 30: Topics
  • MultiGPU: Number of Devices
  • Cross-Device Synchronization
  • PTX: CUDA Code, Assembly Code
  • Basic Warp Voting
  • Converting Mask to Count (popc)
  • Use of ffs
  • Conditional Participation in ballot

    May 3, 4, 5, 7: Topics, Case Study
  • Loop Unrolling, Unrolled Assembly
  • Heterogeneous Computation
  • with Shared Variable
  • Task Distribution
  • OpenMP Reduction
  • with HostAlloc'ed Memory
  • Dynamic Scheduling
  • OpenCL: Driver, Kernel