CS6023 GPU Programming


Syllabus and structure
Prerequisite: CS2710 (Programming and Data Structures Lab) or equivalent.

Evaluation pattern: 60% assignments + 20% midsem + 20% endsem
Five assignments: A1 (7) + A2 (7) + A3 (13) + A4 (13) + C1 (20)
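
The weights above can be checked with a short sketch. It assumes the final score is a plain weighted sum of the fraction of marks earned on each item, which the page does not state explicitly. The names `ASSIGNMENT_WEIGHTS`, `EXAM_WEIGHTS`, and `course_total` are hypothetical and used only for illustration.

```python
# Weights (in percent) as listed in the evaluation pattern above.
ASSIGNMENT_WEIGHTS = {"A1": 7, "A2": 7, "A3": 13, "A4": 13, "C1": 20}
EXAM_WEIGHTS = {"MidSem": 20, "EndSem": 20}


def course_total(fractions):
    """Return the weighted total out of 100.

    `fractions` maps each item name to the fraction of marks earned
    (0.0 to 1.0). Items that are missing count as zero.
    Assumption: the total is a simple weighted sum of these fractions.
    """
    weights = {**ASSIGNMENT_WEIGHTS, **EXAM_WEIGHTS}
    return sum(w * fractions.get(item, 0.0) for item, w in weights.items())


# The five assignment weights add up to the stated 60%.
assert sum(ASSIGNMENT_WEIGHTS.values()) == 60
# Together with the two exams, the weights cover the full 100%.
assert sum(ASSIGNMENT_WEIGHTS.values()) + sum(EXAM_WEIGHTS.values()) == 100
```

Under this assumption, earning half the marks on C1 alone contributes 10 points to the total, since C1 carries 20 of the 100.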

Moodle
Slides:

  1. Intro + Logistics
  2. Computation
  3. Memory
  4. Synchronization
  5. Functions
  6. Support
  7. Streams
  8. Topics
  9. Case Study -- Graphs
All codes

TCF rating:

  • course = 0.93 (institute mean 0.78)
  • instructor = 0.95 (institute mean 0.81)
  • Full report

    TAs: Kavya, Anju, Sumit, Gaurav, Rajesh, Janakiram
    Instructor: Rupesh Nasre.
    Slot: C (Monday 10, Tuesday 9, Wednesday 8, Friday 12)
    Venue: CS26

    Deadlines
    Eval. Item   Student          TA          Instructor
    A1           Feb 9            Feb 19      --
    A2           Mar 1            Mar 11      --
    A3           Mar 15           Mar 25      --
    A4           Apr 5 12         Apr 15 22   --
    C1           Apr 26 May 10    May 6       --
    MidSem       Mar 3            --          Mar 13
    EndSem       May 4            --          May 14

    Lectures
    Month  Dates  Topic  Comments
     January  13, 14, 17  Introduction, Computation  
  • Hello World, One, Two, Three
  • Grid, Blocks, Threads
  • Kernel Launch: 1D, 1D-General, 2D
       20, 21, 22, 24  Computation
  • CPU-GPU Communication (cudaMalloc, cudaMemcpy)
  • Global variables
  • Matrix mult.: CPU, Outer parallel, Outer+Inner parallel
       27, 28, 29, 31  Computation
  • Thread Divergence
  • Divergence due to switch
  • Problem Set 1
     February  3, 4, 5, 7  Memory
  • Memory Coalescing
  • AoS versus SoA
  • Barrier
       10, 11, 12, 14  Memory, Support
  • Linked List Copying
  • Shared Memory
  • Shared Memory with Barrier
  • String Permutation
  • Dynamic Shared Memory

    CUDA GDB
  • Error Handling
  • Dangling Pointer
       17, 18, 19, 21  Memory
  • Texture Memory (via CUDA SDK)
  • Constant Memory
  • Bank Conflicts
  • Problem Set 2

    NvProf
  • Original Code
  • Loop Fusion
  • Kernel Fusion
  • Converting Loop to Blocks
       24, 25, 26, 28  Synchronization
  • Convolution
  • Worklist Insertion
  • Task Donation
     March  2, 3, 4, 6  Synchronization  MidSem on 3 in CS24 + CS26
  • Reduction: i + N/2, N - i, i + 1
  • Prefix Sum / Scan
       9, 11, 13  Synchronization
  • No Global Barrier
  • Global Barrier using Atomics
  • Hierarchical Global Barrier
       16, 17(32), 18(12), 20(25)  Synchronization  Classes on Skype (number of attendees)
  • Linked List Insertion
  • CPU-GPU Shared Pinned Memory
  • Persistent Kernel
  • Problem Set 3
       23(31), 24(31), 26(37), 27(33)  Functions  Class at 08:00 10:00 on 23 (Wednesday timetable)
  • Array increment: Sequential, Parallel
  • Thrust basics
  • Thrust Reduction
  • Thrust Prefix Sum
  • Thrust-like device vector implementation
       30(32), 31(32)  Streams
  • Basic Stream Program
  • with Asynchronous memcpy
  • with cudaHostAlloc
  • Cooperative Kernels
     April  1(27), 2(32), 3(32)  Topics
  • Dynamic Parallelism
  • Conditional Child Kernels
  • using Global Device Memory
  • with Non-Blocking Streams
       6(27), 7(27), 8(20), 10(27)  Topics
  • MultiGPU: Number of Devices
  • Cross-Device Synchronization
  • PTX: CUDA Code, Assembly Code
  • Basic Warp Voting
  • Converting Mask to Count (popc)
  • Use of ffs
  • Conditional Participation in ballot
       13(30), 14(23), 15(18), 17(27), 18(23)  Topics, Case Study  Class at 09:00 on 16 (Tuesday timetable)
  • Loop Unrolling, Unrolled Assembly
  • Heterogeneous Computation
  • with Shared Variable
  • Task Distribution
  • OpenMP Reduction
  • with HostAlloc'ed Memory
  • Dynamic Scheduling
  • OpenCL: Driver, Kernel
       20, 21, 22, 24  Topics
       27  Doubts session  
     May   4     EndSem (09:00 -- 11:00 in CS24 + CS26)