GPU Programming with CUDA

NSM Nodal Centre for Training in HPC and AI


Introduction
- history, graphics processors, graphics processing units, GPGPUs
- clock speeds, CPU / GPU comparisons, heterogeneity
- accelerators, parallel programming, CUDA / OpenCL / OpenACC, Hello World
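A minimal CUDA "Hello World" of the kind this module builds toward might look as follows; the kernel name `hello` and the launch configuration are illustrative, not prescribed by the syllabus.

```cuda
#include <cstdio>

// Each GPU thread prints its own global index.
__global__ void hello() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from thread %d\n", tid);
}

int main() {
    hello<<<2, 4>>>();          // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait so the kernel's printf output is flushed
    return 0;
}
```

Assuming the NVIDIA toolchain is installed, this compiles with `nvcc hello.cu -o hello`.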

Kernels and thread hierarchy
- kernels, launch parameters
- thread hierarchy, warps / wavefronts, thread blocks / workgroups, streaming multiprocessors
- 1D / 2D / 3D thread mapping, device properties, simple programs
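As a sketch of 2D thread mapping, the kernel below (names `fill`, `R`, `C` are illustrative) derives a (row, col) pair from block and thread indices and guards against threads that fall outside the matrix.

```cuda
#include <cstdio>

// Map a 2D grid of threads onto an R x C matrix.
__global__ void fill(int *a, int nrows, int ncols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows && col < ncols)          // guard: the grid may overshoot
        a[row * ncols + col] = row * ncols + col;
}

int main() {
    const int R = 37, C = 53;
    int *d_a;
    cudaMalloc(&d_a, R * C * sizeof(int));
    dim3 block(16, 16);                      // 16x16 threads per block
    dim3 grid((C + 15) / 16, (R + 15) / 16); // enough blocks to cover the matrix
    fill<<<grid, block>>>(d_a, R, C);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```

The rounding-up in the grid dimensions is why the in-kernel bounds check is needed: block sizes rarely divide the matrix dimensions exactly.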

Memory
- memory hierarchy, DRAM / global, local / shared, private / local, textures, constant memory
- pointers, parameter passing, arrays and dynamic memory, multi-dimensional arrays
- memory allocation, memory copying across devices
- programs with matrices, performance evaluation with different memories
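The allocate/copy/compute/copy-back pattern underlying these exercises can be sketched as below; the kernel `scale` and the vector size are illustrative.

```cuda
#include <cstdio>

__global__ void scale(float *v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main() {
    const int N = 1024;
    float h[N];
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, N * sizeof(float));                           // global-memory allocation
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice); // host -> device
    scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d);
    printf("h[0] = %f\n", h[0]);
    return 0;
}
```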

Synchronization
- memory consistency
- barriers (local versus global), atomics, memory fence
- prefix sum, reduction
- programs for concurrent data structures such as worklists and linked lists
- synchronization across CPU and GPU
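Barriers, shared memory, and atomics come together in the classic reduction; the following sketch (kernel name `sum` and 256-thread blocks are assumptions) reduces within each block under `__syncthreads()` barriers, then combines blocks with one `atomicAdd` each.

```cuda
#include <cstdio>

__global__ void sum(const int *in, int *out, int n) {
    __shared__ int buf[256];                 // one partial sum per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();                         // barrier: buffer fully written
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                     // barrier after each halving step
    }
    if (threadIdx.x == 0)
        atomicAdd(out, buf[0]);              // one global atomic per block
}

int main() {
    const int N = 1000;
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = 1;
    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));
    cudaMemcpy(d_in, h, N * sizeof(int), cudaMemcpyHostToDevice);
    sum<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    int total;
    cudaMemcpy(&total, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("total = %d\n", total);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

`__syncthreads()` is a block-local barrier; the atomic is what stands in for the missing grid-wide barrier across blocks.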

Functions and libraries
- device functions, host functions, kernels, functors
- using libraries (such as Thrust), developing libraries
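With Thrust, a reduction like the one above shrinks to a few lines; this sketch sums 1..100 on the device without any hand-written kernel.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> v(100);
    thrust::sequence(v.begin(), v.end(), 1);        // fill with 1, 2, ..., 100 on the GPU
    int s = thrust::reduce(v.begin(), v.end(), 0);  // parallel sum on the device
    printf("sum = %d\n", s);
    return 0;
}
```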

Debugging and profiling
- debugging GPU programs
- profiling, profiling tools, performance aspects
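A common first debugging step, before reaching for cuda-gdb or compute-sanitizer, is to check every runtime call; a typical macro (the name `CUDA_CHECK` is a convention, not an API) looks like this:

```cuda
#include <cstdio>
#include <cstdlib>

// Report any failing CUDA runtime call with file and line, then exit.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err));                       \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

int main() {
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```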

Streams and events
- asynchronous processing, tasks, task-dependence
- overlapped data transfers, default stream, synchronization with streams
- events, event-based synchronization
- overlapping data transfer and kernel execution, pitfalls
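The overlap pattern can be sketched with two streams: the input is split in half so the copy in one stream can run while the kernel in the other executes (names `inc`, `HALF` are illustrative; pinned host memory is the usual prerequisite, and a common pitfall, for async copies).

```cuda
#include <cstdio>

__global__ void inc(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

int main() {
    const int N = 1 << 20, HALF = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each half runs copy-in, kernel, copy-out in its own stream,
    // so transfers in one stream overlap compute in the other.
    for (int k = 0; k < 2; ++k) {
        int off = k * HALF;
        cudaMemcpyAsync(d + off, h + off, HALF * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        inc<<<(HALF + 255) / 256, 256, 0, s[k]>>>(d + off, HALF);
        cudaMemcpyAsync(h + off, d + off, HALF * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();                 // wait for both streams to drain

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Work issued without an explicit stream goes to the default stream, which (in its legacy form) serializes with the other streams; that interaction is one of the pitfalls this module covers.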

Case studies
- graph algorithms

Advanced topics
- dynamic parallelism
- unified virtual memory
- multi-GPU processing
- peer access
- heterogeneous processing
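As a taste of unified memory from the advanced topics, `cudaMallocManaged` gives one pointer usable from both CPU and GPU; this sketch (kernel name `inc` is illustrative) relies on `cudaDeviceSynchronize()` before the host reads the data back.

```cuda
#include <cstdio>

__global__ void inc(int *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

int main() {
    const int N = 256;
    int *v;
    cudaMallocManaged(&v, N * sizeof(int));  // one pointer, visible to CPU and GPU
    for (int i = 0; i < N; ++i) v[i] = i;    // host writes directly, no cudaMemcpy
    inc<<<1, N>>>(v, N);
    cudaDeviceSynchronize();                 // required before the host touches v again
    printf("v[0] = %d\n", v[0]);
    cudaFree(v);
    return 0;
}
```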