GPU Computing and Graphics with Python

Learn the GPU and CUDA programming model and computer graphics from scratch: parallel kernels, the memory hierarchy, tiled matmul, transforms, rasterization, and ray tracing.

10 projects, 250 hands-on levels, run in your browser.

Syllabus

  • The GPU Programming Model: A GPU runs the same small program, called a kernel, across thousands of data elements at once. Build the mental model from the ground up: turn serial loops into parallel kernels, compute the thread indices that tell each parallel worker which element it owns, write kernels, and launch them over a grid of blocks and threads. Everything is simulated in NumPy, but the programming model mirrors CUDA's.
  • Data-Parallel Patterns: Almost every GPU algorithm is built from a handful of parallel patterns. Build map (transform every element), reduce (combine all elements, in logarithmic depth), scan (running totals), and gather/scatter (data movement by index), then compose them. These patterns, not raw kernels, are how real parallel programs are designed.
  • The Memory Hierarchy: On a GPU, performance is usually decided by memory, not arithmetic. Build a model of the memory hierarchy (registers, shared, global), the coalescing that makes memory access fast or slow, shared-memory tiling that turns slow global reads into fast reuse, and arithmetic intensity with the roofline model that predicts whether a kernel is memory-bound or compute-bound.
  • Linear Algebra on the GPU: Matrix multiplication is the GPU's flagship workload, the engine of graphics and deep learning. Build it from the ground up: vector operations, the matrix-vector product, the naive matrix multiply as a grid of dot products, the tiled matmul that loads data into shared memory for reuse, and the arithmetic-intensity win that makes it compute-bound. The tiled matmul is arguably the single most important GPU kernel.
  • Vectors and Transforms: Computer graphics is geometry, and geometry is vectors and matrices. Build 3D vectors (dot, cross, normalize), 4x4 matrices, the translate/scale/rotate transforms, homogeneous coordinates that unify them, and the model-view-projection pipeline that places a 3D point on a 2D screen. This is the math every GPU runs for every vertex of every frame.
  • The Rasterization Pipeline: Rasterization, the algorithm behind real-time graphics, turns triangles into pixels. Build the triangle and its area, the edge function and barycentric coordinates that test whether a pixel is inside, the rasterizer that fills a triangle into an image, the z-buffer that resolves what is in front, and a complete triangle renderer that shades pixels by interpolation.
  • Ray Tracing: Ray tracing renders by following rays from the camera into the scene; it is the technique behind photorealistic images and modern RTX hardware. Build rays, ray-sphere intersection, surface normals and diffuse shading, the camera that casts a ray per pixel, and a complete ray tracer that renders a lit sphere. It is embarrassingly parallel: one independent ray per pixel.
  • Shaders and Image Processing: A shader is a tiny program run per pixel, and image processing is shaders applied to images. Build per-pixel kernels over a coordinate grid, color operations (grayscale, invert, gamma, contrast), 2D convolution and its filters (blur, edge, sharpen), the apply-kernel-to-image loop, and a filter pipeline. Every output pixel is computed independently, which makes this an ideal GPU workload.
  • Particles and Physics on the GPU: Particle systems are the GPU's natural simulation workload: thousands of independent particles, each integrated forward in parallel. Build the particle state, Euler and Verlet integration, forces (gravity, drag, springs), an N-body gravitational simulation with the all-pairs kernel, and a fireworks capstone, all vectorized as data-parallel updates.
  • Capstone: A Complete GPU Renderer: The grand capstone of the track: assemble everything (the parallel model, transforms, rasterization, ray tracing, shading, and the data-parallel mindset) into a complete software renderer. Build the scene, transform geometry through the pipeline, rasterize with a z-buffer, shade with lighting, composite a full image, and reflect on how every stage is a GPU kernel.
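
To make the launch model from the first syllabus item concrete, here is a minimal sketch of a simulated kernel launch. It is an illustration under stated assumptions, not the course's own code: the names launch and vector_add_kernel are hypothetical, the grid is one-dimensional, and plain Python lists stand in for NumPy arrays to keep the example dependency-free. The threads run one after another in a loop, so only the indexing is modelled, not the parallel execution.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by every simulated thread (hypothetical example kernel)."""
    # Global index: which element this thread owns (1D grid assumed).
    i = block_idx * block_dim + thread_idx
    # Bounds check: the last block may contain threads past the end of the data.
    if i < len(out):
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair, serially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10                                        # number of data elements
block_dim = 4                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # ceiling division: 3 blocks

a = [float(i) for i in range(n)]
b = [2.0] * n
out = [0.0] * n

launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

Three blocks of four threads give twelve simulated threads for ten elements; the bounds check is what keeps the two surplus threads from writing out of range.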

Key concepts

  • Arithmetic intensity: The ratio of compute (FLOPs) to memory traffic (bytes). The roofline model uses it to tell whether a kernel is compute- or memory-bound.
  • Kernel: A function launched across a grid of threads, each running the same code on its own data element. It is the basic unit of the GPU programming model.
  • Memory coalescing: When consecutive threads read consecutive addresses, the hardware merges the accesses into as few memory transactions as possible. This is the key to using the available memory bandwidth.
  • N-body: Simulating mutual forces among N particles; the all-pairs version is O(N^2) and a classic compute-bound GPU workload.
  • Rasterization: Turning triangles into pixels: for each pixel, test coverage with barycentric coordinates and keep the nearest surface via a depth buffer.
  • Ray tracing: Rendering by shooting rays from the camera through pixels and intersecting scene geometry, then shading the hit (e.g., Lambert).
  • Roofline model: A chart bounding achievable performance by peak compute and memory bandwidth versus arithmetic intensity, showing the limiting resource.
  • Shared memory / tiling: Fast on-chip memory a block shares; loading a tile once and reusing it (tiling) cuts slow global-memory traffic, as in tiled matmul.
  • SIMT: Single Instruction, Multiple Threads: the GPU runs groups of threads (warps) in lockstep on the same instruction, the source of its throughput.
  • Thread, block, grid: The GPU launch hierarchy: threads are grouped into blocks, blocks into a grid. Each thread computes its global index to pick its data.
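
The arithmetic-intensity and roofline entries above can be worked through with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the machine figures (10,000 GFLOP/s peak, 500 GB/s bandwidth) are made-up round numbers rather than any real GPU, and the matmul estimate assumes the best case in which each float32 matrix element crosses the memory bus exactly once.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def classify(intensity, peak_gflops, bandwidth_gbs):
    """Compare intensity with the ridge point where the two roofs meet."""
    ridge = peak_gflops / bandwidth_gbs
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical machine (assumed figures, not a real device).
PEAK = 10_000.0   # GFLOP/s
BW = 500.0        # GB/s, so the ridge point is 20 FLOPs per byte

# float32 vector add: 1 FLOP per element, 12 bytes (two reads, one write).
vec_add_ai = arithmetic_intensity(1, 12)

# N x N float32 matmul, best-case reuse: 2*N^3 FLOPs, 3 matrices of N^2 * 4 bytes.
n = 1024
matmul_ai = arithmetic_intensity(2 * n**3, 3 * n * n * 4)

for name, ai in [("vector add", vec_add_ai), ("matmul", matmul_ai)]:
    bound = attainable_gflops(ai, PEAK, BW)
    print(f"{name}: {ai:.2f} FLOP/byte, {classify(ai, PEAK, BW)}, "
          f"at most {bound:.1f} GFLOP/s")
```

Under these assumptions the vector add sits far to the left of the ridge point and is limited by bandwidth, while the matmul with ideal reuse sits to the right and is limited by peak compute. Tiling is the technique that moves a real matmul toward that ideal.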