GPU Computing and Graphics with Python

Learn the GPU and CUDA programming model and computer graphics from scratch: parallel kernels, the memory hierarchy, tiled matmul, transforms, rasterization, and ray tracing.

11 projects, 275 hands-on levels, run in your browser.

Syllabus

Foundations: code through GPU computing: Never written code before? Start here. You will learn the basics of Python (output, variables, types, decisions, loops, and functions) through GPU-flavored examples: threads, arrays, and parallel sums. By the end you are ready for Project 1.
The GPU Programming Model: A GPU runs the same small program, a kernel, across thousands of data elements at once. Build the mental model from the ground up: turn serial loops into parallel kernels, compute the thread indices that tell each parallel worker which element it owns, write kernels, and launch them over a grid of blocks and threads. Everything is simulated in NumPy, but the programming model is the same one CUDA uses.
Data-Parallel Patterns: Almost every GPU algorithm is built from a handful of parallel patterns. Build map (transform every element), reduce (combine all elements, in logarithmic depth), scan (running totals), and gather/scatter (data movement by index), then compose them. These patterns, not raw kernels, are how real parallel programs are designed.
The Memory Hierarchy: On a GPU, performance is usually decided by memory, not arithmetic. Build a model of the memory hierarchy (registers, shared memory, global memory), the coalescing that makes memory access fast or slow, the shared-memory tiling that turns slow global reads into fast reuse, and arithmetic intensity, which the roofline model uses to predict whether a kernel is memory-bound or compute-bound.
Linear Algebra on the GPU: Matrix multiplication is the GPU's flagship workload, the engine of graphics and deep learning. Build it from the ground up: vector operations, the matrix-vector product, the naive matrix multiply as a grid of dot products, the tiled matmul that loads data into shared memory for reuse, and the arithmetic-intensity win that makes it compute-bound. Tiled matmul is arguably the single most important GPU kernel.
Vectors and Transforms: Computer graphics is geometry, and geometry is vectors and matrices. Build 3D vectors (dot, cross, normalize), 4x4 matrices, the translate/scale/rotate transforms, homogeneous coordinates that unify them, and the model-view-projection pipeline that places a 3D point on a 2D screen. This is the math every GPU runs for every vertex of every frame.
The Rasterization Pipeline: Rasterization turns triangles into pixels, the algorithm behind real-time graphics. Build the triangle and its area, the edge function and barycentric coordinates that test whether a pixel is inside, the rasterizer that fills a triangle into an image, the z-buffer that resolves what is in front, and a complete triangle renderer that shades pixels by interpolation.
Ray Tracing: Ray tracing renders by tracing rays from the camera into the scene, following light paths in reverse. It is the technique behind photorealistic images and modern RTX hardware. Build rays, ray-sphere intersection, surface normals and diffuse shading, the camera that casts a ray per pixel, and a complete ray tracer that renders a lit sphere. It is embarrassingly parallel: one independent ray per pixel.
Shaders and Image Processing: A shader is a tiny program run per pixel, and image processing is shaders applied to images. Build per-pixel kernels over a coordinate grid, color operations (grayscale, invert, gamma, contrast), 2D convolution and its filters (blur, edge, sharpen), the apply-kernel-to-image loop, and a filter pipeline. Every output pixel is computed independently, which makes this an ideal GPU workload.
Particles and Physics on the GPU: Particle systems are the GPU's natural simulation workload: thousands of independent particles, each integrated forward in parallel. Build the particle state, Euler and Verlet integration, forces (gravity, drag, springs), an N-body gravitational simulation with the all-pairs kernel, and a fireworks capstone, all vectorized as data-parallel updates.
Capstone: A Complete GPU Renderer: The grand capstone of the track: assemble everything, the parallel model, transforms, rasterization, ray tracing, shading, and the data-parallel mindset, into a complete software renderer. Build the scene, transform geometry through the pipeline, rasterize with a z-buffer, shade with lighting, composite a full image, and reflect on how every stage is a GPU kernel.
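
To give a flavor of the thread-indexing idea described under The GPU Programming Model, here is a minimal sketch of a kernel launched over a grid of blocks and threads, simulated serially with NumPy. The names `saxpy_kernel` and `launch` are hypothetical, chosen for illustration; they are not taken from the course materials, and a real GPU would run the threads in parallel rather than in a Python loop.

```python
import numpy as np

def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """One simulated thread: computes a single element of out = a*x + y."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < out.size:                        # guard: the last block may overhang the array
        out[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical launcher: calls the kernel once per (block, thread) pair.
    The calls run one after another here; on a GPU they would run in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

n = 10
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n

x = np.arange(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

With 10 elements and 4 threads per block, the launch needs 3 blocks (12 threads), and the bounds check keeps the 2 surplus threads from writing past the end of the array.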

Key concepts

Arithmetic intensity: The ratio of compute (FLOPs) to memory traffic (bytes). The roofline model uses it to tell whether a kernel is compute- or memory-bound.
Kernel: A function launched across a grid of threads, each running the same code on its own data element. It is the basic unit of the GPU programming model.
Memory coalescing: When consecutive threads read consecutive addresses, the hardware merges the accesses into as few memory transactions as possible. This is the key to using memory bandwidth well.
N-body: Simulating mutual forces among N particles; the all-pairs version is O(N^2) and a classic compute-bound GPU workload.
Rasterization: Turning triangles into pixels: for each pixel, test coverage with barycentric coordinates and keep the nearest surface using a depth buffer (z-buffer).
Ray tracing: Rendering by shooting rays from the camera through pixels and intersecting scene geometry, then shading the hit point (e.g., with Lambertian diffuse shading).
Roofline model: A chart bounding achievable performance by peak compute and memory bandwidth versus arithmetic intensity, showing the limiting resource.
Shared memory / tiling: Fast on-chip memory a block shares; loading a tile once and reusing it (tiling) cuts slow global-memory traffic, as in tiled matmul.
SIMT: Single Instruction, Multiple Threads: the GPU runs groups of threads (warps) in lockstep on the same instruction, which is where its throughput comes from.
Thread, block, grid: The GPU launch hierarchy: threads are grouped into blocks, blocks into a grid. Each thread computes its global index to pick its data.
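
The arithmetic intensity and roofline entries above can be made concrete with a small calculation. This is a sketch under stated assumptions: the peak compute and bandwidth figures describe a made-up machine, the function names are hypothetical, and the matrix-multiply traffic assumes ideal reuse (each matrix element crosses the memory bus exactly once), which is the best case that tiling aims for.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def roofline(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * peak_bandwidth)

# Hypothetical machine, chosen only for illustration.
PEAK_FLOPS = 10e12   # 10 TFLOP/s
PEAK_BW = 500e9      # 500 GB/s
ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two limits meet

# float32 vector add: 1 FLOP per element, 12 bytes moved (two reads, one write).
ai_add = arithmetic_intensity(1, 12)

# float32 N x N matmul with ideal reuse: 2*N^3 FLOPs, three N x N matrices moved once.
N = 1024
ai_mm = arithmetic_intensity(2 * N**3, 3 * N * N * 4)

for name, ai in [("vector add", ai_add), ("matmul (ideal reuse)", ai_mm)]:
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    attainable = roofline(ai, PEAK_FLOPS, PEAK_BW)
    print(f"{name}: {ai:.3f} FLOPs/byte, {bound}, {attainable / 1e9:.1f} GFLOP/s attainable")
```

On this made-up machine the ridge point is 20 FLOPs per byte, so the vector add falls well below it (memory-bound) while the ideally tiled matmul falls well above it (compute-bound).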