High-Performance Computing in Julia

Julia is fast, but only if you write it that way. This track builds the craft of high-performance computing in the language designed for it: type stability and why it lets the compiler generate tight code, memory layout and how to kill allocations, benchmarking that measures what matters, SIMD and vectorization, multithreading with race-free reductions, parallel map and reduce and scan, tasks and channels, the GPU programming model expressed over arrays, cache-friendly tiled numerics, and a high-performance data pipeline capstone. From a slow loop to code that saturates the machine.

11 projects, 275 hands-on levels, run in your browser.

Syllabus

Foundations: code through high-performance computing: Never written code before? Start here. You will learn the basics of Julia, functions, variables, types, decisions, loops, and arrays, with simple performance-minded examples. By the end you are ready for Project 1.
Type Stability & Performance: Julia is fast because it compiles specialized machine code for the concrete types it infers. When a function's return type cannot be predicted from its argument types, it is type-unstable, and the compiler falls back to slow, boxed, dynamic code. This project builds the foundation: reading types and inference, writing type-stable accumulators that return the element type they were given, keeping containers concretely typed, and using parametric methods and dispatch. Every level checks both the value and the type, the way you would confirm a kernel is fast.
Memory & Allocation: Allocation is where Julia programs lose their speed: every temporary array is a trip to the garbage collector. This project builds the habits that avoid it. Julia stores matrices column-major, so iterating the right way is cache-friendly. A view shares memory with its parent while a slice copies, which decides whether work allocates. Preallocating an output and writing into it with an in-place function, and fusing broadcasts so no temporaries appear, turn an allocating loop into one that touches no new memory. Every level is checked by value, and several by whether memory is shared.
Reductions & Scans: A reduction collapses a collection to one value; a scan produces the running results along the way. They are the workhorses of high-performance computing because, when the operation is associative, they split and recombine, which is exactly what lets them run in parallel. This project builds folds and reduce, prefix scans (cumulative sum, product, and max), tree reductions that exploit associativity for both speed and numerical accuracy, the mapreduce patterns that fuse a transform with a reduction, and a small reduction engine. Everything here is deterministic and sets up the parallel versions to come.
SIMD & Vectorization: A modern core can apply one instruction to several values at once, single instruction, multiple data. Julia exposes this with the @simd macro, which tells the compiler a reduction loop may be reordered and vectorized, and @inbounds, which drops bounds checks so the vectorizer can run. The other half is writing code the hardware likes: branch-free expressions that avoid misprediction, and fused multiply-add that does a multiply and an add in one rounded step. This project builds vectorizable reductions, branch-free kernels, Horner's method, and a fully vectorized numeric pass, every level graded on the value it computes.
Multithreading: A single core has limits; a machine has many. Julia's Base.Threads runs a loop across threads with @threads, and the trick to using it correctly is the same as in any language: avoid a shared mutable accumulator. This project builds parallel loops over disjoint output slots, race-free reductions that give each chunk its own partial result and combine them, atomic counters for genuinely shared state, and thread-local accumulation, finishing with a parallel reduction engine. Because each solution is correct for any number of threads, the tests assert exact values, the only honest way to test concurrency.
Parallel Algorithms: Some algorithms look sequential but have a parallel form. A prefix sum seems to need the previous result, yet a two-pass scheme, sum each chunk, prefix the chunk totals into offsets, then scan each chunk locally with its offset, parallelizes it. Filtering becomes count-offsets-scatter. Sorting reduces to merging sorted runs. This project builds those patterns explicitly: the parallel scan, parallel compaction, merging, parallel matrix reductions, and a pipeline that chains them. The structure is laid out by hand so you see exactly where the parallelism lives, and every result is checked against the obvious sequential answer.
Tasks & Channels: Below threads, Julia has Tasks: lightweight units of work you spawn and collect, and Channels: thread-safe queues for passing values between them. Threads.@spawn schedules a task on the thread pool and returns a handle; fetch waits for its result. A Channel decouples a producer that puts values from a consumer that takes them, with the producer closing the channel to signal it is done. This project builds task-based parallel maps and reductions, channel round-trips, producer-consumer flows, and a two-stage pipeline, every result made deterministic by fetching every task and consuming every item.
GPU-Style Kernels: A GPU runs the same kernel across thousands of threads, each identified by an index it uses to find its slice of the data. You do not need a GPU to learn the model: it is index arithmetic over arrays. This project builds elementwise kernels where each thread handles one element, the grid-and-block scheme that maps a global thread id to a data index (with a bounds guard for the leftover threads), grid-stride loops, two-level block reductions that mirror shared-memory reductions, and scatter and gather. Writing code in this shape is exactly what makes it portable to an actual GPU later, and every kernel here is checked by the values it produces.
Cache-Friendly Numerics: When the arithmetic is fixed, the remaining speed comes from the cache: code that walks memory in order runs many times faster than code that jumps around. This project builds the classic memory-bound kernels and the rearrangements that make them fast. Matrix multiply, naive and then tiled to reuse cached blocks. Stencils that sweep neighbors. The structure-of-arrays layout that keeps one field contiguous instead of scattering it through an array of structs. And the loop orderings that respect Julia's column-major storage. Every version computes the same answer, checked by value, so you can rearrange for locality with confidence.
Capstone: A High-Performance Pipeline: The finale assembles everything into one fast data pipeline over a numeric column: it transforms the data with a vectorized kernel, filters it, reduces it in parallel chunks, and reports statistics, type-stable and allocation-light throughout, and exact for any chunk count. You will build the dataset model, the transform stage, the filter stage, the parallel reduce stage, and finally the complete engine that chains them. By the end you have a small but real high-performance analytics engine, built from the type stability, memory discipline, vectorization, parallel reductions, and cache-aware numerics of the whole track.

Key concepts

@inbounds: A macro that removes array bounds checking in a loop, so the vectorizer is unobstructed. Safe only when the indices are provably valid; an out-of-bounds access…
Abstract type: A type used for dispatch and hierarchy (Number, Real, AbstractFloat) that cannot itself be instantiated. A container of an abstract type boxes its elements and…
Allocation: Reserving heap memory for a new array or boxed value. Every temporary array is work for the garbage collector, so high-performance Julia minimizes allocation w…
Array of structs: Storing records as a single array of multi-field elements, so the fields are interleaved in memory. Convenient, but a loop over one field strides past the othe…
Associativity: An operation is associative when grouping does not change the result: (a+b)+c equals a+(b+c). Associativity is what lets a reduction be split into chunks reduc…
Atomic: Threads.Atomic{T} is a shared value whose updates (atomic_add!, atomic_max!) are indivisible, so concurrent threads cannot lose them. Simpler than partials but…
Backpressure: When a bounded channel is full, put! blocks until the consumer makes room, naturally throttling a fast producer to the consumer's pace.
Bang convention (!): A trailing ! in a Julia function name signals that it mutates one of its arguments, usually the first. sort! sorts in place; push! appends.
Block reduction: A two-level reduction: each block reduces its tile into one partial value (the shared-memory step on a real GPU), then the per-block partials are reduced again…
Boxing: Storing a value behind a pointer with its type tag, as happens in an Any-typed container or a type-unstable variable. Boxed values force run-time type checks a…
Branch-free code: Code that avoids if inside a hot loop, using max, clamp, or mask arithmetic instead. Branches can stall the pipeline on misprediction and block vectorization;…
Broadcast fusion: Julia compiles a chain of dotted operations into one loop, so a .* b .+ c allocates no intermediate arrays. The key to allocation-light elementwise code.
Broadcasting: Applying an operation elementwise with a dot: xs .+ 1, sqrt.(xs). A chain of dots, or an @. expression, fuses into a single loop with no temporary arrays.
Cache: Fast memory close to the CPU that holds recently used data in lines. Code that walks memory in order reuses cache lines and runs far faster than code that jump…
Channel: A thread-safe queue connecting tasks: put! adds a value (blocking when full), take! removes one (blocking when empty), and close signals no more values, after…
Column-major order: Julia stores matrices column by column, so consecutive elements of a column are adjacent in memory. Iterating with the row index innermost is cache-friendly; v…
Concrete type: A type that can be instantiated, like Int or Float64. The compiler can generate fast code only for concrete types; abstract types (Number, Real) exist for disp…
Discrete Laplacian: The second-difference stencil xs[i-1] - 2*xs[i] + xs[i+1] (and its 2D 5-point form), approximating curvature. Used in diffusion and edge detection.
Dotted assignment: out .= a .+ b writes the fused elementwise result straight into a destination, with no temporary array. Combines broadcast fusion with in-place output.
eachindex: Yields the valid indices of a collection in the most efficient order and lets the compiler elide bounds checks. The idiomatic way to write a fast indexed loop.
eltype: The element type of a container: eltype([1.0, 2.0]) is Float64. Seeding an accumulator with zero(eltype(xs)) keeps a loop type-stable.
fetch: Blocks until a Task finishes and returns its result. Fetching every spawned task is what makes a task-based computation deterministic.
Fold: Threading an accumulator through a collection with a binary operation and an initial value. foldl goes left to right, foldr right to left; reduce is the same w…
Fused multiply-add: muladd(a, b, c) computes a*b + c, and on supporting hardware in a single rounded step that is both faster and more accurate. Horner's method evaluates poly…
Garbage collector: The runtime component that reclaims unused heap memory. Frequent allocation makes it run often, pausing computation, which is why reducing allocation is centra…
Gather: Reading an array at a list of indices: out[k] = xs[idx[k]]. Indirect reads appear in sparse operations, sorting, and permutations.
Global thread index: A thread's position across the whole grid: (block - 1) * blockdim + thread, all one-based. Each thread uses it to find the element it processes.
Grid and block: A GPU launch is a grid of blocks, each a group of threads. The grid is sized in whole blocks, so leftover threads past the data must guard their writes with a…
Grid-stride loop: When there are fewer threads than elements, each thread strides through the array by the total number of threads, covering tid, tid+total, tid+2*total, ... Tog…
Horner's method: Evaluating a polynomial as a nested chain of multiply-adds from the highest coefficient down, minimizing operations and rounding. acc = acc*x + c at each step.
In-place operation: A function (named with a trailing !) that writes its result into an array the caller provides instead of allocating a new one. A reused buffer turns repeated a…
Just-in-time compilation: Julia compiles each method to native code the first time it is called with a given set of argument types, then caches it. The first call pays compilation; late…
Kernel: In the GPU model, the function run by every thread, which uses its index to find its slice of the data. Over plain Julia arrays a kernel is just a loop where e…
Locality: Accessing memory in a contiguous, predictable order so the cache stays useful. In column-major Julia, an inner loop over the first (row) index is cache-friendl…
Loop order: The order of nested loops, which changes speed but not the result (for exact arithmetic): one order walks memory contiguously, another strides across it. Colum…
mapreduce: Fusing a transform with a reduction in one pass: map a function over each element, then combine with an operation. sum of squares is mapreduce(x -> x^2, +,…
Matrix multiply: The canonical cache problem: each output element needs a row and a column, and the loop order and tiling decide how much cached data is reused. The arithmetic…
Merge: Combining two sorted arrays into one sorted array with a two-pointer walk. Parallel sort sorts independent chunks and then merges the sorted runs.
Multiple dispatch: Julia selects a method by the types of all its arguments, and compiles a specialized version for each concrete combination. The mechanism behind both generic c…
nthreads: Threads.nthreads() reports how many threads Julia was started with (at least 1); Threads.threadid() gives the current thread's id. Set the count with the -…
Pairwise summation: Summing by recursively halving the array, which gives the same result as a naive sum for exact arithmetic but markedly less floating-point rounding error. Juli…
Parallel compaction: Filtering in parallel: count the matches per chunk, prefix the counts into write offsets, then each chunk scatters its kept elements into its own region of the…
Parallel scan: Computing a prefix sum in parallel in two passes: reduce each chunk to its total, take the exclusive prefix of the totals as per-chunk offsets, then each chunk…
Parallel sort: Sort each chunk independently (in parallel), then merge the sorted runs into one. Built on the merge of two sorted arrays.
Parametric method: A method written generically with a type parameter, f(x::T) where T, that the compiler specializes per concrete T. Generic source, fast specialized code.
Partial results: Per-chunk accumulators in a parallel reduction: each thread reduces its own slice into its own slot with no sharing, and the partials are combined once at the…
Permutation: Reordering an array by a permutation vector p: out[k] = xs[p[k]], a gather by p. Reversing is permuting by the reversed index range.
Preallocation: Allocating an output buffer once and reusing it across calls, so a hot loop allocates nothing. The pattern behind in-place kernels.
Prefix sum: The running total at each position of a vector. Inclusive forms include the current element; exclusive forms sum everything before it. Its parallel form is the…
Producer-consumer: A producer task puts values into a channel and a consumer takes them out, decoupling their rates with the channel's capacity providing backpressure. The pr…
Race condition: When the result depends on the unpredictable timing of threads sharing mutable state, typically a shared accumulator losing updates. The fix is per-chunk parti…
Reduction: Collapsing a collection to one value with a binary operation: sum, maximum, prod. When the operation is associative the reduction can be split across chunks an…
Scan: A prefix operation that returns every partial reduction, not just the last. cumsum is the prefix sum. A scan parallelizes in two passes: chunk totals, then loc…
Scatter: Writing values to positions given by indices: out[idx[k]] = vals[k]. When indices repeat, scatter-add accumulates; on a real GPU the add must be atomic.
SIMD: Single Instruction, Multiple Data: one CPU instruction applied to several values at once. The @simd macro permits the compiler to reorder and vectorize a reduc…
Slice: Indexing a range like xs[2:4], which allocates a fresh copy of that sub-array. Wrap it in @view to share memory instead.
Specialization: Julia compiles a separate, optimized version of a function for each concrete combination of argument types it is called with. This is why type-stable code is f…
Stencil: An update where each point is computed from its neighbors, swept over the array in order. Smoothing filters, discrete Laplacians, and explicit heat-equation st…
Structure of arrays: Storing each field of a record in its own contiguous array, so a loop over one field streams the cache. The opposite, an array of structs, interleaves fields a…
Task: A lightweight unit of work scheduled on Julia's thread pool. Threads.@spawn starts one and returns a handle; fetch waits for and returns its value.
Thread-local accumulation: Giving each thread its own accumulator (or array) that it alone touches, then merging once at the end. A per-chunk partials array is exactly this, and it gener…
Threads.@spawn: Schedules an expression as a Task on the thread pool and returns immediately with a handle. Spawning many and fetching all is a task-based parallel map or redu…
Threads.@threads: A macro that splits a for-loop's iterations across the available threads. Safe when each iteration writes its own output slot; a shared mutable accumulator…
Tiling (blocking): Processing a matrix in small blocks that fit in cache, reusing each cached block across the tile before moving on. Tiled matrix multiply computes the same answ…
Tree reduction: Reducing by recursively splitting the data, reducing each half, and combining, rather than a strict left-to-right fold. Exposes parallelism and, for floating p…
Type inference: The compiler's process of deducing the type of every value in a function from its inputs. When inference yields a concrete type at each step, the function…
Type promotion: Combining values of different types promotes them to a common type via promote_type; that is why 1 + 2.0 is a Float64. A loop that promotes on every iteration…
Type stability: A function is type-stable when its return type can be predicted from the types of its arguments. Julia then compiles tight, specialized machine code; without i…
Vectorization: Compiling a loop into wide SIMD instructions that process multiple elements per step. Requires a simple, reorderable loop body, which is why branch-free code v…
View: A window onto an existing array's memory (@view), sharing storage rather than copying. Operating on a sub-range through a view allocates nothing; a plain s…
zero and one: zero(T) and one(T) give the additive and multiplicative identities of a type. Using zero(eltype(xs)) to start an accumulator keeps it the right type instead of…