Deep Learning with Python
Build modern AI from scratch: neural networks, backprop, CNNs, transformers, a tiny GPT.
10 projects, 250 hands-on levels, run in your browser.
Syllabus
- Foundations: From Regression to the Neuron: Deep learning is built on two ideas you can learn from scratch: a model that makes predictions, and gradient descent that tunes it from data. Start with linear regression, sharpen gradient descent, then discover that a single artificial neuron is just logistic regression. By the end you will train a neuron to learn a logic gate, the atom of every neural network.
- Neural Networks: One neuron is logistic regression; a network is many of them, stacked into layers. Build the dense layer as a matrix operation, chain layers into a multilayer perceptron, add softmax for multi-class outputs, and discover why depth matters by solving XOR, the problem a single neuron famously cannot solve. By the end you have a general feed-forward network.
- Backpropagation and Autograd: Backpropagation is the algorithm that lets a network learn: it computes the gradient of the loss with respect to every weight by applying the chain rule backward through the layers. Build it from the chain rule up, through a neuron, through a full network, then build a tiny autograd engine, and finally train the XOR network to learn its own weights.
- Training Neural Networks: A network learns only as well as the procedure that updates its weights. Build the loss functions that score predictions, the optimizers that turn gradients into weight updates (SGD, momentum, RMSprop, Adam), and the training loop that ties them together over mini-batches and epochs. By the end you train a small network with Adam and watch the loss fall.
- Deep Classification: Move from fitting numbers to making decisions. Build the softmax classifier and its clean cross-entropy gradient, train a multi-class network on real data, then face the central problem of machine learning: overfitting. Build L2 regularization and dropout to fight it, and the train/validation split and metrics that reveal it. The capstone trains and honestly evaluates a regularized classifier.
- Convolutional Networks: Images have spatial structure that dense layers ignore. Convolutional networks exploit it by sliding small learnable filters across the image to build feature maps. Build convolution from scratch, the edge-detecting filters it learns, pooling that summarizes regions, a convolutional layer with its backward pass, and finally a tiny CNN that learns to classify images from raw pixels.
- Sequences and Recurrent Networks: Text, audio, and time series are sequences, and their meaning depends on order. Recurrent networks process them one step at a time while carrying a hidden state that remembers the past. Build the RNN cell, unroll it over a sequence, derive backpropagation through time, confront the vanishing-gradient problem, and train a character-level RNN that learns a pattern and generates it.
- Tokenization and Embeddings: Before a model can read text it must turn it into numbers. Tokenization splits text into units and maps them to integer ids; embeddings turn those ids into dense, learnable vectors that place similar tokens near each other. Build character, word, and byte tokenizers, the embedding table and its gradient, vector similarity and analogies, and train embeddings that learn meaning from a task.
- Attention and Transformers: Attention lets every token look directly at every other token and decide what is relevant, removing the recurrent bottleneck entirely. Build scaled dot-product attention from queries, keys, and values; add causal masking and multiple heads; add positional encoding so order matters; and assemble the transformer block with layer norm, residual connections, and a feed-forward sublayer. This is the architecture behind nearly every modern large language model.
- Capstone: A Tiny GPT: The final project assembles everything into a working GPT: a character-level transformer language model you train end to end with your own backpropagation, then sample from to generate text. Build the full forward pass, derive the backward pass piece by piece (loss, feed-forward, attention), train it to model a tiny corpus, and watch it generate the text it learned. This is a real GPT, only small.
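To give a flavor of where the syllabus starts, here is a minimal sketch of the Foundations goal: a single sigmoid neuron trained with gradient descent to learn the AND gate. It is an illustration only, not the course's own code; the names `train_neuron`, `predict`, and `AND_GATE`, along with the learning rate and epoch count, are assumptions chosen for this example.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(data, lr=0.5, epochs=5000):
    """Fit one sigmoid neuron with batch gradient descent on cross-entropy loss."""
    w1, w2, b = 0.0, 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in data:
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            err = p - y  # gradient of the loss w.r.t. the weighted sum
            g1 += err * x1
            g2 += err * x2
            gb += err
        # Step against the average gradient, scaled by the learning rate.
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b


def predict(params, x1, x2):
    """Threshold the neuron's probability at 0.5 to get a 0/1 output."""
    w1, w2, b = params
    return 1 if sigmoid(w1 * x1 + w2 * x2 + b) >= 0.5 else 0


# Truth table for AND: (inputs, target). Hypothetical name for this sketch.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

params = train_neuron(AND_GATE)
predictions = [predict(params, x1, x2) for (x1, x2), _ in AND_GATE]
print(predictions)
```

Because AND is linearly separable, the trained neuron should reproduce the truth table, `[0, 0, 0, 1]`. Swapping in the XOR truth table is a quick way to see why a single neuron is not enough and layers are needed.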
Key concepts
- Activation function: A nonlinearity (ReLU, sigmoid, tanh) applied to a neuron's weighted sum, letting networks model nonlinear relationships.
- Attention: A mechanism that lets a model weigh which other positions in a sequence to focus on; the heart of transformers.
- Backpropagation: The algorithm that computes the gradient of the loss with respect to every weight by applying the chain rule backward through the network.
- Convolution (CNN): Sliding a small kernel over an image to detect local patterns; the core operation of convolutional networks.
- Gradient descent: Updating weights by stepping against the gradient of the loss, the direction that locally reduces the loss fastest, with the step size scaled by a learning rate.
- Loss function: A measure of prediction error (e.g., cross-entropy, MSE) that training minimizes.
- Neuron: A unit that computes a weighted sum of inputs plus a bias, then applies an activation function; networks stack many.
- Recurrent network (RNN): A network with a hidden state carried across a sequence, letting it model order and context.
- Softmax: Turns a vector of scores into a probability distribution (positive, summing to 1), used for classification outputs.
- Transformer: An architecture built on self-attention (no recurrence) that powers modern language models like GPT.
- Weights and bias: The learnable parameters of a network; training adjusts them to reduce the loss.
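Two of the concepts above, softmax and the cross-entropy loss, fit in a few lines of plain Python. The sketch below is one common way to write them, not a reference implementation; the function names and the example scores are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to positive probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss for one example: negative log-probability assigned to the true class."""
    return -math.log(probs[target])


# Hypothetical raw scores for a three-class problem; class 0 is the true label.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, 0)
print(probs, loss)
```

The largest score receives the largest probability, and the loss shrinks toward zero as the probability of the true class approaches 1.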