Deep Learning with Python

Build modern AI from scratch: neural networks, backprop, CNNs, transformers, a tiny GPT.

11 projects, 275 hands-on levels, run in your browser.

Syllabus

Foundations: Code Through AI: Never written code before? Start here. You will learn the absolute basics of Python (output, variables, types, decisions, loops, and functions), using neural networks, weights, and predictions as your playground. By the end you are ready for Project 1.
Foundations: From Regression to the Neuron: Deep learning is built on two ideas you can learn from scratch: a model that makes predictions, and gradient descent that tunes it from data. Start with linear regression, sharpen gradient descent, then discover that a single artificial neuron with a sigmoid activation is just logistic regression. By the end you will train a neuron to learn a logic gate, the atom of every neural network.
Neural Networks: One neuron is logistic regression; a network is many of them, stacked into layers. Build the dense layer as a matrix operation, chain layers into a multilayer perceptron, add softmax for multi-class outputs, and discover why depth matters by solving XOR, the problem a single neuron famously cannot solve. By the end you have a general feed-forward network.
Backpropagation and Autograd: Backpropagation is the algorithm that lets a network learn: it computes the gradient of the loss with respect to every weight by applying the chain rule backward through the layers. Build it from the chain rule up, through a neuron, through a full network, then build a tiny autograd engine, and finally train the XOR network to learn its own weights.
Training Neural Networks: A network learns only as well as the procedure that updates its weights. Build the loss functions that score predictions, the optimizers that turn gradients into weight updates (SGD, momentum, RMSprop, Adam), and the training loop that ties them together over mini-batches and epochs. By the end you train a small network with Adam and watch the loss fall.
Deep Classification: Move from fitting numbers to making decisions. Build the softmax classifier and its clean cross-entropy gradient, train a multi-class network on real data, then face the central problem of machine learning: overfitting. Build L2 regularization and dropout to fight it, and the train/validation split and metrics that reveal it. The capstone trains and honestly evaluates a regularized classifier.
Convolutional Networks: Images have spatial structure that dense layers ignore. Convolutional networks exploit it by sliding small learnable filters across the image to build feature maps. Build convolution from scratch, the edge-detecting filters it learns, pooling that summarizes regions, a convolutional layer with its backward pass, and finally a tiny CNN that learns to classify images from raw pixels.
Sequences and Recurrent Networks: Text, audio, and time series are sequences, and their meaning depends on order. Recurrent networks process them one step at a time while carrying a hidden state that remembers the past. Build the RNN cell, unroll it over a sequence, derive backpropagation through time, confront the vanishing-gradient problem, and train a character-level RNN that learns a pattern and generates it.
Tokenization and Embeddings: Before a model can read text it must turn it into numbers. Tokenization splits text into units and maps them to integer ids; embeddings turn those ids into dense, learnable vectors that place similar tokens near each other. Build character, word, and byte tokenizers, the embedding table and its gradient, vector similarity and analogies, and train embeddings that learn meaning from a task.
Attention and Transformers: Attention lets every token look directly at every other token and decide what is relevant, removing the recurrent bottleneck entirely. Build scaled dot-product attention from queries, keys, and values; add causal masking and multiple heads; add positional encoding so order matters; and assemble the transformer block with layer norm, residual connections, and a feed-forward sublayer. This is the architecture behind nearly every modern large language model.
Capstone: A Tiny GPT: The final project assembles everything into a working GPT: a character-level transformer language model you train end to end with your own backpropagation, then sample from to generate text. Build the full forward pass, derive the backward pass piece by piece (loss, feed-forward, attention), train it to model a tiny corpus, and watch it generate the text it learned. This is a real GPT, only small.
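
To give a feel for the "train a neuron to learn a logic gate" milestone in the syllabus above, here is a minimal sketch in plain Python. It is not taken from the course itself: the names train_neuron and AND_GATE, the choice of the AND gate, the learning rate, and the epoch count are all illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(data, lr=0.5, epochs=5000, seed=0):
    """Fit one sigmoid neuron with per-example gradient descent.

    Hypothetical helper for illustration; lr and epochs are assumed values.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    w = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    b = 0.0
    for _ in range(epochs):
        for (x0, x1), y in data:
            p = sigmoid(w[0] * x0 + w[1] * x1 + b)
            # For sigmoid + cross-entropy loss, the gradient with respect
            # to the weighted sum is simply (prediction - target).
            err = p - y
            w[0] -= lr * err * x0
            w[1] -= lr * err * x1
            b -= lr * err
    return w, b


# Truth table for AND: the output is 1 only when both inputs are 1.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w, b = train_neuron(AND_GATE)
predictions = [
    1 if sigmoid(w[0] * x0 + w[1] * x1 + b) > 0.5 else 0
    for (x0, x1), _ in AND_GATE
]
print(predictions)
```

AND is linearly separable, so a single neuron can learn it. Swapping in the XOR truth table would leave at least one row misclassified, which is the limitation the Neural Networks project addresses by adding a hidden layer.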

Key concepts

Activation function: A nonlinearity (ReLU, sigmoid, tanh) applied to a neuron's output, letting networks model nonlinear relationships.
Attention: A mechanism that lets a model weigh which other positions in a sequence to focus on; the heart of transformers.
Backpropagation: The algorithm that computes the gradient of the loss with respect to every weight by applying the chain rule backward through the network.
Batch: A group of examples processed together before one parameter update. Batches make training faster and produce a gradient estimate that is noisy enough to be use…
Convolution (CNN): Sliding a small kernel over an image to detect local patterns; the core operation of convolutional networks.
Cross-entropy: A classification loss that penalizes low probability assigned to the correct class. It is commonly paired with softmax because softmax turns raw scores into cl…
Embedding: A learned vector representation of a discrete item such as a word, token, user, or category. Similar items tend to end up near each other, which lets neural ne…
Epoch: One full pass through the training dataset. Multiple epochs let the model refine its parameters, but too many can cause overfitting if validation performance s…
Gradient descent: Repeatedly updating weights in the direction of the negative gradient, the direction that most reduces the loss locally, scaled by a learning rate.
Learning rate: The step size used when updating parameters from gradients. Too small trains slowly; too large can overshoot, diverge, or bounce around the minimum.
Loss function: A measure of prediction error (e.g., cross-entropy, MSE) that training minimizes.
Neuron: A unit that computes a weighted sum of inputs plus a bias, then applies an activation function; networks stack many.
Overfitting: When a model learns the training examples too specifically and performs worse on new data. It often appears as training loss improving while validation loss st…
Parameter: A learnable number inside a model, such as a weight or bias. Training changes parameters; hyperparameters like learning rate or batch size are chosen outside t…
Random seed: A value used to initialize pseudorandom choices such as weight initialization, shuffling, or sampling. Setting a seed makes experiments easier to reproduce, th…
Recurrent network (RNN): A network with a hidden state carried across a sequence, letting it model order and context.
Softmax: Turns a vector of scores into a probability distribution (positive, summing to 1), used for classification outputs.
Tensor: A multidimensional numeric array used to hold inputs, activations, parameters, gradients, and batches. A vector of token ids, a batch of images, and a weight m…
Tokenization: Breaking text into model-readable pieces called tokens. In language models, tokenization defines the vocabulary, sequence length, and the units that embeddings…
Transformer: An architecture built on self-attention (no recurrence) that powers modern language models like GPT.
Weights and bias: The learnable parameters of a network; training adjusts them to reduce the loss.
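
Several of the concepts above (softmax, cross-entropy, loss function, gradient descent) meet in a single calculation, so a small worked sketch in plain Python may help. The function names, the example scores, and the finite-difference check are illustrative assumptions rather than course code.

```python
import math


def softmax(scores):
    """Turn raw scores into positive probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss is the negative log of the probability given to the true class."""
    return -math.log(probs[target])


def grad_wrt_scores(probs, target):
    """Gradient of softmax + cross-entropy: probabilities minus one-hot target."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


scores = [2.0, 1.0, 0.1]  # made-up raw outputs for three classes
target = 0                # index of the correct class (assumed)

probs = softmax(scores)
loss = cross_entropy(probs, target)
grad = grad_wrt_scores(probs, target)

# Sanity check: nudge each score slightly and measure how the loss changes.
eps = 1e-6
numeric = []
for i in range(len(scores)):
    bumped = list(scores)
    bumped[i] += eps
    numeric.append((cross_entropy(softmax(bumped), target) - loss) / eps)

print(probs, loss, grad)
```

The analytic gradient and the finite-difference estimate should agree closely. The gradient entry for the correct class is negative, so a gradient descent step raises that class's score and lowers the others.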