Natural Language Processing

Teach a program to read. Build NLP from the ground up in Python and numpy: tokenizers and vocabularies, bag-of-words and TF-IDF search, n-gram language models that generate text, Naive Bayes and logistic-regression classifiers, edit distance and a spell-corrector, hidden Markov models with Viterbi for tagging, word embeddings from co-occurrence and SVD, dot-product attention, and a mini language-model capstone. No nltk or spaCy, you write the math, so you understand what every library is really doing.

11 projects, 275 hands-on levels, run in your browser.

Syllabus

Foundations: code through language: Never written code before? Start here. You will learn the basics of Python, output, variables, types, decisions, loops, and functions, through words, tokens, and text. By the end you are ready for Project 1.
Text Foundations: Before a model can learn from language it has to be turned into something countable. This project builds the front end of every NLP system: splitting raw text into tokens, normalizing them, building a vocabulary and mapping words to integer ids, forming n-grams, and measuring basic statistics of a corpus. Pure Python string work, but it is the ground every later project stands on.
Counting & Vectors: Once text is tokens, the first model is just counting. This project turns counts into the vectors every classical NLP method uses: word frequencies and the Zipf pattern they follow, bag-of-words vectors over a fixed vocabulary, term-frequency weighting, and similarity between documents. It ends by moving the same operations into numpy, the representation the rest of the track builds on.
TF-IDF & Search: Common words like 'the' are frequent but useless for telling documents apart. TF-IDF fixes this by weighting each word's frequency against how rare it is across the corpus, and it powers classic search and retrieval. This project builds inverse document frequency, the TF-IDF weighting that combines it with term frequency, cosine ranking of documents against a query, a small working search engine, and the vectorized numpy version.
N-gram Language Models: A language model assigns a probability to a sequence of words, and the n-gram model is the classic one: the probability of a word given the few that came before. This project builds bigram probabilities by maximum likelihood, the add-k smoothing that handles unseen pairs, the chain rule for whole-sentence probability, perplexity as the standard quality measure, and text generation by sampling the next word, the same machinery that scaled up becomes a modern language model.
Text Classification: Sorting text into categories, spam or not, positive or negative, is the workhorse task of applied NLP. This project builds two classic classifiers from scratch: multinomial Naive Bayes, with its class priors and smoothed word likelihoods scored in log space, and logistic regression with the sigmoid and a gradient step. It also builds the metrics that tell you whether a classifier is any good, accuracy, precision, recall, and F1, and wires them into a small sentiment classifier.
Edit Distance & Spelling: How different are two strings, and what did the user probably mean to type? This project builds edit distance with dynamic programming, the longest common subsequence and a similarity ratio, the four edit operations (insert, delete, replace, transpose) that generate spelling candidates, a frequency-based spell corrector in the style of Norvig's, and fuzzy matching that snaps a misspelling to the nearest known word.
Sequence Labeling with HMMs: Many NLP tasks label each word in a sentence: part of speech, named-entity type. The Hidden Markov Model is the classic tool, modeling a hidden chain of tags that each emit a word. This project builds an HMM from counts (transition, emission, and initial distributions), the forward algorithm that sums over all tag sequences to score a sentence, the Viterbi algorithm that finds the single best tag sequence, its application to part-of-speech tagging, and how to evaluate the result, all in numpy.
Word Embeddings: Words that appear in similar contexts have similar meanings, so a word can be represented by the company it keeps. This project builds embeddings from first principles: the co-occurrence matrix counting which words appear near which, the positive pointwise mutual information that turns counts into association strengths, dimensionality reduction with SVD to get dense vectors, cosine similarity between them, and the nearest-neighbor and analogy queries that made embeddings famous.
Attention: Attention is the idea that powers modern NLP: instead of a fixed window, every position can look at every other and decide what matters. This project builds the mechanism from scratch in numpy, the softmax that turns scores into weights, the dot-product scores between queries and keys, scaled dot-product attention that combines values, self-attention from learned query/key/value projections, and the positional encoding that gives a bag of vectors a sense of order, finishing with a single attention block's forward pass.
Capstone: A Mini NLP Engine: The finale assembles the track into one working system. You will build a preprocessing pipeline that cleans raw text into tokens and a vocabulary, a TF-IDF search that retrieves the most relevant document for a query, a Naive Bayes classifier that labels messages as spam or not, a bigram model that autocompletes the next word, and a dispatcher that routes a request to the right component. By the end you have a small but complete natural-language engine, built from scratch.

Key concepts

Accuracy: The fraction of predictions that are correct. Easy to read but misleading on imbalanced data, where always guessing the majority class scores high.
Attention: A mechanism where each position attends to every other by scoring queries against keys, softmaxing the scores into weights, and returning a weighted average of…
Bag of words: Representing a document as a vector of word counts over a vocabulary, discarding word order. Simple but a strong baseline for classification and search.
Bigram: An n-gram of size two: a pair of adjacent tokens. Bigram counts power the simplest language models.
Co-occurrence matrix: A matrix counting how often each word appears near each other word within a window. Its rows are raw context vectors, the starting point for embeddings.
Confusion matrix: A table of true positives, false positives, false negatives, and true negatives, from which precision, recall, and accuracy are computed.
Corpus: A body of text used to train or test a model, often a large collection of documents. The plural is corpora.
Cosine similarity: The cosine of the angle between two vectors, dot product over the product of norms. It measures direction, not magnitude, so it compares documents regardless o…
Cross-entropy: The average number of bits the model needs to encode each word of held-out text, the negative average log2 probability. Perplexity is 2 to this.
Distributional hypothesis: The idea that a word's meaning is captured by the company it keeps: words appearing in similar contexts have similar meanings. The foundation of embeddings.
Document: One unit of text in a corpus: a sentence, a review, an email, an article. Documents are what classifiers label and search engines retrieve.
Document-term matrix: A matrix whose rows are documents and columns are vocabulary words, each cell a count (or weight). The standard input to classical NLP methods.
Dynamic programming: Solving a problem by filling a table of subproblem answers and reusing them, the technique behind edit distance, LCS, and Viterbi.
Edit distance: The minimum number of single-character edits to turn one string into another. The Levenshtein version allows insertions, deletions, and substitutions.
Emission probability: In an HMM, the probability of a hidden state producing an observed word, P(word | tag). The rows of the emission matrix sum to one.
F1 score: The harmonic mean of precision and recall, 2PR / (P + R), a single number that balances the two and is robust to class imbalance.
Forward algorithm: An HMM dynamic program that SUMS over all hidden paths to compute the total probability of an observation sequence. Where Viterbi takes a max, forward takes a…
Fuzzy matching: Matching a string to the closest known string by edit distance, with a threshold to reject far-off inputs. Used in name matching and search-as-you-type.
Hapax legomenon: A word that appears exactly once in a corpus. Natural text is full of them, which is one reason vocabularies grow without bound.
Hidden Markov Model: A model of a hidden chain of states (tags) that each emit an observation (a word), defined by transition, emission, and initial probabilities. Decoded with Vit…
Information retrieval: Finding the documents most relevant to a query, classically by encoding both as vectors and ranking by cosine similarity. The basis of search engines.
Inverse document frequency (IDF): The log of the number of documents over how many contain a word. High for rare, discriminating words and near zero for words that appear everywhere.
Jaccard similarity: The size of the intersection over the size of the union of two sets, a simple measure of how much two documents' word sets overlap.
Language model: A model that assigns probabilities to sequences of words, or predicts the next word given the previous ones. From n-grams to transformers, the same idea scaled…
Laplace (add-one) smoothing: Adding one to every count before normalizing, so no probability is ever zero. Add-k generalizes it with a tunable amount.
Layer normalization: Standardizing each vector to zero mean and unit variance before a sub-layer, which stabilizes training. A standard ingredient of transformer blocks.
Lemmatization: Reducing a word to its dictionary form (ran becomes run) using vocabulary and grammar, more accurate than stemming but heavier.
Levenshtein distance: Edit distance allowing insertion, deletion, and substitution, computed by filling a dynamic-programming table. The standard string-similarity measure for spell…
Likelihood: How likely a word (or document) is given a class, P(word | class), estimated from per-class counts with smoothing. Multiplied across a document's words.
Logistic regression: A discriminative classifier that runs a weighted sum of features through the sigmoid to get a probability, trained by gradient descent. A strong text-classific…
Longest common subsequence: The longest sequence of characters appearing in both strings in order but not necessarily contiguously, found by dynamic programming. Basis of a similarity rat…
Maximum likelihood estimate: Estimating a probability directly from counts, for a bigram model, P(b | a) is count(a, b) over count(a). Simple but assigns zero to anything unseen.
N-gram: A contiguous window of n tokens. Unigrams are single words, bigrams are pairs, trigrams are triples, the basic unit of local context.
Naive Bayes: A classifier that scores each class as its prior times the product of word likelihoods, picking the highest. 'Naive' because it assumes words are condi…
Named-entity recognition: Finding and labeling spans of text that name entities, people, places, organizations, a key sequence-labeling task in information extraction.
Natural Language Processing: The field of getting computers to work with human language: tokenizing, classifying, translating, generating, and understanding text. NLP combines linguistics,…
Normalization: Cleaning tokens to reduce noise and vocabulary size: lowercasing, removing punctuation, collapsing whitespace, masking numbers. It makes different surface form…
Out-of-vocabulary (OOV): A token not in the model's vocabulary. OOV words are skipped or mapped to a special unknown token.
Part-of-speech tagging: Labeling each word with its grammatical category (noun, verb, adjective). A standard sequence-labeling task, often solved with an HMM and Viterbi.
Perplexity: The standard quality measure for a language model: the inverse probability of a test set normalized by length, equivalently 2 to the per-word cross-entropy. Lo…
Positional encoding: Vectors added to token embeddings to inject word order, since attention is otherwise order-blind. Classically built from sines and cosines of the position.
Positive PMI: Positive pointwise mutual information: log2(P(a,b) / (P(a)P(b))) clipped at zero. It turns co-occurrence counts into association strengths, downweighting frequ…
Precision: Of the items the classifier called positive, the fraction that actually are: TP / (TP + FP). High precision means few false alarms.
Prior probability: How likely a class is before seeing the document, estimated as the fraction of training documents with that label. The starting point of a Bayes score.
Recall: Of the actual positives, the fraction the classifier found: TP / (TP + FN). High recall means few misses.
Residual connection: Adding a sub-layer's input back to its output (x + f(x)), which helps gradients flow through deep networks. Used around attention and feed-forward layers.
Sampling: Choosing the next word at random in proportion to the model's probabilities, usually by inverse-transform sampling over the cumulative distribution. Gives…
Self-attention: Attention where the queries, keys, and values are all projections of the same sequence, so each position mixes information from all the others. The building bl…
Sentiment analysis: Classifying the emotional polarity of text, positive, negative, or neutral, a common application of text classification.
Sequence labeling: Assigning a label to each token in a sequence, like part-of-speech tags or entity types. Hidden Markov Models are the classic tool.
Sigmoid: The function 1 / (1 + e^-z) that squashes any real number into (0, 1), turning a score into a probability. The output nonlinearity of logistic regression.
Singular value decomposition: A matrix factorization that compresses a large sparse co-occurrence or PPMI matrix into short dense embedding vectors, keeping the directions of greatest varia…
Smoothing: Giving unseen n-grams a small nonzero probability so a single missing pair does not make a whole sentence impossible. Add-one (Laplace) and add-k are the simpl…
Softmax: A function that turns a vector of scores into a probability distribution, exp(x) over the sum of exp(x), computed stably by subtracting the max first. It produ…
Spell correction: Suggesting the intended word for a typo by generating candidates one or two edits away and keeping the most frequent known word.
Stemming: Crudely chopping a word down to a root by stripping suffixes (running becomes runn), so related forms collapse together. Faster but rougher than lemmatization.
Stopword: A very common, low-content word (the, a, is) often removed because it appears everywhere and tells you little about a document's topic.
Term frequency (TF): How often a word occurs in a document, often divided by the document length so long documents do not dominate.
Text classification: Assigning a label to a document, spam or not, positive or negative, the workhorse supervised task of NLP.
Text generation: Running a language model forward to produce text, picking the next word greedily (the most probable) or by sampling from its distribution.
TF-IDF: Term frequency times inverse document frequency: a weighting that scores a word highly only when it is frequent in this document but rare across the corpus. Th…
Token: One unit produced by tokenization, typically a word but sometimes a subword or character. A document becomes a sequence of tokens.
Tokenization: Splitting raw text into tokens, usually words, the first step of every NLP pipeline. Decisions like lowercasing and stripping punctuation happen here.
Transformer: The architecture built from stacked blocks of self-attention and feed-forward layers with residual connections and layer norm. It underlies modern large langua…
Transition probability: In an HMM, the probability of moving from one hidden state to the next, P(tag_t | tag_{t-1}). The rows of the transition matrix sum to one.
Type-token ratio: The number of distinct tokens (types) divided by the total number of tokens, a measure of lexical variety. High means varied vocabulary.
Vector analogy: The famous embedding property: king - man + woman lands near queen. The analogy vector b - a + c, whose nearest word completes 'a is to b as c is to ?'.
Viterbi algorithm: A dynamic program that finds the single most probable hidden sequence in an HMM by keeping the best path into each state at each step and tracing back with bac…
Vocabulary: The set of distinct tokens a model knows, usually mapped to integer ids. Words outside it are out-of-vocabulary (OOV).
Word embedding: A short dense vector representing a word, learned so that words used in similar contexts get similar vectors. Meaning becomes geometry.
Zipf's law: The empirical pattern that a word's frequency is roughly inversely proportional to its frequency rank: a few words are very common and a long tail is rare.