What attention really is, explained by coding it
A new language model launches roughly every two days right now. They have different names and different benchmarks, but under the hood they are almost all the same architecture: a transformer. And the one operation that makes a transformer a transformer is attention. Learn it once and the entire zoo of models stops being magic.
The good news is that attention is small. Not simplified-small, actually-small. You can write the real operation in about fifteen lines of numpy, and it is the same math running inside the model that wrote your last autocomplete.
The one idea: a soft lookup
Think of a Python dictionary. You have a key, you look it up, you get the matching value. It is a hard lookup: your key either matches or it doesn't.
Attention is a soft lookup. Every item offers a key and a value. You show up with a query. Instead of taking the one value whose key matches, you take a blend of all the values, weighted by how well each key matches your query. A key that matches a lot contributes a lot; a key that barely matches contributes a little.
That is the whole concept. Everything else is how we measure "matches" and how we turn those match scores into weights that sum to 1.
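Before any scoring math, here is a minimal sketch of the hard-versus-soft contrast. The table contents and the match weights are made up for illustration; in real attention the weights are computed, not hand-picked.

```python
import numpy as np

# Hard lookup: the key either matches exactly or the lookup fails.
table = {"cat": 10.0, "dog": 20.0, "car": 30.0}  # hypothetical data
hard_result = table["cat"]  # exactly one value comes back

# Soft lookup: every value contributes, scaled by how well its key matched.
# These weights are hand-picked for the demo; they are non-negative and sum to 1.
values = np.array([10.0, 20.0, 30.0])
match_weights = np.array([0.7, 0.2, 0.1])
soft_result = match_weights @ values  # 0.7*10 + 0.2*20 + 0.1*30, about 14

print(hard_result, soft_result)
```

The soft result is not any single stored value. It is a weighted average that sits closest to the best-matching one.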
Measuring a match: the dot product
For vectors, the simplest "how aligned are these" score is the dot product. Two vectors pointing the same way score high, perpendicular vectors score zero, and vectors pointing in opposite directions score negative. So the match between a query and a key is just query · key.
If we stack several queries as the rows of a matrix Q and several keys as the rows of a matrix K, all the query-key matches come out of one matrix multiply:
import numpy as np
scores = Q @ K.T # shape (num_queries, num_keys)
Row i of scores is how well query i matches each key.
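To make the scores concrete, here is a small sketch with hand-picked 2-d vectors (the vectors are illustrative assumptions, chosen so the arithmetic is easy to check by eye).

```python
import numpy as np

q = np.array([1.0, 0.0])

same_direction = np.array([2.0, 0.0])
perpendicular = np.array([0.0, 3.0])
opposite = np.array([-2.0, 0.0])

print(q @ same_direction)  # 2.0
print(q @ perpendicular)   # 0.0
print(q @ opposite)        # -2.0

# The same three scores in one shot: one query row against three key rows.
Q = q[None, :]                                      # shape (1, 2)
K = np.stack([same_direction, perpendicular, opposite])  # shape (3, 2)
scores = Q @ K.T                                    # shape (1, 3)
print(scores)
```

The single matrix multiply gives the same numbers as the three separate dot products, laid out as one row per query.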
Turning scores into weights: softmax
Raw scores are not weights; they can be negative or huge. We want, for each query, a set of non-negative weights over the keys that sum to 1. That is exactly what softmax does: exponentiate, then normalize.
def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)
Two details that matter:
- Subtracting the max changes nothing mathematically (it cancels in the ratio) but stops exp from overflowing on large scores. Skip it and big logits give you inf. This single line is the difference between "works" and "NaN."
- We normalize along the last axis, so each query's weights over the keys sum to 1.
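The overflow point is easy to see for yourself. The sketch below compares a softmax without the max subtraction against the stable version on deliberately large scores. The name naive_softmax and the input values are assumptions made for the demo.

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)  # overflows to inf once x is larger than roughly 709 in float64
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # largest entry becomes 0
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow / invalid-value warnings so the output stays readable.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big)
stable = stable_softmax(big)

print(naive)   # every entry is nan, because inf / inf is nan
print(stable)  # valid weights that sum to 1
```

Shifting every score by the same constant leaves the result unchanged, so stable_softmax(big) matches what you would get from the scores [0, 1, 2].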
The whole operation
Now combine the two halves and use the weights to blend the values. There is one extra detail, the scaling factor, which we will justify in a second:
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # blend the values
That is it. That is "scaled dot-product attention," the operation named in "Attention Is All You Need," the 2017 paper that started all of this. Let us see it move:
K = np.array([[1, 0], [0, 1], [1, 1]], dtype=float) # 3 keys
V = np.array([[10, 0], [0, 10], [5, 5]], dtype=float) # their values
Q = np.array([[1, 0]], dtype=float) # one query, aligned with key 0
print(attention(Q, K, V))
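The query points along the first key's direction, but it matches the third key [1, 1] just as well: both dot products are 1, while the second key scores 0. The weights come out to roughly 0.40, 0.20, 0.40, and the output is about [6.02, 3.98], a blend that leans toward [10, 0] without being dominated by it. Change the query to [0, 1] and the output swings to about [3.98, 6.02], leaning toward [0, 10]. The query is steering a weighted average over the values. That steering, learned and stacked, is most of what a transformer does.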
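To see where the blend comes from, it can help to print the weights next to the output. This is a sketch: attention_with_weights is a hypothetical helper introduced here for inspection, doing the same math as attention but also returning the weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_weights(Q, K, V):
    # Same computation as attention(), plus the weights for inspection.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores)
    return weights @ V, weights

K = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)    # 3 keys
V = np.array([[10, 0], [0, 10], [5, 5]], dtype=float)  # their values

for query in ([1, 0], [0, 1]):
    Q = np.array([query], dtype=float)
    out, weights = attention_with_weights(Q, K, V)
    print(query, np.round(weights, 2), np.round(out, 2))
# query [1, 0]: weights about [0.4, 0.2, 0.4], output about [6.02, 3.98]
# query [0, 1]: weights about [0.2, 0.4, 0.4], output about [3.98, 6.02]
```

The key [1, 1] matches both queries equally well, which is why neither output gets all the way to [10, 0] or [0, 10].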
Why divide by √d_k
As the vector dimension grows, dot products grow too, just from summing more terms. If the components of the query and key are independent with zero mean and unit variance, their dot product has variance d_k, so typical scores grow like √d_k. Large scores push softmax into a corner where one weight is ~1 and the rest are ~0, and its gradient vanishes, so learning stalls. Dividing by √d_k brings the scores back to roughly unit scale regardless of dimension. It is a small constant with a big effect on whether the thing trains at all.
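A quick numerical check of the scaling argument, as a sketch. It assumes query and key components are independent standard normal draws, which is a convenient assumption for the demo and not a claim about real model activations. The sample count and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 2000

results = {}
for d_k in (4, 64, 512):
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    dots = (q * k).sum(axis=-1)          # one dot product per (q, k) pair
    raw_std = dots.std()                 # expected to be near sqrt(d_k)
    scaled_std = (dots / np.sqrt(d_k)).std()  # expected to be near 1
    results[d_k] = (raw_std, scaled_std)
    print(d_k, round(raw_std, 2), round(scaled_std, 2))
```

The raw spread should climb from about 2 to about 8 to about 23 as d_k goes from 4 to 64 to 512, while the scaled spread stays close to 1 throughout.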
Self-attention: where Q, K, and V come from
So far Q, K, and V fell from the sky. In a real transformer they all come from the same input sequence, each through its own learned projection. That is "self-attention": every position looks at every other position in the same sequence and pulls in what is relevant.
def self_attention(X, Wq, Wk, Wv):
    Q = X @ Wq
    K = X @ Wk
    V = X @ Wv
    return attention(Q, K, V)
X = np.random.randn(4, 8) # 4 tokens, each an 8-dim vector
Wq = np.random.randn(8, 8) # learned projections (random here)
Wk = np.random.randn(8, 8)
Wv = np.random.randn(8, 8)
print(self_attention(X, Wq, Wk, Wv).shape) # (4, 8): one new vector per token
Each token comes out as a blend of all the tokens, weighted by learned relevance. In a trained model those W matrices have learned that, say, a pronoun should attend to the noun it refers to. The output for "it" becomes mostly the value of "the cat." That is how context flows between words.
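The "who attends to whom" picture is just the weight matrix. As a sketch, the hypothetical helper self_attention_weights below returns that matrix instead of the blended output, using random projections as stand-ins for learned ones.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_weights(X, Wq, Wk):
    # Row i holds how much token i attends to each token in the sequence.
    Q = X @ Wq
    K = X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, each an 8-dim vector
Wq = rng.standard_normal((8, 8))  # random stand-ins for learned projections
Wk = rng.standard_normal((8, 8))

weights = self_attention_weights(X, Wq, Wk)
print(weights.shape)         # (4, 4): one row per token, one column per token
print(weights.sum(axis=-1))  # each row sums to 1, up to floating-point rounding
```

In a trained model, the row for "it" would put most of its weight in the columns for "the cat." With random projections the pattern is meaningless, but the shape and the row sums are the same.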
From here to a real model
Stack a few honest extensions on this core and you have a transformer:
- Multiple heads: run several attentions in parallel with different projections, so the model can attend to different kinds of relationships at once, then concatenate.
- A feed-forward layer after attention, and residual connections plus layer norm around both.
- Positional information, because plain attention has no sense of order, so position has to be added in.
- A causal mask for text generation, so a token can only attend to earlier tokens.
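Two of the extensions above, the causal mask and multiple heads, can be sketched in a few lines each. This is one plausible way to write them under stated assumptions, not a reference implementation. The function names, the output projection Wo, and the choice to split the projected columns evenly across heads are all introduced here for illustration, and d_model is assumed to divide evenly by num_heads.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i only sees positions <= i."""
    n = Q.shape[0]  # assumes Q and K come from the same sequence
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True strictly above the diagonal marks "future" positions to hide.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # exp(-inf) is 0 in softmax
    weights = softmax(scores)
    return weights @ V, weights

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Attend separately on each slice of the projections, then concatenate."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads  # assumes an even split
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        out, _ = causal_attention(Q[:, cols], K[:, cols], V[:, cols])
        heads.append(out)
    # Wo (an assumed output projection) mixes information across heads.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) for _ in range(4))

_, w = causal_attention(X @ Wq, X @ Wk, X @ Wv)
print(np.round(w, 2))  # everything above the diagonal is 0

out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads=2)
print(out.shape)       # (4, 8)
```

The first row of the masked weights is [1, 0, 0, 0]: the first token has nothing earlier to look at, so it attends only to itself.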
Every one of those is a small addition to the fifteen lines above. The headline-grabbing models are this operation, scaled up and stacked deep, trained on a lot of text. If you want to build the rest of it (the multi-head version, the mask, the full block), that is the path through the NLP from scratch track on IWTLP, where you build the engine instead of importing it.