Why AI is getting smaller, not just bigger

Most AI headlines are about models getting bigger. But June 2026 had a quieter trend running the other way: small models that run on your own machine. Microsoft shipped Aion-1.0 to run in the browser on devices without a GPU, along with a 5-billion-parameter coding model, MAI-Code-1-Flash, pitched as comparable to a much larger model at lower cost. The interesting frontier right now is not "how big" but "how small can we go and still be useful."

This is worth understanding because it changes where AI runs and what it costs. A model on your laptop has no per-token bill and no network latency, and your data never leaves the device. The question is how a small model can be good enough, and what you give up.

The one idea: most of a big model is overkill for most tasks

A frontier model is a generalist that can write poetry, debug Rust, and explain tax law. Most real tasks need a sliver of that. A small language model (SLM) is the bet that a few billion parameters, tuned for a narrower job, can match the giant on that job while being cheap enough to run anywhere.

Two techniques make small models punch above their weight: quantization (make each number smaller) and distillation (train a small model to imitate a big one). The first is the one with a satisfying, concrete explanation, so let us build the intuition.

Quantization: shrink the numbers

A model is billions of numbers (weights). After training, they are usually stored as 32-bit or 16-bit floats. A 7-billion-parameter model at 16 bits is about 14 GB, which is why it needs a serious GPU. Quantization stores each weight in fewer bits, often 8 or even 4, which shrinks the model and speeds it up.

The core move is mapping a range of floats onto a small set of integers. Here is 8-bit quantization in a few lines:

import numpy as np

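def quantize(weights, bits=8):
    # Symmetric quantization with one scale for the whole array.
    # The result is stored as int8, so this only works for bits <= 8.
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(weights).max() / qmax
    # Clip to [-qmax, qmax] as a guard before the cast to int8.
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale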

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.85, 0.33, -0.04, 0.91], dtype=np.float32)
q, scale = quantize(w)
print(q)                 # small integers
print(dequantize(q, scale))  # close to the originals, not exact

Three details that matter:

You store integers plus a scale. The weights become int8 (a quarter the size of float32), and you keep a float scale to map them back: one for the whole array in the code above, or one per group of weights in finer-grained schemes. That is most of the size win.
It is lossy. dequantize(*quantize(w)) is close to w, not equal. Quantization trades a little numerical precision for a lot of size and speed. The art is doing it where the model barely notices.
Less memory is also less bandwidth. Inference is often bottlenecked on moving weights from memory, not on math. Smaller weights move faster, so quantization speeds things up even when the math is the same.
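
To make the "integers plus a scale per group" point concrete, here is a minimal sketch of group-wise quantization with a byte count. The function names, the group size of 64, and the random test weights are assumptions chosen for illustration, not taken from any particular library.

```python
import numpy as np

def quantize_groups(weights, group_size=64, bits=8):
    """Symmetric quantization with one float32 scale per group (bits <= 8)."""
    qmax = 2 ** (bits - 1) - 1
    # Assumes len(weights) is a multiple of group_size.
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)  # avoid 0/0
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_groups(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=1024).astype(np.float32)
q, scales = quantize_groups(w)
w_hat = dequantize_groups(q, scales)

print("float32 bytes:      ", w.nbytes)                    # 4096
print("int8 + scales bytes:", q.nbytes + scales.nbytes)    # 1024 + 64 = 1088
print("max abs error:      ", np.abs(w - w_hat).max())
```

The scales add a small overhead (64 bytes here), and each weight is off by at most about half a scale step within its group.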

A 7B model at 4-bit is roughly 3.5 GB instead of 14 GB, small enough to sit in a laptop's RAM. That single change is much of what "runs on-device" means.
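
The arithmetic behind those figures is just parameters times bits per weight. A small sketch, assuming 1 GB = 10^9 bytes and counting only the weights (it ignores scale factors, activations, and any runtime overhead); model_size_gb is a helper name made up for this example:

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits:>2}-bit: {model_size_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```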

Distillation: a small model copies a big one

The other half is training. Instead of training a small model only on raw text, you train it to match the outputs of a large teacher model. The student learns not just the right answer but the teacher's whole probability distribution over answers, which carries far more signal than a single label. The result is a small model that behaves, on the target tasks, much closer to its teacher than its size suggests. MAI-Code-1-Flash being "comparable to a bigger model" on coding is this idea applied to one domain.
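
"Match the teacher's whole distribution" can be written as a loss. The sketch below is one common formulation, not necessarily what any particular model used: the KL divergence between teacher and student distributions, both softened by a temperature. The logits and the temperature of 2.0 are invented for illustration. Real training setups often also mix in an ordinary loss on the true labels.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over temperature-softened distributions."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(kl.mean())

teacher = np.array([[4.0, 2.5, 0.5]])        # made-up logits over 3 tokens
close_student = np.array([[3.8, 2.6, 0.4]])  # roughly agrees with the teacher
far_student = np.array([[0.5, 2.5, 4.0]])    # ranks the tokens the other way

print(distillation_loss(close_student, teacher))  # small
print(distillation_loss(far_student, teacher))    # much larger
```

Notice that the loss rewards the student for getting the second and third choices right too, which is the extra signal a single correct label cannot provide.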

The trade-off you are actually making

Small and on-device is not free. What you give up:

Breadth. A task-tuned 5B model is excellent at its job and worse at everything else. The generalist frontier model still wins on the long tail of unusual requests.
Peak quality. On the hardest reasoning, the big model is still ahead. SLMs win on routine tasks where "good and instant and free" beats "slightly better but slower and metered."
Some accuracy to quantization. Push the bits too low and quality degrades; there is a floor.
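
The floor on bit width is easy to see on toy data. This sketch measures the round-trip error of the same symmetric scheme at several bit widths, using random weights as a stand-in for a real layer. Weight error is only a proxy: how much a real model's quality drops depends on the model and the quantization method.

```python
import numpy as np

def roundtrip_error(weights, bits):
    """Mean absolute error after quantizing to `bits` and mapping back."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return float(np.mean(np.abs(weights - q * scale)))

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=10_000).astype(np.float32)  # toy weights

for bits in (8, 6, 4, 3, 2):
    print(f"{bits}-bit mean abs error: {roundtrip_error(w, bits):.4f}")
```

Each bit removed roughly doubles the step between representable values, so the error grows quickly as the bit width drops.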

What you gain is often what production actually cares about: latency (no round-trip to a data center), privacy (data stays on the device), cost (no per-token bill), and resilience (works offline). For autocomplete, classification, summarization, and extraction, the bread and butter of real apps, that bundle frequently wins.

Why this is the more important trend

"Bigger model" makes headlines; "small model on every device" changes the economics. If a good-enough model runs free on the hardware you already own, you stop sending every keystroke to an API. The likely future is not one giant model but a fleet: small specialists on-device for the common case, escalating to a big model only for the hard tail. Knowing which one a task needs, and why a 4-bit 5B model can be the right answer, is becoming a core engineering judgment.

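The "fleet" idea can be sketched as a router that tries the small model first and escalates only when it is unsure. Everything here is hypothetical: the Router class, the confidence score, the 0.8 threshold, and the toy stand-in models are assumptions for illustration. How to get a trustworthy confidence signal from a real model is the hard part, and this sketch does not solve it.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Router:
    small_model: Callable[[str], Tuple[str, float]]  # returns (answer, confidence)
    large_model: Callable[[str], str]
    threshold: float = 0.8                           # assumed cutoff

    def answer(self, prompt: str) -> Tuple[str, str]:
        text, confidence = self.small_model(prompt)
        if confidence >= self.threshold:
            return text, "small"                     # common case: stay on-device
        return self.large_model(prompt), "large"     # hard tail: escalate

# Toy stand-ins: the "small model" is only confident on short prompts.
def toy_small(prompt: str) -> Tuple[str, float]:
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.3
    return f"[small] {prompt}", confidence

def toy_large(prompt: str) -> str:
    return f"[large] {prompt}"

router = Router(toy_small, toy_large)
print(router.answer("complete this line"))
print(router.answer("walk through every edge case in this long and unusual legacy module"))
```
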
And the mechanism is not mysterious once you have quantized a few weights yourself: it is integers and a scale factor, plus a student copying a teacher. If you want to understand models from the numbers up rather than from the marketing down, that is the whole approach of the AI track.