warming up your workspace

Build your own JSON parser, and meet recursive descent

You call json.loads constantly and never think about it. But turning the text {"x": [1, 2, 3]} into a Python dict containing a list is a small act of translation, and the technique behind it, recursive descent, is the same one that parses real programming languages. JSON is the perfect first parser to write precisely because the grammar is tiny but complete: it has nesting, recursion, and every category of token a parser deals with.

By the end you will have a parse() that turns JSON text into Python objects, in about 60 lines, and you will understand the shape of every recursive-descent parser you ever read after.

The one idea: the grammar is recursive, so the parser is too

JSON is defined recursively. A value is one of: an object, an array, a string, a number, or true/false/null. And an object's values are themselves values, and an array's elements are values. The definition refers to itself.

When the grammar is recursive, the parser can be too: one function per kind of thing, and they call each other. That direct mirroring, grammar rule → function, is what "recursive descent" means, and it is why this style is so readable.

The setup: a cursor over the text

We walk the string with a single moving index, i. Every parsing function reads from i and leaves it pointing just past whatever it consumed.

class Parser:
    def __init__(self, text):
        self.s = text
        self.i = 0

    def parse(self):
        self.ws()
        value = self.value()
        self.ws()
        if self.i != len(self.s):
            raise ValueError(f"trailing data at position {self.i}")
        return value

    def ws(self):                       # skip whitespace
        while self.i < len(self.s) and self.s[self.i] in " \t\n\r":
            self.i += 1

parse is the whole job in miniature: skip space, read one value, skip space, and insist nothing is left over. That last check is what makes {} garbage an error instead of a silent success.

The dispatcher: one look-ahead character

The heart of recursive descent is choosing which rule applies by peeking at the next character. JSON makes this easy, every value type starts with a distinct character:

    def value(self):
        c = self.s[self.i]
        if c == '{': return self.object()
        if c == '[': return self.array()
        if c == '"': return self.string()
        if c == '-' or c.isdigit(): return self.number()
        if self.s.startswith("true",  self.i): self.i += 4; return True
        if self.s.startswith("false", self.i): self.i += 5; return False
        if self.s.startswith("null",  self.i): self.i += 4; return None
        raise ValueError(f"unexpected {c!r} at position {self.i}")

One character of look-ahead tells us exactly which function to call. This property, that the next token determines the rule, is what makes a grammar easy to parse this way.

The recursive cases: objects and arrays

These are where the parser calls back into value, which is what lets JSON nest arbitrarily deep.

    def object(self):
        self.i += 1                      # consume '{'
        obj = {}
        self.ws()
        if self.s[self.i] == '}':        # empty object
            self.i += 1
            return obj
        while True:
            self.ws()
            key = self.string()          # keys are strings
            self.ws()
            if self.s[self.i] != ':':
                raise ValueError("expected ':'")
            self.i += 1
            self.ws()
            obj[key] = self.value()      # <-- recursion: a value can be anything
            self.ws()
            c = self.s[self.i]; self.i += 1
            if c == '}': return obj
            if c != ',': raise ValueError("expected ',' or '}'")

    def array(self):
        self.i += 1                      # consume '['
        arr = []
        self.ws()
        if self.s[self.i] == ']':
            self.i += 1
            return arr
        while True:
            self.ws()
            arr.append(self.value())     # <-- recursion again
            self.ws()
            c = self.s[self.i]; self.i += 1
            if c == ']': return arr
            if c != ',': raise ValueError("expected ',' or ']'")

The self.value() calls are the whole trick. An array element can be another array, an object value can be another object, and the recursion handles any depth without any extra code.

The leaves: strings and numbers

    def string(self):
        self.i += 1                      # opening quote
        out = []
        escapes = {'"': '"', '\\': '\\', '/': '/',
                   'n': '\n', 't': '\t', 'r': '\r', 'b': '\b', 'f': '\f'}
        while self.s[self.i] != '"':
            c = self.s[self.i]
            if c == '\\':
                self.i += 1
                out.append(escapes[self.s[self.i]])
            else:
                out.append(c)
            self.i += 1
        self.i += 1                      # closing quote
        return "".join(out)

    def number(self):
        start = self.i
        if self.s[self.i] == '-':
            self.i += 1
        while self.i < len(self.s) and self.s[self.i] in "0123456789.eE+-":
            self.i += 1
        text = self.s[start:self.i]
        return float(text) if any(ch in text for ch in ".eE") else int(text)

Run it

data = Parser('{"name": "iwtlp", "nums": [1, 2.5, -3], "ok": true, "x": null}').parse()
print(data)
# {'name': 'iwtlp', 'nums': [1, 2.5, -3], 'ok': True, 'x': None}
print(data["nums"][1])   # 2.5

Real Python objects, out of raw text, from code you can read top to bottom.

What you just built, and what's missing

This is a real recursive-descent parser. To be fully standards-compliant you would add: \uXXXX unicode escapes in strings, a stricter number grammar (JSON forbids leading zeros and a bare leading .), and friendlier errors with line and column numbers. Those are refinements, not new ideas, the structure stays exactly as above.

And that structure is the payoff. Swap the grammar and the same skeleton parses arithmetic expressions, a config format, or a small programming language: one function per rule, one character (or token) of look-ahead to choose, recursion for nesting. JSON was just the gentlest possible version of the job.

If you want to push this all the way to parsing a real language, expressions with precedence, statements, then evaluating them, that is exactly what the compilers track builds, starting from a parser that looks a lot like this one.