warming up your workspace

Compilers & Interpreters

Build your own programming language from scratch in Python: a scanner that breaks source into tokens, a Pratt parser that grows an abstract syntax tree, a tree-walking interpreter for expressions, variables, control flow and closures, a resolver and honest error reporting, and a bytecode compiler with its own stack VM, capped by a complete mini-language you can actually run. The machinery inside every interpreter, one piece at a time.

10 projects, 250 hands-on levels, run in your browser.

Syllabus

  • Scanning: Every compiler starts by reading raw source text and chopping it into tokens, the words and symbols the rest of the pipeline works with. This project builds a scanner (also called a lexer) from the ground up: classifying characters, walking the source with a cursor, recognizing number literals, the one- and two-character operators, and the identifiers and keywords, then skipping the whitespace and comments that carry no meaning. By the end you will have a full tokenize() that turns a line of source into a clean list of tokens, each a simple (TYPE, value) tuple the parser can consume.
  • A Calculator: With a stream of tokens in hand, the next job is to understand their structure. This project builds an arithmetic calculator the way a real parser does: a cursor that walks the token list, a recursive-descent grammar that encodes precedence (multiplication binds tighter than addition) and left-associativity, the elegant Pratt technique that drives the same result from a table of binding powers, parentheses that override the defaults, and finally a tree-walking evaluator that turns the syntax tree into a number. The result is a working calculator you can read source into and get the right answer out of, the smallest complete language pipeline.
  • Parsing Expressions: A calculator only knows numbers; a language knows strings, booleans, nil, and variables too, joined by comparison, equality, and logical operators. This project widens the parser to the full expression grammar. You build the constructors for every kind of AST node, extend the precedence-climbing parser with a complete table of binding powers (so `and` binds looser than `==`, which binds looser than `+`), give `and` and `or` their own node type for the short-circuit semantics to come, and write an s-expression printer that turns any tree back into readable text like `(+ 1 (* 2 3))`. By the end the parser accepts everything an expression can be, ready for the evaluator.
  • Evaluating Expressions: A parse tree is only a description; to run a program you must walk that tree and compute what it means. This project builds the tree-walking evaluator at the heart of every interpreter. Literals evaluate to their own values, arithmetic combines numbers while `+` does double duty as string concatenation, comparison and equality produce booleans, and the logical `and`/`or` short-circuit according to the language's own definition of truth (only `nil` and `false` are falsey). Finally you handle the operations that cannot work, dividing by zero or adding a number to a string, by raising clear runtime errors instead of crashing. The result evaluates any expression the parser can build.
  • Statements & State: An expression produces a value; a statement makes something happen. This project turns the evaluator into an interpreter of programs. You build statement nodes (print and bare expressions), the environment that stores variables as a chain of scopes, variable declaration and assignment that read and write it, and the block scoping that lets an inner variable shadow an outer one and vanish when its block ends. The environment is modeled as a list of dictionaries, innermost scope first, so looking up a name walks outward until it is found. With state, the language can finally remember things between statements.
  • Control Flow: A language that can only run statements top to bottom is barely a language. Control flow gives it branches and loops. This project builds if/else that chooses a path by the condition's truthiness, while loops that repeat until their condition fails, and for loops, which you implement not as a new construct but by desugaring them into an initializer, a while, and an increment. You wire the logical operators into conditions, and finally add break and continue, which jump out of or skip ahead in a loop using control-flow signals (Python exceptions used as a clean non-local jump). With these, the interpreter can express any algorithm.
  • Functions & Closures: Functions turn a script into a language you can build with. This project adds them in full. A call expression names a callee and its arguments and checks the arity; native functions let the interpreter reach into the host language for primitives like clock and sqrt; a function declaration binds a function VALUE that remembers its parameters, body, and the environment it was born in. The return statement unwinds the call with a control-flow signal, defaulting to nil when omitted. Finally closures fall out for free: because a function captures its defining environment, it can read variables from where it was created, which makes recursion and counter-makers just work.
  • Resolving, Scope & Errors: A robust interpreter does two extra jobs around evaluation. First, a static resolution pass walks the program before it runs and pins each variable reference to the exact scope it refers to, measured as a depth; this is what makes closures capture the right variable and fixes a famous bug where a loop variable leaks into every closure. Second, it reports errors well: static errors like redeclaring a variable or returning outside a function are caught before running, runtime errors carry a line number and a clear message, and when the parser hits a syntax error it enters panic mode and synchronizes to the next statement boundary so it can keep going and report more than one problem at a time.
  • A Bytecode VM: Tree-walking is simple but slow; production interpreters compile to bytecode and run it on a virtual machine. This project builds that second backend. A chunk bundles a flat list of instructions with a pool of constants; instructions are tiny tuples like ('CONST', 0) or ('ADD',). You build the value stack the VM computes on, the dispatch loop that executes each opcode (constants push, arithmetic pops two and pushes one, negate flips the top), the compiler that walks the AST in post-order so operands are on the stack before their operator runs, and finally jumps, the instructions that move the instruction pointer to implement branches and loops. The same programs from the tree-walker now run on a machine you built.
  • Capstone: A Complete Mini-Language: Nine projects of parts, one language. This capstone wires the scanner, parser, and evaluator into a single working interpreter and exposes one function, run(source), that takes a program as text and returns the list of values it printed. The language has numbers, strings, booleans and nil, variables and assignment, arithmetic and logic, if/else and while, and functions with parameters, return, recursion, and closures. You assemble it from everything you built, then run real little programs, a counter, a summation loop, a recursive factorial, through your own machine and watch the right answers come out. The interpreter is small, but it is complete, and it is yours.

Key concepts

  • Abstract syntax tree (AST): A tree representation of a program's structure, where each node is a construct and its children are its parts. 'Abstract' because it drops surface…
  • Arity: The number of parameters a function declares. Calling it with the wrong number of arguments is an error.
  • Assignment: An expression that updates an EXISTING variable wherever it is defined, returning the assigned value. Unlike declaration it never creates a variable; assigning…
  • Associativity: How operators of equal precedence group: left-associative '1 - 2 - 1' is '(1 - 2) - 1'. In a parser, looping at a level gives left-associativit…
  • AST node: One element of the syntax tree, here a tagged tuple like ('num', 5) or ('binary', '+', left, right). The tag tells the evaluator how to…
  • Backpatching: Emitting a forward jump with a placeholder target, then filling in the real target once the destination is known (after the body is compiled). Needed because y…
  • Binary node: An AST node for an operation on two sub-expressions, ('binary', op, left, right). The evaluator computes both children, then applies the operator.
  • Binding power: A number Pratt parsing assigns each operator: higher binds tighter. The parser keeps folding operators while their power meets the current minimum, which yield…
  • Block: A sequence of statements wrapped in braces that runs in its own nested scope. Entering pushes a scope, leaving pops it.
  • Break & continue: Loop controls: break exits the innermost loop, continue skips to its next iteration. Implemented with control-flow signals the loop catches.
  • Bytecode: A flat sequence of simple instructions a compiler emits from the AST, run by a virtual machine. Faster than tree-walking because the per-node overhead is paid…
  • Bytecode VM: A virtual machine that executes bytecode on a value stack, reading instructions in a dispatch loop driven by an instruction pointer. The second backend, alongs…
  • Call expression: Invoking a function with arguments: ('call', callee, args). The evaluator computes the callee to a function value, evaluates the arguments, and runs th…
  • Call stack: The chain of in-progress function calls, each with its own scope. A return pops a frame; recursion deepens the stack until a base case returns.
  • Chunk: A unit of compiled code: a list of instructions plus a pool of constant values the instructions index into.
  • Closure: A function together with the environment it was defined in, so it can read those variables when called later, even elsewhere. Closures make recursion, higher-o…
  • Comment: Text the scanner skips because it carries no meaning to the program, such as everything after // to the end of a line. Comments produce no tokens.
  • Compiler: A program that translates source code into another form, machine code, bytecode, or another language, ahead of execution. The front end (scanning, parsing) is…
  • Constant pool: The list of literal values a chunk holds; a CONST instruction carries an index into it rather than the value itself, keeping instructions small.
  • Control flow: Constructs that choose or repeat what runs: if/else, while, for, and break/continue. They direct the order of execution instead of running statements straight…
  • Control-flow signal: An exception used as a clean non-local jump to implement break, continue, and return: the signal unwinds the call or loop to wherever it is caught, carrying a…
  • Declaration: A statement that brings a name into existence and binds it, like 'var x = 5;'. It always writes the current (innermost) scope.
  • Desugaring: Rewriting a convenient construct into more primitive ones the interpreter already supports. A for loop desugars to an initializer, a while, and an increment, s…
  • Environment: Where variables live: a chain of scopes, modeled as a list of dictionaries with the innermost first. Lookup walks outward; define writes the front.
  • EOF token: A sentinel token marking the end of input, so the parser can check for the end without reading past the token list. Conventionally ('EOF', None).
  • Evaluator: The component that computes a program's result by walking the AST: literals return themselves, operators combine their evaluated children. Also called the…
  • Expression: A construct that produces a value: a literal, an arithmetic operation, a comparison, a call. Statements often contain expressions.
  • Front end & back end: The front end (scanner + parser) figures out what a program SAYS, producing an AST; the back end (evaluator or code generator) decides what it DOES. The AST is…
  • Function value: A first-class runtime value bundling a function's parameters, body, and the environment it was defined in (its closure). Calling it binds the arguments in…
  • Grammar: The formal rules describing which token sequences are valid programs and how they nest. The parser is a direct encoding of the grammar.
  • Grouping: Parentheses that override the default precedence, forcing an inner expression to be parsed as a unit. They steer parsing but leave no node in the tree (the inn…
  • Identifier: A user-chosen name for a variable or function: a letter or underscore followed by letters, digits, or underscores. Distinguished from keywords only after the w…
  • Instruction pointer: The index of the next instruction the VM will run. Normal execution advances it by one; a jump sets it to a target, implementing control flow.
  • Interpreter: A program that executes another program directly, reading its source (or an intermediate form) and carrying out its meaning, rather than translating it to mach…
  • Jump: An instruction that moves the instruction pointer to a target, implementing branches and loops. A conditional jump (JUMP_IF_FALSE) pops a value and jumps only…
  • Keyword: A reserved word the language gives special meaning (if, while, return, var). The scanner reads a word, then checks it against the keyword set; anything else is…
  • Lexeme: The actual run of characters a token came from, for example the text "123" or "while". The token pairs the lexeme with a classified type.
  • Lexical scope: Scope determined by where code is WRITTEN, not where it runs. A name resolves using the scopes surrounding its definition, which is what makes closures capture…
  • Literal: A value written directly in source: a number, string, boolean, or nil. Its AST node carries the value and evaluates to itself.
  • Logical node: A distinct AST node for and/or, ('logical', op, left, right), kept separate from binary because it short-circuits (evaluates lazily) rather than always…
  • Maximal munch: The scanner's rule of consuming the longest valid lexeme: it reads a whole identifier before deciding if it is a keyword, and prefers the two-character ==…
  • Native function: A built-in implemented in the host language rather than the interpreted one, exposed as a callable value. The interpreter calls it by passing the evaluated arg…
  • Opcode: The operation an instruction performs (CONST, ADD, NEG, JUMP). The VM's dispatch loop branches on it to do the right thing to the stack.
  • Operator overloading: Giving one operator more than one meaning by operand type: here + adds two numbers but concatenates two strings. The evaluator dispatches on the operands.
  • Panic-mode recovery: After a syntax error the parser cannot trust the next tokens, so it discards them up to a statement boundary (synchronizing) and resumes, letting it report sev…
  • Parser: The stage that turns a flat token stream into a tree (the AST) reflecting the program's grammatical structure, enforcing precedence, nesting, and the rules…
  • Post-order emission: Compiling a node by compiling its children first, then the operator, so both operands sit on the stack before the operation runs. The reason bytecode for '…
  • Pratt parsing: A compact parsing technique that drives the whole expression grammar from a table of operator binding powers in a single loop, instead of one function per prec…
  • Precedence: Which operators bind tighter than others (so * groups before +). In the AST, higher-precedence operations sit deeper and are evaluated first.
  • Production rule: One rule of a grammar, like 'term -> factor ((' ' | '/') factor) '. In recursive descent each production becomes a parsing function.
  • Recursion: A function calling itself. It works because a declared function's name is bound in the scope it captured, so the body can resolve its own name, no special…
  • Recursive descent: A top-down parsing technique with one function per grammar rule that call each other recursively. Precedence emerges because lower-precedence rules call higher…
  • REPL: A read-eval-print loop: an interactive prompt that reads a line, runs it through the interpreter, and prints the result. The smallest useful front-end for a la…
  • Resolver: A static pass that runs before execution and pins each variable reference to its scope depth, fixing how many scopes outward to look. It closes a famous closur…
  • Return: A statement that hands a value back to a function's caller and stops the function immediately. Implemented with a control-flow signal that unwinds the call…
  • Runtime error: A failure detected while running, such as dividing by zero or adding a number to a string. A good interpreter reports it with a line and a message instead of c…
  • Scanner (lexer): The first stage: it reads raw source character by character and groups it into tokens, the words and symbols of the language. Also called a lexer or tokenizer.
  • Scope: The region of a program where a name is visible. A block opens a new scope; variables declared inside it vanish when it ends.
  • Scope depth: The distance from a use of a variable to the scope that defines it: 0 for the innermost, higher for outer scopes, -1 (global) when no local scope has it. The r…
  • Shadowing: When a variable declared in an inner scope hides an outer one of the same name. Lookup finds the inner first; the outer reappears when the inner scope ends.
  • Short-circuit evaluation: The logical operators and/or evaluate their right operand only when the left does not already settle the result, so errors on the unreached side are skipped. T…
  • Stack machine: A VM that computes on a value stack: instructions push operands and pop them. A binary op pops two values and pushes one result.
  • Statement: A construct executed for its effect rather than its value, like a print, a variable declaration, an if, or a while. A program is a list of statements.
  • Static error: A mistake caught before the program runs, by the parser or resolver: redeclaring a variable in a scope, using an undeclared name, or returning outside a functi…
  • Synchronize: The recovery step that skips tokens until a safe restart point, typically just past the next semicolon or at the start of a statement keyword, so parsing can c…
  • Syntactic sugar: Syntax that adds convenience but no new power, because it can be expressed in terms of existing constructs. Removing it (desugaring) keeps the core interpreter…
  • Token: The smallest meaningful unit the scanner produces: a tuple of a TYPE (like NUMBER or PLUS) and a value. The parser consumes a stream of these.
  • Token type: The category a token belongs to (NUMBER, IDENT, PLUS, LPAREN, EOF), the thing the parser branches on. Value carries the literal data.
  • Tokenize: To run the scanner over a whole source string, producing the full list of tokens (usually ending in an EOF sentinel) that the parser will read.
  • Tree-walking interpreter: An interpreter that executes a program by directly recursing over its AST, evaluating each node every time the program runs. Simple to write; slower than a byt…
  • Truthiness: A language's rule for which values count as true in a condition. In this language only nil and false are falsey; everything else, including 0 and the empty…
  • Unary node: An AST node for a one-operand operation, like negation -x or logical not !x: ('unary', op, operand).
  • Whitespace: Spaces, tabs, and newlines that separate tokens but produce none. The scanner skips it between tokens.