warming up your workspace

Build your own git from scratch, and watch its hashes match the real thing

Almost everyone uses git every day and almost no one knows what it does. We learn the verbs, add, commit, push, pull, and treat the rest as a black box that occasionally eats our work. But the core of git is small. Not small as in simplified, small as in you can rebuild the real thing in about a hundred and fifty lines of Python, and the objects it produces will be byte for byte identical to the ones real git produces.

That last part is the fun bit. By the end of this you will have a program that hashes a file, and git hash-object on the same file prints the same hash. Same trees, same commits. Not a toy that mimics git. Git.

This is the written companion to the video build. It goes a step further at the end and shows how merge works, which the video only teases.

The one idea

Git is a key value store. You hand it some content, it hands you back a hash of that content, and it files the content away under that hash. The key is the hash of the value. That is content addressing, and it is the same idea behind IPFS, Bitcoin block hashes, and the Nix package manager.

Two consequences fall out of it immediately, and they are most of what makes git feel magic:

  • Identical content always produces the identical hash, so git stores it once. Commit the same file in a thousand places and there is one object on disk.
  • The hash is a fingerprint. Change a single byte and the name changes completely. So comparing two hashes tells you, with effectively perfect confidence, whether two things are the same.

Everything else, blobs, trees, commits, branches, is built on top of that one move.

The store

A repository starts as a few folders. No database, no server, just files.

import hashlib, zlib, os

REPO = ".mygit"

def init():
    os.makedirs(f"{REPO}/objects", exist_ok=True)
    os.makedirs(f"{REPO}/refs/heads", exist_ok=True)
    open(f"{REPO}/HEAD", "w").write("ref: refs/heads/main\n")

Now the heart of it. To store content we hash it and write it down under that hash:

def hash_object(data, type_="blob"):
    full = f"{type_} {len(data)}\0".encode() + data
    oid = hashlib.sha1(full).hexdigest()
    path = f"{REPO}/objects/{oid[:2]}/{oid[2:]}"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "wb").write(zlib.compress(full))
    return oid

Three details matter, and they are exactly what real git does:

  • The header. Before hashing we prepend blob 11\0, the type and the byte length. Hash the same file without this header and you get a different result than git. This header is the reason our hashes will match.
  • zlib. Objects are stored compressed.
  • The fan-out. We file the object under a folder named after the first two characters of its hash. That is why a large repo does not end up with millions of files in one directory.

Reading is the mirror image: open the file, decompress, split off the header, and you are left with the original content.

def read_object(oid):
    full = zlib.decompress(open(f"{REPO}/objects/{oid[:2]}/{oid[2:]}", "rb").read())
    header, _, content = full.partition(b"\0")
    return header.decode().split()[0], content

Blobs, trees, commits

A blob is the contents of one file. That is just hash_object(file_bytes).

A tree captures a directory: a list mapping each filename to the hash of its blob. We hash that list too, so a folder also becomes a single object with a single name.

def write_tree(files):                       # files: {name: bytes}
    entries = b""
    for name in sorted(files):
        oid = hash_object(files[name])
        entries += f"100644 {name}\0".encode() + bytes.fromhex(oid)
    return hash_object(entries, "tree")

A commit records the tree for this snapshot, the commit that came before it, who made it, and a message. And then, like everything else, we hash it.

def commit(tree, message, parent, ts=1700000000, tz="+0000"):
    who = f"you <you@iwtlp.com> {ts} {tz}"
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    oid = hash_object(("\n".join(lines) + "\n").encode(), "commit")
    open(f"{REPO}/{branch_ref()}", "w").write(oid + "\n")
    return oid

This is the payoff of the design. Files, directories, and snapshots all become the same kind of thing: an object with a content hash for a name.

Branches are just files

This is the part that surprises people. A branch is a tiny text file holding one commit hash. The word main is a filename under refs/heads. There is no branch data type. Branches are a user interface on top of files.

def branch_ref():
    head = open(f"{REPO}/HEAD").read().strip()
    return head[5:] if head.startswith("ref: ") else None

def resolve_head():
    path = f"{REPO}/{branch_ref()}"
    return open(path).read().strip() if os.path.exists(path) else None

HEAD is a pointer to the branch you are on. To find the latest commit you read HEAD, follow it to the branch file, and read the hash inside. Making a commit writes the new hash into that file. History does not move. The label does.

The proof: this is really git

Here is where it stops being a lookalike. Real git computes a blob's hash as the sha1 of blob <len>\0<content>, which is exactly what hash_object does. So:

$ echo "hello world" | git hash-object --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

and our hash_object(b"hello world\n") returns the same forty characters. The trees match too, because git's tree entry format is <mode> <name>\0<20 raw bytes>, sorted by name, which is what write_tree builds.

Commits are stricter. A real git commit has both an author and a committer line, each with a timestamp and timezone. Give our commit the same fixed timestamp and identity, and run real git with matching GIT_AUTHOR_DATE and GIT_COMMITTER_DATE, and the commit hashes line up as well. I wrote a tiny script that builds a blob, a tree, and a commit with both our code and real git and asserts all three are equal. They are. We are not mimicking git. We are git.

log and diff

History is a walk. Start at HEAD, read the commit, print it, follow the parent link, repeat until there is no parent left. No database query, just a loop following pointers.

def log():
    oid = resolve_head()
    while oid:
        text = read_object(oid)[1].decode()
        print(oid[:7], text.split("\n\n", 1)[1].strip())
        oid = next((l.split()[1] for l in text.splitlines() if l.startswith("parent ")), None)

And diff is the counterintuitive one. People assume git stores patches, the little deltas between versions. It does not. Git stores whole snapshots and computes the difference between any two on demand. Storing everything is cheap because identical blobs dedupe for free, and comparing is cheap enough to do the moment you ask.

import difflib

def diff(a, b):                              # a, b are blob hashes
    A = read_object(a)[1].decode().splitlines()
    B = read_object(b)[1].decode().splitlines()
    print("\n".join(difflib.unified_diff(A, B, "a", "b", lineterm="")))

There is a bonus property hiding in read_object. Flip a single bit inside a stored object and the decompressed bytes no longer match the hash in the filename. The read fails. Corruption is detected by design, because every read is also a checksum. This is the third thing Linus wanted when he wrote git in 2005, after fast and distributed: bulletproof.

How merge actually works

The video stops at a working git. Merge is where the object graph really earns its shape, so let us build it.

First, the mental model. Two branches have diverged. Each is a chain of commits pointing back to its parent. Somewhere back in time they share a common ancestor, the last commit they had before they split. A merge is: take the common ancestor as a baseline, take each side's version, and combine them.

Finding the common ancestor

Walk the ancestry of the first commit and collect every commit it can reach. Then walk the second commit back through its parents until you hit one of those. The first hit is the merge base.

def parents(oid):
    text = read_object(oid)[1].decode()
    return [l.split()[1] for l in text.splitlines() if l.startswith("parent ")]

def ancestors(oid):
    seen, stack = set(), [oid]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(parents(cur))
    return seen

def merge_base(a, b):
    a_anc = ancestors(a)
    stack = [b]
    while stack:                             # walk b's history, breadth is fine too
        cur = stack.pop()
        if cur in a_anc:
            return cur                       # first shared commit
        stack.extend(parents(cur))
    return None

Two special cases fall out for free. If the merge base is b, then b is already an ancestor of a and there is nothing to do. If the merge base is a, then a has no new commits, so you can fast-forward: just move the branch pointer to b. No merge commit needed. This is exactly git's "Fast-forward" message.

The three-way merge

When both sides have moved, we compare three versions of every file: the base (from the common ancestor), ours, and theirs. The rule per file is simple:

  • If the file is unchanged from base on one side, take the other side's version.
  • If both sides made the identical change, take it.
  • If both sides changed it differently, that is a conflict.
def tree_files(commit_oid):
    # map filename -> blob hash for the snapshot at this commit (flat tree)
    tree = read_object(commit_oid)[1].decode().split("tree ", 1)[1][:40]
    entries, raw = {}, read_object(tree)[1]
    i = 0
    while i < len(raw):
        nul = raw.index(b"\0", i)
        mode, name = raw[:nul].split(b" ")[-2:] if False else raw[i:nul].decode().split(" ", 1)
        oid = raw[nul + 1: nul + 21].hex()
        entries[name] = oid
        i = nul + 21
    return entries

def three_way(base, ours, theirs):
    b, o, t = tree_files(base), tree_files(ours), tree_files(theirs)
    result, conflicts = {}, []
    for name in set(b) | set(o) | set(t):
        bb, oo, tt = b.get(name), o.get(name), t.get(name)
        if oo == tt:                         # both sides agree (or both deleted)
            if oo: result[name] = oo
        elif oo == bb:                       # only theirs changed
            if tt: result[name] = tt
        elif tt == bb:                       # only ours changed
            if oo: result[name] = oo
        else:                                # both changed differently
            conflicts.append(name)
    return result, conflicts

If there are no conflicts, we write the merged set of files as a new tree and make a commit with two parents, ours and theirs. That second parent is the only structural thing a merge adds to the graph. Everything else is the same blobs, trees, and commits we already built.

def merge(ours, theirs, message="merge"):
    base = merge_base(ours, theirs)
    if base == theirs:
        return ours                          # already up to date
    if base == ours:
        # fast-forward: move the branch to theirs
        open(f"{REPO}/{branch_ref()}", "w").write(theirs + "\n")
        return theirs
    files, conflicts = three_way(base, ours, theirs)
    if conflicts:
        raise SystemExit("CONFLICT in: " + ", ".join(conflicts))
    tree = write_tree({n: read_object(o)[1] for n, o in files.items()})
    # a merge commit has two parents; here is the shape, simplified
    ...

Real git does the per-file step at the line level, not the whole-file level, which is why it can auto-merge two edits to different parts of the same file. But the structure is exactly this: find the common ancestor, three-way combine, and record a commit with two parents. Once you see that branches are pointers into a graph of content-addressed snapshots, merge stops being scary. It is just walking the graph.

What is left

Real git adds a staging area so you can build a commit a piece at a time, packfiles that delta-compress objects so history does not balloon, and remotes for pushing and pulling. Those are real engineering, but none of them change the core you just built. Content-addressed objects in a graph, named by their own hashes, is the whole idea, and now it is not a black box.

If you want to build the rest, the full from-scratch track is on iwtlp.com. Learn it by building it.