warming up your workspace

The three joins, explained by implementing them

You write JOIN in SQL and rows from two tables come back matched up. The database makes it look like one operation, but underneath there are really only three algorithms doing the work: nested-loop join, hash join, and sort-merge join. Every join you have ever run was one of these.

They are short, each is a few lines of Python, and writing them is the fastest way to understand the thing experienced engineers know intuitively: why the query planner picks a different join depending on the data. Let us build all three and watch their trade-offs fall out.

The setup

Two lists of dicts, joined on a shared key, like users JOIN orders ON users.id = orders.user_id:

users = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Linus"},
    {"id": 3, "name": "Grace"},
]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]

An inner join returns one combined row per matching pair. Ada has two orders, so she appears twice; Linus has none, so he doesn't appear.

1. Nested-loop join: the obvious one

Compare every left row against every right row. If the keys match, emit the combined row.

def nested_loop_join(left, right, lkey, rkey):
    out = []
    for l in left:
        for r in right:
            if l[lkey] == r[rkey]:
                out.append({**l, **r})
    return out

nested_loop_join(users, orders, "id", "user_id")
# Ada+book, Ada+pen, Grace+lamp

It is correct and dead simple. The problem is the two nested loops: with n left rows and m right rows it does n × m comparisons. A thousand users and a million orders is a billion comparisons. This is O(n·m), and it is why a join on big unindexed tables can hang.

It is not useless, though. For tiny inputs it wins precisely because it is simple, no setup cost. Databases use it for small tables or when there's no equality to exploit (e.g. joining on a.x < b.y).

2. Hash join: trade memory for speed

The nested loop re-scans the right side for every left row. Instead, scan the right side once and build a hash table from key to rows. Then each left row is a single lookup.

def hash_join(left, right, lkey, rkey):
    index = {}
    for r in right:                      # build phase: one pass over right
        index.setdefault(r[rkey], []).append(r)

    out = []
    for l in left:                       # probe phase: one pass over left
        for r in index.get(l[lkey], []):
            out.append({**l, **r})
    return out

One pass to build, one pass to probe: O(n + m) instead of O(n·m). That billion-comparison join becomes about two million operations. The cost is memory, the hash table holds the right side, and it only works for equality joins (you can't hash a <).

This is the workhorse for large equality joins with no useful sort order. The catch in the real world: if the build side is bigger than memory, the database has to spill it to disk in partitions, which is where "hash join" gets its reputation for being memory-hungry.

3. Sort-merge join: when the data is already ordered

Sort both sides by the join key, then walk them with two pointers, advancing whichever is behind, like merging two sorted lists.

def merge_join(left, right, lkey, rkey):
    left  = sorted(left,  key=lambda x: x[lkey])
    right = sorted(right, key=lambda x: x[rkey])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][lkey], right[j][rkey]
        if lv < rv:
            i += 1                       # left is behind, advance it
        elif lv > rv:
            j += 1                       # right is behind, advance it
        else:                            # keys match: pair all equal rows
            j_end = j
            while j_end < len(right) and right[j_end][rkey] == lv:
                j_end += 1
            i_start = i
            while i < len(left) and left[i][lkey] == lv:
                for k in range(j, j_end):
                    out.append({**left[i], **right[k]})
                i += 1
            j = j_end
    return out

The inner block looks busy, but it is just handling duplicates correctly: when both sides have several rows with the same key, every left match pairs with every right match. The walk itself is O(n + m); the cost is the sort, O(n log n + m log m)unless the data is already sorted, in which case the sort is free and merge join is unbeatable.

That "already sorted" case is exactly when planners reach for it: data coming off a sorted index, or the output of an earlier sort. As a bonus, the result comes out sorted, which the next operation might need anyway.

The payoff: now the planner makes sense

All three return the same rows. They differ only in cost, and that is the whole reason a database has a query planner:

Join Cost Best when
Nested-loop O(n·m) inputs tiny, or a non-equality condition
Hash O(n + m), uses memory large equality join, no useful order
Sort-merge O(n log n + m log m), or O(n+m) if pre-sorted data already sorted, or output needs to be

When you EXPLAIN a query and see "Hash Join" or "Merge Join," the planner estimated row counts, checked for indexes and sort order, and picked the cheapest of these three. That is also why adding an index can flip a slow nested-loop into a fast merge or hash join, and why a join that was fast on small data falls off a cliff as the tables grow: the planner switched strategies, or should have.

Knowing the three algorithms turns the query planner from a black box into something you can predict and nudge. That move, building the engine so the buttons make sense, is the whole idea of the data engineering track, where joins are one stop on the way to building a small query engine from scratch.