Simpson's paradox, and a detector for it in 20 lines
Here is a dataset that lies. Sixty-six people, each with two numbers: hours of exercise per week, and blood cholesterol. Fit a single straight line to all of them and the slope is positive: more exercise, more cholesterol. That is absurd, and it is also what the data says, until you split the points by age. Within every age band the slope is negative, the sane direction. Same points, opposite conclusion, depending only on whether you looked at the whole or the parts.
This is Simpson's paradox, and it is not a curiosity. In 1973 it nearly turned into a lawsuit.
The slope is two sums
You do not need a stats library to see this happen. The slope of a least-squares line is the covariance of x and y divided by the variance of x, and both are just sums:
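def slope(xs, ys):
    # Least-squares slope: covariance of x and y over the variance of x.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var  # the 1/n factors cancel; raises ZeroDivisionError if every x is identical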
Run it on everyone pooled together and you get +12. Run it on each age band and every one comes back negative. Nothing is wrong with the math; the same function gives both answers.
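To watch the flip on something small enough to check by hand, here is a minimal sketch on nine made-up points. They are not the sixty-six from the dataset above, so the numbers differ; the band labels and values are invented purely for illustration.

```python
def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical (hours of exercise, cholesterol) pairs per age band.
# Invented for illustration; not the article's dataset.
bands = {
    "30s": [(1, 180), (2, 175), (3, 170)],
    "50s": [(3, 210), (4, 205), (5, 200)],
    "70s": [(5, 240), (6, 235), (7, 230)],
}

# Pool every point, ignoring the band it came from.
pooled_rows = [row for rows in bands.values() for row in rows]
pooled = slope([x for x, _ in pooled_rows], [y for _, y in pooled_rows])

# Fit each band separately with the very same function.
within = {
    band: slope([x for x, _ in rows], [y for _, y in rows])
    for band, rows in bands.items()
}

print(pooled)  # 11.0
print(within)  # {'30s': -5.0, '50s': -5.0, '70s': -5.0}
```

Each band slopes down at -5, yet the pooled line slopes up at +11, because the bands themselves are stacked up and to the right.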
Why pooling flips the sign
The reason is a clean identity. Total covariance decomposes into a within-group part and a between-group part:
cov_total = (size-weighted average of the within-group covariances) + (size-weighted covariance of the group means)
The within-group term is the honest signal: inside a fixed age, more exercise tracks with less cholesterol, so it is negative. The between-group term measures where the groups sit. Older people exercise more, because retirement, and older people also have higher cholesterol, so the group centroids climb to the right and upward together. That between-group covariance is large and positive, and when you pool, it swamps the within-group signal. The pooled line is mostly tracing the path through the group centroids, not the slope inside any group.
Age is doing two jobs at once: it moves a point along the x-axis and along the y-axis. That is what makes it a confounder, a variable that drives both axes, and a strong enough confounder can reverse a trend.
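The identity is easy to check numerically. The sketch below assumes population covariances (divide by n, not n - 1) and weights each group by its share of the points; the helper names cov and decompose are mine, and the data is the same kind of made-up toy, not the real dataset.

```python
def mean(vs):
    return sum(vs) / len(vs)

def cov(xs, ys):
    # Population covariance: divide by n, not n - 1.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def decompose(groups):
    """Return (total, within, between) covariance for a list of groups,
    where each group is a list of (x, y) pairs."""
    all_rows = [r for g in groups for r in g]
    n = len(all_rows)
    grand_x = mean([x for x, _ in all_rows])
    grand_y = mean([y for _, y in all_rows])
    within = between = 0.0
    for g in groups:
        xs, ys = [x for x, _ in g], [y for _, y in g]
        w = len(g) / n  # this group's share of all points
        within += w * cov(xs, ys)
        between += w * (mean(xs) - grand_x) * (mean(ys) - grand_y)
    total = cov([x for x, _ in all_rows], [y for _, y in all_rows])
    return total, within, between

# Hypothetical age bands, invented for illustration.
groups = [
    [(1, 180), (2, 175), (3, 170)],
    [(3, 210), (4, 205), (5, 200)],
    [(5, 240), (6, 235), (7, 230)],
]
total, within, between = decompose(groups)
print(total, within, between)
```

On this toy the within-group part is negative, the between-group part is positive and roughly twelve times larger, and the two add up to the pooled covariance. The pooled sign is just whichever term wins.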
A detector in twenty lines
If the danger is that the pooled trend disagrees with every group, you can just check for it:
def reverses(rows, group_of):
    # rows are (x, y, ...) tuples; group_of maps a row to its group label.
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    pooled = slope(xs, ys)
    groups = {}
    for r in rows:
        groups.setdefault(group_of(r), []).append(r)
    parts = []
    for g, members in groups.items():
        if len(members) < 3:
            continue  # too few points for a meaningful slope
        gx = [m[0] for m in members]
        gy = [m[1] for m in members]
        parts.append(slope(gx, gy))
    # Fire only if at least one group was tested and every tested group disagrees.
    flipped = bool(parts) and all((p > 0) != (pooled > 0) for p in parts)
    return flipped, pooled, parts
Feed it the exercise data grouped by age and it returns flipped = True, with pooled = +12 and three negative group slopes. Before trusting any pooled trend, run this against your plausible confounders: age, region, device, customer tier, time period. If it fires for any of them, the aggregate is hiding something.
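Running the check across several candidate confounders is a short loop. This is a sketch under assumptions: records are dicts, the helper screen and the column names (hours, chol, age_band, region) are hypothetical, and the nine records are invented so that one column triggers a reversal and the other does not.

```python
from collections import defaultdict

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def screen(records, x, y, candidates, min_size=3):
    """Return the pooled slope and the candidate columns whose groups
    all contradict its sign."""
    pooled = slope([r[x] for r in records], [r[y] for r in records])
    suspects = []
    for col in candidates:
        groups = defaultdict(list)
        for r in records:
            groups[r[col]].append(r)
        parts = [
            slope([m[x] for m in g], [m[y] for m in g])
            for g in groups.values()
            if len(g) >= min_size  # skip groups too small to fit
        ]
        if parts and all((p > 0) != (pooled > 0) for p in parts):
            suspects.append(col)
    return pooled, suspects

# Hypothetical records, invented for illustration.
records = [
    {"hours": 1, "chol": 180, "age_band": "30s", "region": "north"},
    {"hours": 2, "chol": 175, "age_band": "30s", "region": "south"},
    {"hours": 3, "chol": 170, "age_band": "30s", "region": "east"},
    {"hours": 3, "chol": 210, "age_band": "50s", "region": "north"},
    {"hours": 4, "chol": 205, "age_band": "50s", "region": "south"},
    {"hours": 5, "chol": 200, "age_band": "50s", "region": "east"},
    {"hours": 5, "chol": 240, "age_band": "70s", "region": "north"},
    {"hours": 6, "chol": 235, "age_band": "70s", "region": "south"},
    {"hours": 7, "chol": 230, "age_band": "70s", "region": "east"},
]
pooled, suspects = screen(records, "hours", "chol", ["age_band", "region"])
print(pooled, suspects)  # 11.0 ['age_band']
```

Splitting by region leaves every slope positive, so only age_band is flagged. Note that a group whose x values are all identical would make slope divide by zero; real data needs a guard for that.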
The same trap in rates
It is not only about regression lines. In 1973 the University of California, Berkeley looked like it was rejecting women from graduate school: across the university, men were admitted at about 44 percent and women at about 35. It looked like clear bias. But split the applicants by department and women were admitted at an equal or higher rate in four of the six largest departments. Women were applying in larger numbers to the more competitive departments, the ones with low admit rates for everyone. Department was the confounder, moving both who applied and the admit rate. The aggregate reversed the truth. Same paradox, counted in percentages instead of slopes.
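The rate version needs nothing more than division. Here is a sketch with two invented departments; the counts are hypothetical and chosen to make the arithmetic obvious, they are not the Berkeley figures.

```python
# (applicants, admitted) per department and sex.
# Hypothetical counts for illustration; not the Berkeley data.
counts = {
    "easy": {"men": (80, 48), "women": (20, 14)},
    "hard": {"men": (20, 4), "women": (80, 20)},
}

def rate(applied, admitted):
    return admitted / applied

def overall(sex):
    # Pool applicants and admits across departments before dividing.
    applied = sum(dept[sex][0] for dept in counts.values())
    admitted = sum(dept[sex][1] for dept in counts.values())
    return rate(applied, admitted)

by_dept = {
    name: {sex: rate(*pair) for sex, pair in dept.items()}
    for name, dept in counts.items()
}

women_ahead_everywhere = all(r["women"] > r["men"] for r in by_dept.values())
women_behind_overall = overall("women") < overall("men")

print(by_dept)  # easy: men 0.6, women 0.7; hard: men 0.2, women 0.25
print(overall("men"), overall("women"))  # 0.52 0.34
print(women_ahead_everywhere, women_behind_overall)  # True True
```

Women beat men in both departments and still trail by eighteen points overall, purely because most of them applied to the department that rejects nearly everyone.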
Why the detector is a smoke alarm, not a verdict
Here is the part most write-ups skip, and the part that matters most. The detector tells you that a reversal exists. It does not tell you which view is correct. That is a causal question, not a statistical one, and the honest answer depends on what the grouping variable actually is.
If the group is a genuine confounder that sits upstream of both variables, like age here, then the within-group slopes are the truth and the pooled slope is the artifact. But if you split on the wrong variable you make things worse. Condition on a collider, a variable that both x and y cause, and you can manufacture a paradox out of unrelated quantities. Condition on something that sits downstream of the treatment, a mediator, and you erase the very effect you were trying to measure. The math of reverses() cannot tell these cases apart; they all look like a sign flip.
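Both failure modes can be reproduced without any randomness. The sketch below uses constructed toy data, not anything measured: a full grid in which x and y are unrelated by design, grouped on their sum (a collider), and a chain in which x affects y only through an intermediate m (a mediator). The helper group_slopes is a hypothetical name.

```python
def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def group_slopes(rows, key):
    groups = {}
    for r in rows:
        groups.setdefault(key(r), []).append(r)
    return {
        k: slope([x for x, _ in g], [y for _, y in g])
        for k, g in groups.items()
        if len({x for x, _ in g}) > 1  # skip groups with no spread in x
    }

# Collider: every (x, y) combination appears once, so x and y are unrelated,
# but both feed into s = x + y.
grid = [(x, y) for x in range(5) for y in range(5)]
pooled_grid = slope([x for x, _ in grid], [y for _, y in grid])
by_collider = group_slopes(grid, key=lambda r: r[0] + r[1])

# Mediator: x sets m = x // 2, and y = 10 * m depends on x only through m.
chain = [(x, 10 * (x // 2)) for x in range(8)]
pooled_chain = slope([x for x, _ in chain], [y for _, y in chain])
by_mediator = group_slopes(chain, key=lambda r: r[1] // 10)

print(pooled_grid, by_collider)    # pooled 0.0; every group slope is -1.0
print(pooled_chain, by_mediator)   # pooled positive; every group slope is 0.0
```

In the first case the groups invent a negative trend between two quantities that have none. In the second they flatten a real positive effect to zero. In both, the group slopes are the artifact and the pooled slope is the one to believe.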
So the rule is not "always trust the groups." It is "a reversal means stop and draw the causal arrows." The detector is the smoke alarm. The causal diagram is the investigation. If you want the formal version of all this, look up the d-separation rules and the back-door criterion for causal graphs; they tell you which variables to condition on, and Simpson's paradox is a textbook case of what they resolve.
The toy is worth building anyway, because once you have watched the same slope() function return +12 and a row of negatives on the same data, you stop trusting any single aggregate on sight. That instinct is most of what good data analysis is.