Is It a Regression, or Just a Flaky Test? Teach Your CI to Tell the Difference

A red build is a question, not an answer. It tells you something failed. It does not tell you the only thing you actually need to know at 4:58pm on a Friday: is this a real regression a human has to fix right now, or a flaky test that will be green on the next run and is just slowly poisoning everyone’s trust in the suite?

Get that wrong in either direction and it’s expensive. Treat every red as real, and you block good merges on noise until people start reflexively hitting “re-run” — at which point the suite has stopped meaning anything. Treat every red as flake, and a genuine bug rides a lucky rerun straight into production.

The usual “fix” — automatically retrying a failing test until it passes — is the worst of both worlds. It hides flakes instead of measuring them, and it lets a real regression through the moment a rerun happens to go green. We can do better, and it doesn’t take much code.

The one signal everybody throws away

Here’s the insight the retry-until-green crowd is stepping over: you already have the evidence you need, in your run history.

Every CI run happens against a specific commit (a git SHA). If you record the per-test outcome of each run alongside its SHA, one pattern is dispositive:

The same commit both passed and failed a test.

That’s not ambiguous. The code didn’t change between those two runs (the SHA is identical), so whatever changed the result lives outside the commit: in the test itself, in its timing, or in the environment it runs in. That is nondeterminism by definition, and the test is flaky, full stop, no matter how red it looks right now. This “same-SHA contradiction” is the single strongest, cheapest flake signal there is, and most CI setups throw it in the bin the moment the run finishes.
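Here is a minimal sketch of that check, assuming each recorded run is stored as a { sha, outcome } pair. The Point shape and the body of hasSameShaContradiction are my assumptions for illustration; only the function name is borrowed from the classifier listing further down.

```typescript
// Assumed record shape: one entry per test per CI run.
type Outcome = 'passed' | 'failed';

interface Point {
  sha: string;      // commit the run executed against
  outcome: Outcome; // result of this one test in that run
}

// True when any single commit has both a pass and a fail on record.
function hasSameShaContradiction(points: Point[]): boolean {
  const seen = new Map<string, Set<Outcome>>();
  for (const p of points) {
    const outcomes = seen.get(p.sha) ?? new Set<Outcome>();
    outcomes.add(p.outcome);
    seen.set(p.sha, outcomes);
    if (outcomes.size > 1) return true; // same code, two different results
  }
  return false;
}

// Commit d4e5f6 failed, then passed on a rerun: a contradiction.
const rerunWentGreen: Point[] = [
  { sha: 'a1b2c3', outcome: 'passed' },
  { sha: 'd4e5f6', outcome: 'failed' },
  { sha: 'd4e5f6', outcome: 'passed' },
];

// Commit d4e5f6 failed on every run: no contradiction.
const stayedRed: Point[] = [
  { sha: 'a1b2c3', outcome: 'passed' },
  { sha: 'd4e5f6', outcome: 'failed' },
  { sha: 'd4e5f6', outcome: 'failed' },
];

console.log(hasSameShaContradiction(rerunWentGreen)); // true
console.log(hasSameShaContradiction(stayedRed));      // false
```

The check is order-independent: it only asks whether one SHA ever produced two different outcomes, so it works even if reruns are recorded out of sequence.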

A real regression looks completely different in the history: green on the older commits, then failing and staying failed on the newest one(s). The failures cluster in one unbroken block at the recent end of the timeline, and they don’t un-fail on their own.

A tiny classifier

That’s enough to write a classifier. Given a per-test series of (sha, outcome) points ordered by time, we decide, strongest signal first:

1. Same-SHA contradiction → flaky. One commit produced both a pass and a fail. Nothing outweighs this.
2. Monotone trailing failures → regression. Every failure is in one unbroken block at the newest end, that block is long enough to trust (say, 3 consecutive runs), and there was a green run before it.
3. Scattered, intermittent red with no trailing block → flaky. It fails sometimes, on different commits, but keeps recovering.
4. Never green in the window → broken. New or perpetually red.

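In TypeScript the core is about forty lines. The decision function is below; its two helpers, hasSameShaContradiction and trailingFailures, and the Point, Config, and Verdict types are not shown here. The comments (1) to (4) refer to the four rules above in the order listed: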

function classify(points: Point[], cfg: Config): Verdict {
  const fails = points.filter(p => p.outcome === 'failed').length;
  if (fails === 0) return 'stable';

  // (1) Same code, different result — the strongest flake proof.
  if (hasSameShaContradiction(points)) return 'flaky';

  // (4) Never passed in the window.
  if (fails === points.length) return 'broken';

  // (2) All failures in one trailing block, long enough, with an earlier pass.
  const trailing = trailingFailures(points);
  const hadEarlierPass = points
    .slice(0, points.length - trailing)
    .some(p => p.outcome === 'passed');
  if (trailing === fails && trailing >= cfg.regressionMinConsecutive && hadEarlierPass) {
    return 'regression';
  }

  // (3) Intermittent red that keeps recovering.
  return 'flaky';
}

Note trailing === fails: if there are also older, scattered failures, the red isn’t a clean regression — it’s an intermittent test, so we fall through to flaky. Precision matters here; a false “regression” that’s really a flake trains people to ignore the label.
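The listing leans on a few types and on trailingFailures without showing them. The sketch below is one plausible way to fill them in, not the original implementation: the field names come from how the listing uses them, the Verdict members are the four strings it returns, and the helper body is my assumption. It expects points ordered oldest first.

```typescript
type Outcome = 'passed' | 'failed';
type Verdict = 'stable' | 'flaky' | 'regression' | 'broken';

interface Point {
  sha: string;
  outcome: Outcome;
}

interface Config {
  regressionMinConsecutive: number; // e.g. 3, as suggested in the rules above
}

// Length of the unbroken run of failures at the newest end of the series.
function trailingFailures(points: Point[]): number {
  let count = 0;
  for (let i = points.length - 1; i >= 0 && points[i]?.outcome === 'failed'; i--) {
    count++;
  }
  return count;
}

// Small helper for the demo only.
const countFails = (points: Point[]): number =>
  points.filter(p => p.outcome === 'failed').length;

const cfg: Config = { regressionMinConsecutive: 3 };

// Green, then three reds on the newest commit: trailing equals fails.
const cleanBlock: Point[] = [
  { sha: 'a1', outcome: 'passed' },
  { sha: 'b2', outcome: 'passed' },
  { sha: 'c3', outcome: 'failed' },
  { sha: 'c3', outcome: 'failed' },
  { sha: 'c3', outcome: 'failed' },
];

// Same trailing block, plus one older stray failure: trailing is less than fails.
const withOlderFailure: Point[] = [
  { sha: 'a1', outcome: 'passed' },
  { sha: 'b2', outcome: 'failed' },
  { sha: 'c3', outcome: 'passed' },
  { sha: 'd4', outcome: 'failed' },
  { sha: 'd4', outcome: 'failed' },
  { sha: 'd4', outcome: 'failed' },
];

console.log(trailingFailures(cleanBlock), countFails(cleanBlock));             // 3 3
console.log(trailingFailures(withOlderFailure), countFails(withOlderFailure)); // 3 4
console.log(trailingFailures(cleanBlock) >= cfg.regressionMinConsecutive);     // true
```

Both histories end in three consecutive failures, but only the first satisfies trailing === fails, so only the first would be labelled a regression by the classifier.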

Quarantine, don’t retry

Classification only helps if the CI gate acts on it. The rule that keeps a suite trustworthy:

A regression or a broken test failing on the latest run blocks the build. That’s the whole point — catch real breakage.
A quarantined flaky test failing does not block. It’s still recorded, still shows up on the dashboard, still owes someone a fix — but it cannot hold merges hostage.

A test earns quarantine by being classified flaky and crossing a flake-rate threshold (failures ÷ runs, over a recent window), or by a human pinning it. Crucially, we never auto-retry to hide the failure — we measure the flake rate and surface it. A flake you can see on a dashboard gets fixed; a flake your CI silently swallowed on rerun gets worse.
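As a rough sketch of that rule: the flake rate is failures divided by runs over the most recent runs, and a test is quarantined when it is classified flaky and the rate crosses a threshold, or when a human pins it. The names QuarantineConfig, flakeRate, and shouldQuarantine are hypothetical, and treating "crossing" as greater-than-or-equal is my assumption.

```typescript
type Outcome = 'passed' | 'failed';
type Verdict = 'stable' | 'flaky' | 'regression' | 'broken';

interface Point {
  sha: string;
  outcome: Outcome;
}

// Hypothetical tuning knobs; the values used below are illustrative only.
interface QuarantineConfig {
  window: number;             // how many most-recent runs to look at
  flakeRateThreshold: number; // e.g. 0.2 means 20% of recent runs failed
}

// failures / runs over the last `window` points (points ordered oldest first).
function flakeRate(points: Point[], window: number): number {
  const recent = window > 0 ? points.slice(-window) : [];
  if (recent.length === 0) return 0;
  const failures = recent.filter(p => p.outcome === 'failed').length;
  return failures / recent.length;
}

// Quarantine when a human pinned the test, or when it is classified flaky
// and its recent flake rate is at or above the threshold.
function shouldQuarantine(
  verdict: Verdict,
  points: Point[],
  cfg: QuarantineConfig,
  pinnedByHuman: boolean,
): boolean {
  if (pinnedByHuman) return true;
  return verdict === 'flaky' && flakeRate(points, cfg.window) >= cfg.flakeRateThreshold;
}

// Eight recent runs, two of them red: flake rate 0.25.
const outcomes: Outcome[] = [
  'passed', 'failed', 'passed', 'passed', 'passed', 'failed', 'passed', 'passed',
];
const history: Point[] = outcomes.map((outcome, i) => ({ sha: `sha${i}`, outcome }));
const quarantineCfg: QuarantineConfig = { window: 8, flakeRateThreshold: 0.2 };

console.log(flakeRate(history, quarantineCfg.window));                    // 0.25
console.log(shouldQuarantine('flaky', history, quarantineCfg, false));    // true
console.log(shouldQuarantine('regression', history, quarantineCfg, false)); // false
```

Note that nothing here reruns the test. The rate is computed from runs that already happened, which is what keeps the number honest.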

This is an abstain gate. When the evidence says “this red is noise,” the gate declines to fail the build — but it says so out loud, with a number attached, instead of pretending everything’s fine.
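One way to sketch that gate, under assumptions of my own: TestStatus, GateResult, and evaluateGate are hypothetical names, and the choice to block on a flaky test that is not quarantined is a conservative guess, since the rule above only covers the quarantined case.

```typescript
type Verdict = 'stable' | 'flaky' | 'regression' | 'broken';

// Hypothetical per-test summary handed to the gate after classification.
interface TestStatus {
  name: string;
  verdict: Verdict;
  latestFailed: boolean; // did the test fail on the latest run?
  quarantined: boolean;
  flakeRate: number;     // failures / runs over the recent window
}

interface GateResult {
  blocked: boolean;
  blocking: string[];  // reds that fail the build
  abstained: string[]; // reds the gate declined to act on, still reported
}

function evaluateGate(tests: TestStatus[]): GateResult {
  const blocking: string[] = [];
  const abstained: string[] = [];
  for (const t of tests) {
    if (!t.latestFailed) continue; // green on the latest run: nothing to decide
    if (t.verdict === 'flaky' && t.quarantined) {
      // Abstain, but say so with a number attached.
      const pct = (t.flakeRate * 100).toFixed(0);
      abstained.push(`${t.name}: quarantined flake, flake rate ${pct}%`);
    } else {
      // Regression, broken, or (by assumption) an unquarantined flake.
      blocking.push(`${t.name}: ${t.verdict}`);
    }
  }
  return { blocked: blocking.length > 0, blocking, abstained };
}

const result = evaluateGate([
  { name: 'checkout', verdict: 'regression', latestFailed: true, quarantined: false, flakeRate: 0 },
  { name: 'search', verdict: 'flaky', latestFailed: true, quarantined: true, flakeRate: 0.25 },
  { name: 'login', verdict: 'stable', latestFailed: false, quarantined: false, flakeRate: 0 },
]);

console.log(result.blocked);   // true
console.log(result.blocking);  // [ 'checkout: regression' ]
console.log(result.abstained); // [ 'search: quarantined flake, flake rate 25%' ]
```

The abstained list is the important part: the build log still names every quarantined red, so the failure is visible even when it does not block.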

Where this really bites: integration-heavy products

This isn’t academic. Any product that lives inside other people’s systems has this problem in the extreme. Think of a library link-resolver like LibKey that plants “one-click to the PDF” links inside dozens of third-party discovery platforms, databases, and now AI tools, every one an external surface that can change under you overnight. When your e2e test goes red there, “your regression” and “a partner’s page changed” are genuinely hard to distinguish from a single run. The only way to stay sane is to look at the history: a same-SHA contradiction says the cause lies outside your commit, so treat it as flake; a clean green-to-red transition on the newest commit says the last thing you shipped broke it.

The same is true for browser extensions testing against live publisher DOMs, mobile apps across OS versions, and anything with a big matrix of institutional configurations. The more external surface area, the more your quality signal has to come from patterns over time, not any single red/green.

What I deliberately left out

Because judgment is the job:

No retry-until-green. Explained above — it’s the anti-pattern this whole approach exists to replace.
No statistical confidence intervals (yet). With a short history, a Wilson/Bayesian interval on the flake rate is the honest next step before auto-quarantining a test on thin evidence. The threshold plus a minimum-consecutive guard stand in for now, and I’d add the interval before trusting auto-quarantine at scale.
A JSON file for storage, not a database. The store is an append-only list with the exact shape a test_results table would have; the dashboard query is a windowed group-by. Swapping in Postgres is a store-layer change, not a rewrite.
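For the confidence-interval item, here is roughly what the Wilson score interval would look like on a flake rate. This is the standard textbook formula, not code from the project; wilsonInterval and confidentlyAbove are hypothetical names, and z = 1.96 assumes a 95% interval.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a proportion of `failures` out of `runs`.
// z = 1.96 corresponds to roughly 95% confidence.
function wilsonInterval(failures: number, runs: number, z = 1.96): Interval {
  if (runs <= 0) return { lower: 0, upper: 1 }; // no evidence either way
  const p = failures / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = (p + z2 / (2 * runs)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs))) / denom;
  return {
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}

// Only trust the flake rate when even the low end of the interval
// clears the quarantine threshold.
function confidentlyAbove(failures: number, runs: number, threshold: number): boolean {
  return wilsonInterval(failures, runs).lower >= threshold;
}

// Same observed rate (40%), very different amounts of evidence.
console.log(wilsonInterval(2, 5));           // wide interval
console.log(wilsonInterval(20, 50));         // much narrower interval
console.log(confidentlyAbove(2, 5, 0.2));    // false
console.log(confidentlyAbove(20, 50, 0.2));  // true
```

With 2 failures in 5 runs the interval spans roughly 0.12 to 0.77, so a 20% threshold is not confidently crossed; with 20 in 50 the lower bound sits well above it. That is the "thin evidence" guard in one comparison.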

The takeaway

Flaky tests aren’t a moral failing to be stamped out with reruns — they’re a signal to be measured. Record every run with its commit, let the history tell you same-SHA-contradiction (flake) from monotone-trailing-red (regression), quarantine the flakes loudly, and block only what’s real. It’s a few dozen lines of code, and it turns a red build back into something that means what it says.