Flake has a structural cause

Diagnose and fix flaky tests in your Vitest integration suite by treating every flake as a hidden input that sorts into one of two buckets, a state leak or an order dependency, each with its own structural fix.

You’ve built the whole integration suite across this chapter: rollback per test, a database per worker, the signedInAs fixture, MSW at the wire, webhook receivers, and Server Actions through their full wrapper. Now you ship it, and a week later you hit the single most expensive bug a test suite can produce.

A test goes red on a pull request. You open it, and the failure has nothing to do with your change. It’s a createInvoice test in a file you never touched. You run it again. Green. You re-run the CI job. Green. The Slack message writes itself: anyone else seeing CI flake? And your hand is already moving toward the Re-run failed jobs button.

Stop there, because that button is the mistake. That test is not unreliable. Your suite has leaked state from one test into the next, or it has an order dependency, or it reads a clock you didn’t fake. Every one of those has a structural cause and a structural fix. Flake is not bad luck. It is determinism you haven’t found yet.

That reframe is the whole lesson. By the end you’ll be able to name the cause of any flake and fix the structure, instead of re-running and hoping. The lesson is short because you already own almost every fix. Rollback, resetHandlers, useRealTimers, and mock reset were each a flake fix the moment you learned them. Here those scattered rules collapse into one idea.

Before any technique, it’s worth seeing why this matters, since the fixes here are mostly one-liners you already know. What’s expensive isn’t the flaky test itself. It’s the habit a team grows around tolerating it.

Do the arithmetic, small and concrete. A team merges 200 pull requests a week. The suite flakes 5% of the time, so one run in twenty goes red for no real reason. That’s ten spurious red builds a week. Each one costs CI minutes on the re-run, but that’s the cheap part. The expensive part is a developer pulled out of their own work to investigate a failure that was never their bug, and then, run after run, learning to glance at red and ignore it.

That last move is the real damage. Once “just re-run it” is the team reflex, the suite stops being a signal. Red no longer means something broke; it means roll the dice again. Now a real regression, a genuine bug your test correctly caught, hides behind exactly the same red the team has trained itself to dismiss. Tolerated flake doesn’t cost you the flaky test. It costs you every other test’s credibility.

So hold onto the sentence the rest of this lesson hangs off: every flake has a named cause and a structural fix. The work is never “make it less flaky.” It’s “find the cause, fix the structure, prove it’s gone.”

“Intermittent” is a symptom, not a diagnosis

The most important habit to break is stopping at the symptom. When a developer says “the test is flaky,” they’ve named the symptom and gone no further. “Intermittent,” “non-deterministic,” and “sometimes it fails” are not causes. They’re the thing you’re trying to explain, dressed up as if they were the explanation.

So refuse “intermittent” as a root cause and ask one question instead: what does the failing run inherit from a run before it, or depend on, that the passing run doesn’t? Given identical inputs, code is deterministic: run it a thousand times, get the same answer a thousand times. So if a test’s result changes between runs, an input changed that you didn’t think of as an input. Flake is just an input you haven’t found yet, and your job is to find it.
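
To make “an input you didn’t think of as an input” concrete, here is a minimal sketch. Every name in it is hypothetical and none of it comes from the course codebase; it only shows how module-scope state acts as an extra argument that no signature mentions.

```typescript
// Hypothetical names throughout; this is an illustration, not course code.

// Module-scope state: an input to every call that no signature mentions.
let lastInvoiceNumber = 0;

function nextInvoiceNumber(): number {
  lastInvoiceNumber += 1;
  return lastInvoiceNumber;
}

// The same "test" body, run under two different histories.
function testFirstInvoiceIsNumberOne(): boolean {
  return nextInvoiceNumber() === 1;
}

const passesAlone = testFirstInvoiceIsNumberOne(); // nothing ran before it
const passesOnSecondRun = testFirstInvoiceIsNumberOne(); // the counter has already advanced
```

The test body is identical both times. What changed is the counter it inherited, and that counter is the hidden input.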

That hidden input always lands in one of two buckets. These two are the spine of the entire lesson, and once you have them the rest is detail.

The first bucket is a state leak. A test leaves a mutation behind, such as an un-rolled-back row, a stacked MSW handler, a mock implementation, fake timers still engaged, or a value pushed into a shared array, and the next test runs against that dirty world. The tell is unmistakable: the test passes alone and fails in the suite. Run it by itself and there’s nothing to inherit, so it’s green; run it after its noisy neighbor and it inherits the mess.

The second bucket is order or nondeterminism. The test passes only because of when it ran: the order other tests ran in, the actual wall-clock time, or a random value. The tell here is different: it passes in one order and fails in another, or it passes today and fails at midnight.

This split is worth learning before anything else because the shape of the fix follows directly from the bucket. The two buckets call for different kinds of fix:

  • A state leak is fixed by isolation or reset, placed structurally in a fixture or an afterEach, never in the test body. Put it in the test body and the next person who writes a test forgets it, and you’re back where you started. Isolation is the goal, and it belongs to the setup, not the test.
  • An order dependency is fixed by removing the dependency: make every test self-contained, and seam the clock, IDs, and randomness so nothing is nondeterministic. Then prove it gone by deliberately scrambling the run order and watching it stay green.

So the diagnostic order is always the same: first decide which bucket, then let the fix follow. The diagram below makes that decision visual, with one symptom forking into two causes, each pointing at its own shape of fix.

Symptom
  • passes alone, fails in the suite → State leak (a test leaves a mutation the next test inherits) → Isolate or reset structurally, in a fixture or afterEach
  • passes one order, fails another → Order / nondeterminism (the test depends on run order, real time, or a random value) → Remove the dependency: seam it, then prove with --sequence.shuffle.files
First decide which bucket the symptom points to; the shape of the fix follows from the bucket, not from the test.

Here is the taxonomy. Read it as a reference table you scan when a real test goes yellow, not a list to memorize. You’ve already met almost all of these, and the fix for each was taught earlier in this chapter or the last, so each entry is just cause, fix, and where the mechanics live. The point isn’t the nine items. It’s that they sort cleanly into the two buckets, and the bucket tells you the fix shape.

The two cards below are that sort. Scan the left card when a test “passes alone, fails in the suite,” and the right card when it “passes in one order, fails in another.”

Bucket A — state leaks

Tell: passes alone, fails in the suite. Fix shape: isolate or reset, structurally.

  • DB-state leak: a factory or signedInAs called outside withRollback commits real rows every later test sees. Fix: run every test body inside withRollback, where the tx rollback is the isolation.
  • MSW handler leak: a per-test server.use(...) override survives into the next test. Fix: server.resetHandlers() in afterEach, wired once in the integration setup file.
  • Mock-implementation leak: vi.mocked(auth.api.getSession).mockResolvedValue(...) set in test A still answers in test B. Fix: vi.resetAllMocks() (or a targeted mockReset) in afterEach. Setup-file mocks are not auto-reset, so you name the reset yourself.
  • Timer leak: vi.useFakeTimers() with no vi.useRealTimers() in afterEach, so fake time carries forward and a later “after 1s” test hangs to the timeout. Fix: restore real timers in afterEach.
  • Port collision: two suites, or a stray dev server, bind the same port, and whoever loses flakes. Fix: a dedicated test port (the test Postgres on 5433) and a database per worker so workers can’t contend.
  • Shared mutable module state: a top-of-file const seen = [] (or any module-scope singleton) mutated by tests. Fix: declare capture arrays inside the test body, never at module scope.
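
The “passes alone, fails in the suite” tell, and the reason the reset belongs in a shared hook, can be sketched with a toy runner. This is an assumption-heavy model: runSuite, handlers, and the test names are hypothetical, and nothing here is how Vitest or MSW is implemented.

```typescript
// Toy model only: every name here is hypothetical.
type Test = { name: string; run: () => boolean };

const defaultHandlers = ['GET /invoices -> 200'];
let handlers = [...defaultHandlers];

const tests: Test[] = [
  {
    name: 'shows an error when the API is down',
    run: () => {
      // A per-test override, playing the role of server.use(...).
      handlers.unshift('GET /invoices -> 500');
      return handlers[0].endsWith('500');
    },
  },
  {
    name: 'lists invoices',
    // Expects the default handler to be the one that answers.
    run: () => handlers[0].endsWith('200'),
  },
];

// Runs the tests in order and calls the shared hook after each one.
function runSuite(suite: Test[], afterEach: () => void): string[] {
  const failed: string[] = [];
  for (const test of suite) {
    if (!test.run()) failed.push(test.name);
    afterEach();
  }
  return failed;
}

const resetHandlers = (): void => {
  handlers = [...defaultHandlers];
};

const withoutReset = runSuite(tests, () => {}); // the override leaks into test two
resetHandlers(); // clean slate before the second run
const withReset = runSuite(tests, resetHandlers); // the hook clears it every time
```

Neither test body changed between the two runs. Only the hook did, which is why the fix lives in setup rather than in any one test.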

Bucket B — order / nondeterminism

Tell: passes one order, fails another; passes today, fails at midnight. Fix shape: remove the dependency, then prove it with --sequence.shuffle.files.

  • Order dependency: test B only passes because test A ran first and left a row, set a mock, or advanced a sequence. Fix: make every test self-contained, and surface the bug with vitest run --sequence.shuffle.files.
  • Real-time clock: code reads Date.now() or new Date() directly, so a time assertion passes by day and fails near a boundary. Fix: read time through the clock seam (lib/clock.ts) and freeze it in tests.
  • Inline randomness or unstable data: Math.random(), crypto.randomUUID(), or Date.now() used inline in production or test data, so values differ run to run. Fix: route randomness and IDs through their seams (lib/random.ts, lib/ids.ts), and assert on shape (expect.stringMatching, expect.any), never an exact, sequence-derived ID.
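
As a rough sketch of what “route it through a seam” buys you, the example below passes a clock and an ID source in as arguments. The lesson names lib/clock.ts and lib/ids.ts but doesn’t show them here, so the Clock and Ids shapes, createInvoice’s signature, and the inv_ prefix are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real seams in lib/clock.ts and lib/ids.ts may look different.
type Clock = { now: () => Date };
type Ids = { next: () => string };

type Invoice = { id: string; createdAt: string; amountCents: number };

function createInvoice(amountCents: number, deps: { clock: Clock; ids: Ids }): Invoice {
  return {
    id: deps.ids.next(),
    createdAt: deps.clock.now().toISOString(),
    amountCents,
  };
}

// Production wiring reads the real world.
const realDeps = {
  clock: { now: () => new Date() },
  ids: { next: () => `inv_${Math.floor(Math.random() * 1e9).toString(36)}` },
};

// Test wiring freezes time one second before a month boundary.
const frozenClock: Clock = { now: () => new Date('2026-01-31T23:59:59.000Z') };

const invoice = createInvoice(1250, { clock: frozenClock, ids: realDeps.ids });

// Time is now an exact assertion; the ID is asserted on shape, never on value.
const createdAtIsFrozen = invoice.createdAt === '2026-01-31T23:59:59.000Z';
const idHasExpectedShape = /^inv_[a-z0-9]+$/.test(invoice.id);
```

With the clock frozen, the time assertion gives the same answer at noon and at midnight, and the shape check on the ID survives any run order.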

Look at what those two lists actually are. Every entry in the left card is the same bug in a different disguise: a test left state behind, and the structural fix is a reset in afterEach or a fixture. Every entry in the right card is the same bug too: a hidden non-deterministic input crept in, and the structural fix is a seam. The six “reset in afterEach” rules are one rule. The three “route it through a seam” rules are one rule. The nine collapse to two.

One close cousin doesn’t fit either bucket, because it isn’t really a leak or an order bug: a forgotten await on an async assertion, covered in Async tests without the forgotten-await trap. The test finishes before the assertion runs and passes by skipping the check entirely. await expect(...).resolves and expect.assertions(n) are the guard there. It’s named here only so you don’t mistake it for flake.

That collapse is the skill. When a real test flakes, you don’t pattern-match against nine memorized cases. You ask “does it pass alone and fail in the suite, or pass in one order and fail in another?”, land in a bucket, and the fix shape is already decided. Try that first decision now, on a pile of symptoms.

You’re triaging flaky tests. For each symptom or one-line smell, decide which bucket it lands in; that first decision selects the fix shape.

Buckets:

  • State leak: passes alone, fails in the suite. Fix shape: reset or isolate.
  • Order or nondeterminism: passes one order, fails another. Fix shape: remove the dependency.

Items to sort:

  • signedInAs called before withRollback
  • server.use(...) with no afterEach reset
  • vi.useFakeTimers() and no useRealTimers
  • A top-of-file const seen = [] the tests push into
  • A leaked mockResolvedValue from an earlier test still answering
  • The test asserts row.id === 5
  • Production code calls Date.now() directly and a test asserts on time
  • Test B fails when it runs before test A
  • Math.random() inside the input factory

Everything so far was the frame and the recap. This is the part you don’t have yet: the diagnostic loop that turns “it’s flaky sometimes” into a located, reproducible cause you can fix and prove fixed. It’s four steps, and one rule sits underneath all of them: never debug a flake you can’t reproduce on demand.

Step one: quantify with repeats. “Flaky sometimes” is unactionable, but a rate is actionable. Vitest has no --repeat flag on the command line. Repetition is a per-test option you attach to the suspect, and then you run just its file.

src/server/actions/create-invoice.int.test.ts
it('creates an invoice', { repeats: 100 }, async () => {
  // ...the test body, unchanged
});

Then run that one file:

Terminal window
vitest run src/server/actions/create-invoice.int.test.ts

If it comes back 3/100 failed, you have a 3% flake rate: a number, on demand. Quantify before you investigate, for two reasons. First, the rate tells you whether you’ve actually reproduced the thing. A flake you can’t make fail in 100 runs isn’t reproduced, and you’re about to debug a ghost. Second, the rate is how you’ll know a fix worked. 100/100 green after your change is the bar, not “seems fine now,” which is just the flake hiding again. The repeats option is how you force reproduction, so the rest of the loop has something to work with.
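
The idea behind “a rate, on demand” can be sketched outside the runner. This is a standalone illustration with hypothetical names (measureFlakeRate, suspectTest); it isn’t how Vitest implements repeats, and the stand-in test fails on a fixed schedule so the result is predictable.

```typescript
// Standalone sketch; not Vitest's implementation of `repeats`.
async function measureFlakeRate(
  body: () => Promise<void>,
  runs: number,
): Promise<{ failed: number; runs: number }> {
  let failed = 0;
  for (let i = 0; i < runs; i += 1) {
    try {
      await body();
    } catch {
      failed += 1; // count the failure and keep going
    }
  }
  return { failed, runs };
}

// A stand-in for a flaky test: it throws on every 33rd call.
let calls = 0;
async function suspectTest(): Promise<void> {
  calls += 1;
  if (calls % 33 === 0) throw new Error('unexpected admin session');
}

// Resolves to "3/100 failed": calls 33, 66, and 99 throw.
const flakeReport = measureFlakeRate(suspectTest, 100).then(
  ({ failed, runs }) => `${failed}/${runs} failed`,
);
```

The same function run after a fix is the proof step: the number you want back is 0/100.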

Step two: localize order bugs with shuffle. A leak that only fires in a specific run order is invisible to repeating a single file in source order: it’ll go 100/100 green because nothing reorders. So you scramble the order on purpose. The vitest run --sequence.shuffle.files command randomizes which order the files run in, and --sequence.shuffle.tests randomizes the order within a file. A suite that’s green in source order but goes red under shuffle is an order dependency or a cross-file leak. That’s proof, not suspicion.

Terminal window
vitest run --sequence.shuffle.files

Reproducibility is the whole point of a diagnostic. When a shuffled run fails, Vitest prints the seed it used. Feed that seed back and the exact failing order replays deterministically:

Terminal window
vitest run --sequence.shuffle.files --sequence.seed 8675309
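
Why does a seed make a random order replayable? A seeded generator produces the same sequence of numbers every time, so the shuffle it drives lands on the same order. The sketch below uses a small mulberry32-style generator and a Fisher-Yates shuffle purely as an illustration; it is an assumption for teaching purposes, not the algorithm Vitest uses, and the file names are made up.

```typescript
// Illustration only; not Vitest's internal shuffle.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // a value in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the seeded generator.
function shuffleWithSeed<T>(items: readonly T[], seed: number): T[] {
  const random = seededRandom(seed);
  const out = [...items];
  for (let i = out.length - 1; i > 0; i -= 1) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const files = ['a.int.test.ts', 'b.int.test.ts', 'c.int.test.ts', 'd.int.test.ts'];
const firstRun = shuffleWithSeed(files, 8675309);
const replay = shuffleWithSeed(files, 8675309);
const sameOrder = firstRun.join() === replay.join(); // same seed, same order
```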

Now you have a failing order you can run as many times as you like while you hunt the cause. Rather than wait for a shuffle to find you by accident, turn shuffle on in your config so it runs on every CI run, or at minimum a scheduled weekly job, so an order dependency surfaces the day it’s introduced rather than three months later in someone else’s pull request.
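
Turning shuffle on in config might look like the fragment below. Treat it as a sketch under assumptions: it presumes a Vitest version whose sequence.shuffle accepts the { files, tests } object form (the same shape the --sequence.shuffle.files flag implies), and it ignores whatever project layout your real config has.

```typescript
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    sequence: {
      // Randomize file order on every run; keep test order within a file as written.
      shuffle: { files: true, tests: false },
    },
  },
});
```

Leaving the seed unset means each run gets a fresh order, and a failing run’s reported seed can be passed back with --sequence.seed to replay it.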

Step three: read the bucket off the symptom. The tool that reproduced it already narrows the bucket for you. If it reproduces under repeats in source order with no shuffle needed, it’s a state leak inside the file or genuine nondeterminism. If it’s clean alone but red only under shuffle, it’s an order dependency or a cross-file leak. Either way you’re back at the two buckets, but now you got there with proof instead of a guess.

Step four: fix structurally, then re-prove. Apply the bucket’s fix, whether that’s reset or isolate, or seam and remove the dependency, and re-run with repeats (and shuffle, if that’s how you caught it) until it’s green every time. The fix isn’t done when the test passes once. It’s done when it can’t fail under the tool that caught it.

Watch the whole loop run on one concrete test by scrubbing through it.

CI (pull request #482)
create-invoice.int.test.ts · a change that never touched this file

local (your machine)
create-invoice.int.test.ts · re-run it: green again

The symptom: red on a pull request that never touched this file; green when you run it locally. “Intermittent” is where most people stop.

create-invoice.int.test.ts
it('creates an invoice', { repeats: 100 }, async () => {
  // …the test body, unchanged
});

terminal
$ vitest run create-invoice.int.test.ts
7/100 failed · source order, no shuffle

Add { repeats: 100 } and run the file: a rate, on demand. It fires in source order with no shuffle, so it’s a state leak (bucket A).

a test, earlier in the run
vi.mocked(auth.api.getSession).mockResolvedValue(adminSession);

tests/integration/setup.ts
afterEach(() => {
  server.resetHandlers();
  // no mock reset: that's the gap
});

Name the taxon: an earlier test’s admin session bleeds in because setup-file mocks aren’t auto-reset. Mock-implementation leak.

tests/integration/setup.ts
afterEach(() => {
  server.resetHandlers();
  vi.resetAllMocks();
});

terminal
$ vitest run create-invoice.int.test.ts
100/100 · fixed, and proven

The structural fix lands in the setup file, not the test body. Re-run with repeats: 100 → 100/100. Fixed, and proven, not hoped.

Notice where the fix went in that last step: into the setup file’s afterEach, one line, for the whole project.

src/test/integration.setup.ts
afterEach(() => {
  vi.resetAllMocks();
});

That’s the entire payoff of the two-bucket model. You didn’t memorize “mock leaks need resetAllMocks.” You reproduced a rate, read “fires in source order” as bucket A, named the leak, and reached for the reset that bucket always wants, placed in the setup file where the next person can’t forget it.

One tool looks like it fixes flake but does the exact opposite, and you should leave this lesson unwilling to reach for it.

vitest run --retry=3 re-runs a failing test up to three times and reports green if any attempt passes. On its face, that’s flake-tolerance built into the runner. In reality it takes your suite’s one honest signal, this test is non-deterministic, and silences it. The flake is still there; you’ve just configured the suite to stop telling you. It’s “just re-run it” promoted to config and applied to every test, forever, automatically.

The harm is worse than hiding one flaky test, because retry hides a whole category. A real intermittent regression, a genuine race in production code that your test correctly catches one run in twenty, now passes under retry in exactly the same way a leaked mock would. You’ve configured your suite to lie about a class of real bugs. The race ships, and the test was green.
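
The arithmetic behind that claim is short. As a back-of-the-envelope sketch, assume each attempt fails independently (real flakes often don’t, so treat the numbers as rough) and take the one-in-twenty race from above with --retry=3, which allows the first run plus three retries.

```typescript
// Assumes attempts are independent, which real flakes often are not.
function chanceAllAttemptsFail(failureRate: number, retries: number): number {
  const attempts = retries + 1; // the first run plus each retry
  return failureRate ** attempts;
}

const withoutRetry = chanceAllAttemptsFail(0.05, 0); // 0.05: red one run in twenty
const withRetry3 = chanceAllAttemptsFail(0.05, 3); // 0.05^4, roughly 6 in a million
const runsPerRed = Math.round(1 / withRetry3); // 160000
```

A race that used to show up red about every twentieth run now shows up about once in 160,000 runs. The bug is exactly as likely to happen in production as before; only the suite’s ability to report it changed.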

The contrast is sharpest side by side. One of these silences the signal; the other removes the cause.

Terminal window
vitest run --retry=3

Green builds, hidden bug. The failing run is retried until one attempt passes, so the suite reports success while the flake count climbs silently underneath. You’ve muted the signal, not removed the cause. A real race ships looking exactly this green.

So the course rule is flat: --retry on test-logic flake is forbidden. The fix for a flaky test is always the structural fix from the bucket it belongs to.

There is exactly one exception. Infrastructure flake is genuinely outside your test’s determinism: the CI runner’s network blips while pulling a container image, a database container occasionally needs a second to start accepting connections, an external sandbox times out. That isn’t your code being non-deterministic; it’s the world being non-deterministic around it. A scoped retry on that boundary is legitimate: retry the container-startup step, not the test suite.

The line is sharp: retry the infrastructure, never the test logic. If a retry is what makes your code’s test pass, the retry is hiding your bug.
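
A scoped infrastructure retry might look like the sketch below: it waits for a dependency to come up before any test runs, and it never wraps a test. The helper name waitUntilReady, its options, and the fake readiness check are all hypothetical; in a real setup the check would be something like opening a connection to the test database.

```typescript
// Hypothetical helper; `check` stands in for "can I open a database connection yet?".
async function waitUntilReady(
  check: () => Promise<boolean>,
  options: { attempts: number; delayMs: number },
): Promise<number> {
  for (let attempt = 1; attempt <= options.attempts; attempt += 1) {
    if (await check()) return attempt; // ready: report how many tries it took
    if (attempt < options.attempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, options.delayMs));
    }
  }
  throw new Error(`not ready after ${options.attempts} attempts`);
}

// Stand-in for a container that accepts connections on the third probe.
let probes = 0;
const fakeDatabaseIsUp = async (): Promise<boolean> => {
  probes += 1;
  return probes >= 3;
};

// Resolves to 3: two failed probes, then a successful one.
const attemptsNeeded = waitUntilReady(fakeDatabaseIsUp, { attempts: 5, delayMs: 10 });
```

The retry is bounded and loud: if the dependency never comes up it throws, and the failure points at the infrastructure rather than at a test.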

Sometimes a flake hits main, the whole team is blocked behind red builds, and the root cause needs real investigation time you don’t have in the next hour. You need a release valve, and --retry is the wrong one because it’s permanent and silent. There’s a disciplined alternative.

Quarantine, with a leash. Skip the test visibly, so it still runs where you can see it. it.skipIf(process.env.CI) keeps it running locally while taking it out of the CI gate; alternatively, move it to an excluded *.flaky.test.ts lane that the integration project’s glob doesn’t pick up. Quarantine differs from --retry the way loud-and-finite differs from silent-and-forever: a quarantined test is visibly skipped, not quietly passing, and it carries an owner and a tracking issue right there in a comment.

src/server/actions/create-invoice.int.test.ts
// QUARANTINED 2026-06-12 — @maria — flaky under shuffle, see APP-4821.
// Re-enable once the cross-file order dependency is fixed; do not delete.
it.skipIf(process.env.CI)('creates an invoice', async () => {
  // ...
});

That comment is not decoration. A quarantine without a tracking issue is just --retry with extra steps: debt you’ve decided to forget, and a test that never comes back. The reason quarantine is acceptable and --retry isn’t comes down to that follow-up. Quarantine buys you time to do the structural fix, and it is never the fix itself.

Time to put the whole thing together. The durable skill here isn’t recalling which afterEach fixes which leak. It’s asking the questions in the right order: reproduce before you theorize, and pick the bucket before you name the specific cause. Walk a failing test from symptom to structural fix, and at each step choose the move you’d actually make.

A test is failing intermittently — walk it to the cause

That walker only covered the state-leak branch. The other branch works the same way: if the test is clean alone but red under vitest run --sequence.shuffle.files, it’s an order dependency. Make the test self-contained, and prove the fix by replaying the exact failing order with the reported seed (--sequence.shuffle.files --sequence.seed <seed>) until it’s green.

Before you go, here are three statements that catch the highest-value misconceptions. Mark each true or false.

--retry is an acceptable fix for a flaky integration test.

False. It’s forbidden for test-logic flake. --retry re-runs the failing test, up to the retry limit, and reports green if any attempt passes, so it hides the cause and silences the signal, including for real intermittent regressions. The only legitimate retry is scoped to infrastructure (container startup, CI network), never your test logic.

A test that passes alone but fails in the suite has a state leak.

True. That’s the defining tell of bucket A. Run alone, there’s nothing to inherit, so it’s green; run after a noisy neighbor, it inherits the leftover row, handler, mock, or timer. The fix is to isolate or reset, structurally, in a fixture or afterEach.

To measure a flake rate, run the test with the --repeat 100 flag.

False. There is no --repeat CLI flag in Vitest. Repetition is a per-test option, it('…', { repeats: 100 }, fn), and you run that file. The whole-suite ordering knobs (shuffle and seed) live in config under test.sequence; the per-test option is the debugging reach.

Two of these are the source of truth for the flags in this lesson, worth bookmarking, since stale tutorials still reach for a --repeat flag that never shipped and a bare --shuffle that isn’t the real form. The other two are where the two-bucket model and the “flake rate under 1%” bar come from: the canonical essay on the cause, and the team-scale evidence for why tolerating it rots a suite.

That closes the integration arc. You can build the suite, and now you can keep it honest. Every time a test goes yellow from here on, the move is the same: reproduce it, find the bucket, fix the structure, prove it’s gone. Never re-run and hope.