Chapter 90, Lesson 1

The money-path filter

The money-path filter is a decision gate that keeps Playwright end-to-end tests off by default on a 2026 Next.js SaaS and reserves them for the few paths where failure moves money, breaks identity, or loses unrecoverable data.

A feature ships. Somebody opens a pull request, and at the bottom of the diff there’s a new file under tests/e2e/. Nobody decided it needed one. It went in by reflex, the test pyramid absorbed from a hundred blog posts: new flow, so write an end-to-end test that clicks through it in a real browser.

Six months later, the suite has eighty end-to-end tests and CI takes twenty-five minutes. Every Tuesday morning a third of them go red for no reason anyone can reproduce, and the team has quietly learned the workaround: hit re-run until it’s green. The part worth sitting with is that the bug which actually shipped to production, the one that cost a weekend and an apology email to customers, wasn’t covered by any of the eighty. And the wasted minutes aren’t even the real cost. Retry-until-green has trained the team to stop trusting a red build, so the suite that was supposed to be the safety net is now the thing everyone routes around.

This lesson exists to prevent that failure mode, and the fix is to invert the reflex. For a Next.js 16 SaaS in 2026, the end-to-end suite is off by default. You don’t write a Playwright test because a flow exists. You write one only when the flow crosses a single specific gate, the money-path filter. For everything outside that gate, the experienced move is to not write the test, or to delete the one that’s already there.

That framing should feel familiar. It’s the same shape you met with TanStack Query and Zustand: the platform gives you a default, and you reach for the heavier tool only when the default’s limit is crossed. You met it again one chapter ago with React Testing Library, off by default until a named trigger fires. This is the third instance of an idea you already own, “trigger before tool”, and this time it’s aimed at the top of the honeycomb. Recall that suite shape from earlier in this unit: a wide unit base, the center of gravity at the seams where the integration tests live, a thin component band above that, and the thinnest band of all at the very top, gated by a trigger. That top band is end-to-end, and this whole lesson lives inside it.

So here is the one sentence to carry out of this lesson, the thesis everything else hangs from:

Playwright is a money-path tool, not a coverage tool.

You’ll write no Playwright code here. The next lesson covers the config and the mechanics, and the one after walks the handful of paths that actually qualify. What you’ll leave with is a five-second gate you run before you create a test file, a gate whose answer is almost always to not create it.

What the money-path filter is

The filter is a single question you ask of a candidate path, and it’s easiest to remember as three failure classes. A path earns the seconds-per-test it costs only when failing on that path means one of exactly three things. Take them one at a time, because the concrete consequence is what makes each one stick.

One: money moves wrong. A charge fires twice. A plan downgrade silently fails and the customer keeps paying for a tier they cancelled. A refund never reaches the card. A user pays and doesn’t get the plan, or gets the plan without paying. Every one of these is a line item on someone’s statement and a support ticket with your company’s name on it.

Two: identity breaks. Sign-in goes down in production, and every paying customer is locked out of the product they pay for. That’s not a degraded experience, it’s a closed door. A subtler version: a session that doesn’t survive a redirect, so the user signs in, gets bounced through a third party, and lands back logged out.

Three: unrecoverable data is lost or exposed. An invitation grants access to the wrong organization. In a multi-tenant app, that’s one customer reading another customer’s data, the worst sentence in a breach report. An export silently omits records the user is relying on. A delete the user can’t undo fires on the wrong row.

That’s the whole filter: money, identity, unrecoverable data. Commit those three words to memory, because they’re the gate. Everything else stays at the seam with an integration test, at the component with React Testing Library, or off the test menu entirely. And “everything else” is the overwhelming majority of what your app does.

Notice the direction the filter runs. It isn’t a checklist for finding more paths to test; it’s the opposite. Its entire job is subtractive, to keep paths out of the end-to-end suite. Almost everything you point it at fails it, and failing it is the normal, correct outcome. That’s the discipline.

The cost shape that makes E2E expensive

“Off by default” isn’t a slogan. It falls straight out of the math, and there are two terms in that math: what a test costs, and how often it flakes. This section installs the first one. Once you’ve felt the cost shape, the default stops sounding like an opinion and starts sounding like arithmetic.

You met the cost ladder earlier in this unit, so stretch it up one rung. A unit test over a /lib helper runs in about five milliseconds. An integration test that hits a real Postgres and rolls the transaction back runs in twenty to eighty. A component test, booting jsdom and rendering a tree, runs in a hundred to three hundred. A Playwright test spins up a real browser, navigates real pages, and waits on a real network, so it runs in two to ten seconds, often more under CI’s thinner resources. That’s three orders of magnitude over the unit test: not a little slower, a thousand times slower.

The diagram below puts the four tiers on one axis so the gap is something you see rather than something you read. The bars aren’t drawn to scale, because a true-to-life unit-test bar would be a single invisible pixel next to the E2E bar, so the real runtime is printed on each one. Let the long amber bar at the bottom be the thing your eye lands on.

Unit (a /lib helper): ~5 ms
Integration (real Postgres, rolled back): 20–80 ms
Component (rendered in jsdom): 100–300 ms
End-to-end (a real browser, via Playwright): 2–10 s, ≈1000× the unit test

The cost of a test by tier, with the runtime printed on each bar (the bars are not to scale). An end-to-end test is the most expensive test you can write, by orders of magnitude.

A single slow test doesn’t hurt. The damage is at the suite level, and one number governs the whole discipline: the count. A thirty-test Playwright suite, run in parallel, costs roughly two minutes in CI, a price worth paying on every pull request. A two-hundred-test suite is a twenty-minute bottleneck that the team starts skipping in order to ship, and a test suite you skip is no test suite at all. The same way React Testing Library disciplined your component count one band down, the money-path filter disciplines your end-to-end count up here. And because each test up here costs tens of times more than a component test (seconds against a few hundred milliseconds), the ceiling on the count is correspondingly lower.
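The suite-level arithmetic can be sketched in a few lines. The per-test figures below are midpoints of the ranges on the cost ladder, and the worker counts are assumptions picked for illustration. The model also ignores build time, browser startup, and retries, so a real CI run will come in slower than these numbers.

```typescript
// Rough per-test runtimes in milliseconds: midpoints of the ladder's ranges (an assumption).
const tierMs = { unit: 5, integration: 50, component: 200, e2e: 6000 } as const;

// Wall-clock estimate: tests are split evenly across workers and run back to back.
function suiteSeconds(testCount: number, perTestMs: number, workers: number): number {
  const testsPerWorker = Math.ceil(testCount / workers);
  return (testsPerWorker * perTestMs) / 1000;
}

console.log(suiteSeconds(200, tierMs.unit, 1)); // 1    (two hundred unit tests, one second)
console.log(suiteSeconds(30, tierMs.e2e, 2));   // 90   (thirty E2E tests on two workers)
console.log(suiteSeconds(200, tierMs.e2e, 1));  // 1200 (two hundred E2E tests, twenty minutes serial)
```

The same count that is free at the unit tier is a twenty-minute wait at the top, which is why the count is the number to watch.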

The flake budget is structurally larger

Cost is only half of why the bar is so high. The other half is flake, and flake is the half that ambushes people the first time they live with an end-to-end suite. A flaky test passes and fails on the same code, with nothing changed between runs.

End-to-end flakes more than anything beneath it on the ladder, and here’s why. A unit test is one function and its inputs, deterministic by construction. Climb the ladder and each rung bolts on another source of real-world nondeterminism: a real database, a real network round-trip, a real third-party UI like Stripe’s hosted checkout, real browser timing. By the time you’re at the top, all of those compound into a flake surface the lower tiers structurally don’t have. The more moving parts you don’t control, the more ways the same test can land differently twice in a row.
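A back-of-the-envelope model shows how quickly that compounding bites. The reliabilities below are made-up numbers, not measurements, and the model assumes every source fails independently, which real systems do not guarantee. Treat it as a sketch of the shape, not a prediction.

```typescript
// Chance a single test passes when every nondeterministic source must behave.
// Assumes the sources fail independently, which is a simplification.
function testPassProbability(sourceReliabilities: number[]): number {
  return sourceReliabilities.reduce((acc, r) => acc * r, 1);
}

// Chance the whole suite is green in one run, assuming identical, independent tests.
function suiteGreenProbability(perTestPass: number, testCount: number): number {
  return Math.pow(perTestPass, testCount);
}

// A unit test has no nondeterministic sources, so it always passes in this model.
console.log(testPassProbability([])); // 1

// Hypothetical reliabilities: real database, network round-trip, third-party UI, browser timing.
const e2ePass = testPassProbability([0.999, 0.998, 0.995, 0.998]);
console.log(e2ePass.toFixed(3));                            // "0.990"
console.log(suiteGreenProbability(e2ePass, 80).toFixed(2)); // "0.45"
```

Under these assumed numbers, each test is 99% reliable and an eighty-test suite is still red more often than green.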

This is where a specific anti-pattern is born, so name it and refuse it now: reaching for `retries: 3` to make CI green. It feels like a fix, but it is the opposite of one. A test that only passes on the third attempt is telling you something true: there’s a race in your app or your test, a window where the result depends on which thing finishes first. Cranking up the retry count doesn’t close that window. It mutes the alarm and throws away the signal. The race is still there, and your users will find it.
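The muting effect is easy to put a number on. The sketch below assumes a hypothetical race that makes each attempt fail 30% of the time, independently of the other attempts; the 30% figure is invented for illustration. In Playwright, a retries setting of 3 gives a failing test up to three extra attempts, so four in total.

```typescript
// Chance a test ends up green when a race makes each attempt fail with probability raceRate.
// With `retries: n`, a failing test gets up to n extra attempts (n + 1 in total).
// Assumes attempts are independent, which is a simplification.
function greenWithRetries(raceRate: number, retries: number): number {
  return 1 - Math.pow(raceRate, retries + 1);
}

console.log(greenWithRetries(0.3, 0).toFixed(4)); // "0.7000"
console.log(greenWithRetries(0.3, 3).toFixed(4)); // "0.9919"
```

CI goes from red three runs in ten to green more than ninety-nine runs in a hundred, while the race itself fires exactly as often as before.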

The 2026 stance is a principle you can state in one line: flake gets a structural fix, not a retry bump. That means better locators that find an element the way a user would instead of by a brittle CSS path, assertions that automatically wait for the right state instead of a fixed sleep, and seed data that’s deterministic run to run. How you do each of those is the next lesson’s job, and the config knobs for retries and traces live there too. What you install here is the rule: when an end-to-end test flakes, you fix the cause rather than paper over it.

What only E2E can catch, and what it can’t

By now you might be over-correcting toward “end-to-end is useless.” It isn’t. It does exactly one thing nothing cheaper can, and that one thing is worth real seconds on the paths that qualify. The skill is holding both halves at once: the narrow thing it’s uniquely good at, and the wide set of things people wrongly reach for it to do.

Here’s the unique thing. Every cheaper test sees a piece of your system in isolation: one function, one Server Action against the database, one component rendered in a fake DOM. An end-to-end test sees the whole thing composed, routing into middleware into a Server Action into the database into the third party, then the rendered HTML coming back, the JavaScript hydrating it, and the cookie surviving the next navigation. Some bugs don’t live in any single piece. They live in the seams between the pieces: a redirect that loops, a session that doesn’t survive the round-trip out to Stripe and back, a webhook that lands a beat before the UI re-fetches so the user stares at a stale screen. No isolated test can see a composition bug, because the bug isn’t in any of the parts it isolates. Only a browser driving the assembled app sees it.

So the positive rule is sharp: composition is the only justification for an end-to-end test. If the bug you’re worried about isn’t a composition bug, a cheaper test already owns it, and owns it faster and more reliably.

The trap is the mirror image: reaching for end-to-end to catch a bug that isn’t about composition. That’s precisely how the slow, flaky suite from the opening gets built, one well-intentioned test at a time. The two tabs below put the line side by side. Read them as two non-overlapping jobs, not a preference.

Only E2E sees this
A cheaper test already owns this

These are composition bugs. No isolated test reaches them, because the bug is in how the pieces fit together:

A redirect that loops: sign-in sends you to the dashboard, which sends you back to sign-in.
A session that doesn’t survive the round-trip out to Stripe Checkout and back.
A webhook that flips the plan in the database a beat after the UI has already re-fetched, so the user sees the old plan.
The cookie set on sign-in that silently fails to carry across a cross-page navigation.

The bug lives in the composition of the full stack, so only a real browser driving the assembled app can see it.

Everything in the left tab is a bug you can only observe by running the whole machine. Everything in the right tab is a bug that lives inside one layer, where a test ten to a thousand times cheaper is already looking, and looking without the flake surface a browser drags in. That’s the dividing line, and it’s not a matter of taste. It’s a hard fact about what each kind of test can physically reach.
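The redirect loop is a useful case to see in miniature. The sketch below is a plain TypeScript simulation, not Next.js or Playwright code: the `Visit` shape, the two guards, and the `followRedirects` helper are all hypothetical names invented for illustration. Each guard is defensible on its own and would pass a test written against it in isolation. The loop only appears when a visitor carrying a stale cookie is routed through both.

```typescript
// Hypothetical visitor state: a cookie can be present but no longer valid.
type Visit = { hasSessionCookie: boolean; sessionValid: boolean };
type Outcome = { kind: "render" } | { kind: "redirect"; to: string };

// Sign-in page: anyone who already has a session cookie is sent to the dashboard.
function signInPage(v: Visit): Outcome {
  return v.hasSessionCookie ? { kind: "redirect", to: "/dashboard" } : { kind: "render" };
}

// Dashboard guard: anyone without a valid session is sent to sign-in.
function dashboardGuard(v: Visit): Outcome {
  return v.sessionValid ? { kind: "render" } : { kind: "redirect", to: "/sign-in" };
}

const routes: Record<string, (v: Visit) => Outcome> = {
  "/sign-in": signInPage,
  "/dashboard": dashboardGuard,
};

// Follow redirects the way a browser would, flagging a loop when a path repeats.
function followRedirects(start: string, v: Visit, maxHops = 10): { path: string; looped: boolean } {
  const seen = new Set<string>();
  let path = start;
  for (let hop = 0; hop < maxHops; hop++) {
    if (seen.has(path)) return { path, looped: true };
    seen.add(path);
    const handler = routes[path];
    if (!handler) return { path, looped: false };
    const outcome = handler(v);
    if (outcome.kind === "render") return { path, looped: false };
    path = outcome.to;
  }
  return { path, looped: true };
}

// Valid session: sign-in forwards to the dashboard, which renders.
console.log(followRedirects("/sign-in", { hasSessionCookie: true, sessionValid: true }).looped);  // false
// Stale cookie: sign-in forwards to the dashboard, which bounces back to sign-in.
console.log(followRedirects("/sign-in", { hasSessionCookie: true, sessionValid: false }).looped); // true
```

Neither guard contains the bug. It exists only in how the two disagree about what counts as signed in, which is the kind of failure a test of either piece alone never exercises.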

The 20–30 path shape

“Off by default” does not mean “never write one.” It means the bar is high, and a real app clears it a knowable number of times. Sizing that number gives you something to calibrate against, and a way to notice when you’ve drifted.

A mid-stage SaaS has under thirty money paths in total. Treat thirty as a ceiling you’d be surprised to reach, not a target to fill.

The first ten or so are universal, the ones every SaaS on this stack has. Sign-in, with whichever methods you offer. Sign-out. The checkout redirect out to Stripe. The return from Stripe and the plan flip the user sees. Invitation acceptance with its seat grant. Password reset. And the one primary “create-the-thing-customers-pay-for” path your product is built around. The next ten are app-specific: the CSV export the customer bought the higher plan to get, the report an auditor needs, whatever else sits directly on the revenue.

Past thirty, something has gone wrong. You’ve drifted off money paths and into coverage-chasing, testing flows because they exist rather than because failure costs money. That’s the reflex from the opening creeping back in through the side door.

The lesson after next walks the four canonical paths in this course’s app in real detail: sign-in, the Stripe checkout round-trip, invitation acceptance, and the primary value loop. Here, just hold the shape: a short list, capped low, with every entry sitting on money.

Year-one zero is the correct default

Now the honest part, and it’s permission more than advice. A small team shipping fast on this stack is correct to ship its entire first year with zero Playwright tests, given the disciplined integration suite you built last chapter and a production observability surface you’ll wire up later in the course. Not behind. Not cutting corners. Correct.

Sit with that, because it runs against every instinct the test pyramid trained into you. The risk is still covered, just not by end-to-end tests. The integration suite catches the bugs at the seam, which is where they cluster. Production observability, meaning error tracking and alerting, catches the unknown-unknowns, the failures nobody thought to write a test for. And a human clicking through the app before a release catches the obvious rest. That stack covers a young SaaS’s risk honestly, at a runtime cost the team can actually afford while it’s still finding product-market fit.

The trigger language you just learned isn’t a backlog you’re behind on. It tells the team when to start, not that they should already have started. The day the Stripe checkout path ships, or the day sign-in becomes the only door to a paid product, a money-path trigger fires, and then you reach for Playwright, for that path, with intent.

This is the same stance the previous chapter took on component tests, which is no coincidence: it’s one coherent testing philosophy applied band by band, top to bottom. The natural trajectory is year-one zero, year-two a small handful. A team starting end-to-end in year two reaches first for sign-in and Stripe checkout, the two highest-stakes money paths, and adds the others as it outgrows verifying them by hand every release. The four-path catalog walked two lessons from now is the destination, not the day-one count. So when an experienced engineer reviews a teammate’s first Playwright pull request, the note is almost always the same: fewer tests, better chosen, not more.

Why a production build, not `next dev`

One concrete rule earns its own section, because it’s a specific, high-stakes mistake and it’s the single place a line of code helps you here. When you do reach for Playwright, it must drive a production build of your app, never the dev server.

The reason is that `next dev` runs different code than the app your users get. Dev mode skips static optimization, injects dev-only error overlays, behaves differently in middleware, and serves unminified, un-bundled hydration. None of those dev-mode behaviors ship in production. So a test that passes against `next dev` is asserting on output your users will never see, and the gap between the two is exactly where the bug hides. A test that’s green against the dev server and red against the real build, or the reverse, isn’t a test problem. It’s the bug shipping, with your test cheering it on.

So from the very first end-to-end test, Playwright builds and starts the production app and drives that. One field in Playwright’s config wires it up. You’ll meet the full config next lesson; this is just the line that matters, so the rule has a shape in your head:

webServer: {
  command: 'pnpm build && pnpm start', // production build, never `next dev`
  url: 'http://localhost:3000',
},

That command is the whole point of the fragment: it builds, then starts, the same two commands you’d run to deploy. Everything else about the config waits for the next lesson.

The gate before you write the test

Now the pieces fuse together. The filter (money, identity, data), the cheaper-test check, the can-Playwright-drive-it check, and the determinism check combine into one short, ordered procedure you run before your fingers reach the keyboard. The order is the design. You ask the cheap disqualifiers first, so that most candidates stop on question one or two and never reach the expensive judgment call. That’s why this takes five seconds and not five minutes.

Is this a money path under the filter? Failure means money moves wrong, identity breaks, or unrecoverable data is lost or exposed. Stop if no, which is the overwhelmingly common answer.
Do an integration test and a component test already compose to catch the same bug? Composition is the only justification, so if the cheaper layers already cover it between them, the browser adds cost and flake rather than coverage. Stop if yes.
Can Playwright actually drive the third party this path crosses? Think Stripe Checkout in test mode, or an OAuth provider with a test account. Stop if no, and write the seam test to assert the contract instead.
Will the test be deterministic without sleeps? Stop if no, and fix the app or the test first, because a test that needs a sleep is hiding a race.
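The ordering is easier to see written as early returns. This is a sketch of the procedure above as a plain function, not a tool the course ships: the `Candidate` fields, the `Verdict` shape, and the `moneyPathGate` name are all hypothetical, and the answers fed into it are still judgment calls a person makes.

```typescript
// Hypothetical answers to the four questions; field names are illustrative.
type Candidate = {
  name: string;
  isMoneyPath: boolean;                  // Q1: money, identity, or unrecoverable data at stake
  cheaperTestsCatchIt: boolean;          // Q2: integration + component tests already catch the bug
  playwrightCanDriveThirdParty: boolean; // Q3: e.g. Stripe test mode, an OAuth test account
  deterministicWithoutSleeps: boolean;   // Q4: no fixed sleeps needed
};

type Verdict = { write: boolean; stoppedAt: number | null; reason: string };

// Cheap disqualifiers come first, so most candidates never reach the later questions.
function moneyPathGate(c: Candidate): Verdict {
  if (!c.isMoneyPath) return { write: false, stoppedAt: 1, reason: "not a money path" };
  if (c.cheaperTestsCatchIt) return { write: false, stoppedAt: 2, reason: "cheaper tests already own it" };
  if (!c.playwrightCanDriveThirdParty) return { write: false, stoppedAt: 3, reason: "write a seam test for the contract" };
  if (!c.deterministicWithoutSleeps) return { write: false, stoppedAt: 4, reason: "fix the race first" };
  return { write: true, stoppedAt: null, reason: "money path that needs the full composition" };
}

const validationBranch: Candidate = {
  name: "Empty-email validation error",
  isMoneyPath: false,
  cheaperTestsCatchIt: true,
  playwrightCanDriveThirdParty: true,
  deterministicWithoutSleeps: true,
};
const stripeRoundTrip: Candidate = {
  name: "Stripe Checkout redirect and plan flip",
  isMoneyPath: true,
  cheaperTestsCatchIt: false,
  playwrightCanDriveThirdParty: true,
  deterministicWithoutSleeps: true,
};

console.log(moneyPathGate(validationBranch).stoppedAt); // 1
console.log(moneyPathGate(stripeRoundTrip).write);      // true
```

The validation branch is turned away at the first question and never touches the other three, which is the common case.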

Reading the list isn’t the same as running it. Walk the gate for a real candidate: answer each question in order and watch where it drops you. Most paths stop you cold long before the last question, and that is the lesson: the gate’s job is to turn you away.

Notice how rarely you reach the last node. That’s the gate working exactly as intended. And it really is a five-second mental gate, not a form you fill out: it runs in the time it takes to start typing the test file’s name. Internalize the order and you’ll find yourself turning candidates away without consciously listing the questions.

Sort the paths: which earn a Playwright test?

Reading the filter and running it are different skills, and the second is the one that matters in review. Each item below is a path you’d plausibly meet in a 2026 SaaS. Some are clean money paths that need the whole composition, and some are tempting picks that already have a cheaper home. Drop each one into its bucket.

Each item is a path in a typical 2026 SaaS. Does it cross the money-path filter and need the full composition, or does a cheaper layer already own it? Sort each item into one of the two buckets.

Write a Playwright test: a money path, catchable only in the full composition.

Don’t, because a cheaper layer owns it: no money-path trigger, or a cheaper test already covers it.

Sign-in to a paid dashboard with email and password

The Stripe Checkout redirect and the plan flip the user sees on return

Accepting an org invitation and landing in the right org with the right role

Create an invoice, the recipient pays, the invoice flips to paid in the UI

A Zod validation branch — the error that shows on an empty email field

A Server Action writing a new row to Postgres

Webhook signature verification rejecting an unsigned payload

The settings page rendering the signed-in user’s name

The marketing landing page rendering for a logged-out visitor

A button exposing the right accessible name to a screen reader

If a “skip” item tempted you, look at where the bug would live. A validation branch, a database write, a signature check, an accessible name: each is inside a single layer, and each has a faster, steadier test already watching it. The four “reach” items are the ones where every layer has to align for the customer to get what they paid for. Those are the money paths.

Run the judgments one more time

Each claim is about when a Playwright test earns its runtime cost on a 2026 Next.js SaaS. Mark each statement True or False.

For a 2026 Next.js SaaS, end-to-end tests are on by default, and you delete the ones that don’t earn their weight.

False. That’s inverted. They’re off by default: you add one only when the money-path filter fires, rather than subtracting from a default-on suite.

A small SaaS with disciplined integration tests and production observability can correctly ship its first year with zero Playwright tests.

True. Year-one zero is the honest default. The integration suite covers the seam, observability covers the unknowns, and the trigger language tells the team when to start, not that they’re behind.

Running end-to-end tests against `next dev` is fine, and it’s faster than building the production app.

False. Dev mode runs different code paths. A pass against `next dev` can be a fail against the real build, and the mismatch is exactly where the bug ships. Playwright must drive a production build.

When an end-to-end test flakes intermittently, bumping retries to 3 to keep CI green is the right fix.

False. It hides a real race. The flaky test is telling you something true about your app, and a retry bump mutes the alarm. Flake gets a structural fix: better locators, auto-waiting assertions, deterministic data.

The only thing that justifies an end-to-end test over cheaper tests is a bug that lives in the composition of the full stack.

True. If an integration test and a component test compose to catch it, the browser adds cost and flake, not coverage. Composition is the one thing only a browser-driven test can see.

External resources

The canonical reference under this whole chapter is Playwright’s own best-practices guide, so read it before the mechanics arrive next lesson. The other two below sharpen this lesson’s central claim: a test strategy that places end-to-end at the thin top, aimed only at the few paths that matter.

Playwright — Best Practices

playwright.dev

Its 'test user-visible behavior' and 'avoid testing third-party dependencies' headings back this lesson directly — the reference the rest of the chapter builds on.

Pyramid or Crab? Find a testing strategy that fits

web.dev

web.dev on choosing a test strategy by context — places E2E at the thin top and argues for it 'only for the most critical test cases,' the exact shape of this filter.

Testing Trophy and Testing Classifications

kentcdodds.com

The fewer-higher-leverage-tests thinking this lesson specializes to the top band — why the E2E layer is the smallest one.