Chapter 86, Lesson 2

The honeycomb shape for a Next.js SaaS

The testing honeycomb, an integration-centered way to shape a test suite that concentrates effort at the seams where a Next.js SaaS actually breaks.

The runner is wired and green on an empty suite. The team is about to write its first hundred tests, and the choice it makes now decides whether those tests earn their keep or quietly rot. The tempting move is the obvious one: open a “testing best practices” article, find the test pyramid sitting at the top of it, and start cranking out unit tests because the diagram says the base should be wide. Six months later you have four hundred green tests, a bug in production that none of them caught, and a team that has learned to merge through a red suite because the suite is usually wrong about what matters.

The way to avoid that future is to stop asking “what’s the right test shape” and start asking “where does this kind of system actually break”, then let the answer pick the shape. That reframe is the whole lesson, and it has a name we’ll keep coming back to: shape follows the bug. By the end you’ll be able to look at any piece of this SaaS, whether a validator, a Server Action, a button, or the Checkout flow, and say which test layer earns the test, whether it earns one at all, and why. This is the same discipline you’ve already met twice: in the chapter on TanStack Query and Zustand, where a tool stays off until a named threshold pushes past the platform default, and in the chapter on Row-Level Security, where the database layer waits for a real reason before you reach for it. Here that discipline comes back at the level of the whole suite.

The four bug layers of a Next.js SaaS

Before we can talk about shape, we need to talk about where the bugs live, and that conversation only works if it’s about your codebase, not a generic chart. So forget testing for a moment. Where can a bug actually hide in a Next.js 16 SaaS like the one you’ve been building?

There are four places, and you’ve already written code in every one of them.

The first is pure logic in /lib: your Zod validators, your data mappers, the RFC 9457 error-code mapper, the redactor that strips secrets from logs, the Temporal codecs that turn an Instant into a database string and back. This code is deterministic: same input, same output, no database, no network, no framework. A bug here is a wrong return value.

The second is the seams, the places where your code stops being a pure function and starts talking to the outside world. These are your Server Actions wrapped in authedAction, your route handlers wrapped in authedRoute, your webhook receivers, your Drizzle query helpers, and your rate limiter safeLimit. A bug here isn’t a wrong number; it’s a query that returns another tenant’s data, or a webhook that trusts an unsigned body.

The third is components, the presentational and interactive UI rendered into trees that the framework mostly owns. A bug here is a button that doesn’t disable while submitting, or a date picker that returns the wrong range.

The fourth is the end-to-end money paths: sign-in, Stripe Checkout, accepting an invitation. These aren’t single functions; they’re multi-step flows that cross the entire stack, and when one breaks, it doesn’t return a wrong value, it costs real money.

[Figure: the four places a bug can live in a Next.js SaaS, drawn as equal bands; sizing comes later. From top to bottom: E2E money paths (sign-in, Stripe Checkout, invitation accept); Components (buttons, forms, the date-range picker); The seams (authedAction, authedRoute, webhook receiver, Drizzle helpers, safeLimit); Pure logic in /lib (Zod validators, error-code mapper, redactor, Temporal codecs).]

Hold onto two phrases from that list: the /lib surface and the seams. We’ll lean on both for the rest of the unit. The question the whole lesson answers is now sharp: given these four layers, where do you concentrate your tests?

Why the test pyramid is the wrong default here

The internet has an answer ready for you, and it’s the most-repeated piece of testing advice there is: the test pyramid. Many unit tests at the wide base, fewer integration tests in the middle, a tiny cap of end-to-end tests at the top. You’ll meet this diagram everywhere, so it deserves a fair hearing rather than a dismissal.

The pyramid is genuinely correct, but only for the right system. Picture a banking back-end, a billing engine, or a physics simulation: software where most of the behavior is computation you fully own, deep and intricate and framework-independent. Think interest accrual, tax rules, a pricing model with forty edge cases. In that world the bugs live inside the logic, the logic is pure, and pure logic is exactly what unit tests are best at. Hundreds of fast unit tests at the base genuinely catch most of the bugs, because most of the system genuinely is unit-testable. The pyramid isn’t wrong; it’s a faithful map of where bugs land in a deep-logic system.

A Next.js SaaS is not that system, and this is the pivot worth sitting with. Walk through one of your own features end to end, say creating an invoice. What’s the actual business logic you wrote? A Zod schema to validate the form, a mapper to shape the row, a Drizzle insert, and a cache tag to bust. That’s it. The logic is shallow: a validator, a mapper, a query, a write. Everything around it that feels like “the app”, rendering the page, routing the request, plumbing the Server Action, caching the result, streaming the response, you didn’t write. The framework owns the orchestration. The depth in your application doesn’t live in a logic core, because there barely is one; it lives at the boundaries, where your thin slices of logic meet Postgres, Better Auth, Stripe, and Resend.

Now apply the pyramid to that system anyway. You follow the diagram, you make the base wide, and you end up with a developer writing this kind of thing:

it('formats 1000 cents as $10.00', () => {
  expect(formatMoney(1000)).toBe('$10.00');
});

it('formats 2500 cents as $25.00', () => {
  expect(formatMoney(2500)).toBe('$25.00');
});

it('formats 0 cents as $0.00', () => {
  expect(formatMoney(0)).toBe('$0.00');
});

Ten variations of formatMoney, two hundred and fifty green tests, ninety percent line coverage. And shipped that same week: a Drizzle query that forgot its orgId filter, quietly serving one customer’s invoices to another. There were zero tests at the layer that would have caught it, because the pyramid pointed all the effort at the base while the bug sat at a seam the base never touches. The effort went where the diagram said the bugs were; the bug went where the diagram wasn’t looking.

That’s the failure in one line: the shape of your suite should track the shape of your bug density. The pyramid encodes a deep-logic bug density, but this architecture’s bug density is boundary-heavy. Use the wrong map and you do real work that finds nothing.
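To make that mismatch concrete, here is a minimal sketch of the bug's shape. It uses an in-memory array as a stand-in for the invoices table, which is an assumption made so the example runs on its own; the helper names are hypothetical, and a real test at this layer would exercise the actual Drizzle helper against a test Postgres.

```typescript
// In-memory stand-in for an invoices table (assumption for this sketch only).
type Invoice = { id: string; orgId: string; totalCents: number };

const invoices: Invoice[] = [
  { id: "inv_1", orgId: "org_a", totalCents: 1000 },
  { id: "inv_2", orgId: "org_b", totalCents: 2500 },
];

// Buggy helper (hypothetical): looks up by id only, so any tenant can read any invoice.
function getInvoiceUnscoped(id: string): Invoice | undefined {
  return invoices.find((row) => row.id === id);
}

// Fixed helper (hypothetical): the tenant filter is part of the lookup itself.
function getInvoiceForOrg(id: string, orgId: string): Invoice | undefined {
  return invoices.find((row) => row.id === id && row.orgId === orgId);
}

// The seam-level question: org_a asks for org_b's invoice.
const leakedRow = getInvoiceUnscoped("inv_2");
const scopedRow = getInvoiceForOrg("inv_2", "org_a");

console.log(leakedRow?.orgId); // "org_b": the unscoped helper hands back another tenant's row
console.log(scopedRow); // undefined: the scoped helper refuses
```

No number of formatMoney assertions touches either helper. Only a test that asks for one tenant's row while acting as another can tell the two apart.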

A team has 250 passing unit tests and 90% line coverage. This week a webhook receiver shipped that calls JSON.parse on the request body before it checks the HMAC signature — and it sailed into production unnoticed. What does this most likely tell you?

The team has been careless about coverage and should push it from 90% toward 100% before shipping again.

Their tests are concentrated in the wrong place — the layer that finds this class of bug is barely tested.

Webhook receivers run outside your control, so there’s no realistic way to catch this before production.

The pyramid let them down because no architecture should ever organise its tests as a pyramid.

The honeycomb, and why it fits

If the pyramid is the wrong shape, what’s the right one? There’s a named answer, and it comes from exactly the kind of system that looks like yours.

In 2018 Spotify’s engineering team published the testing honeycomb, later catalogued in Martin Fowler’s writing on testing shapes, for a world of microservices where most of what any single service does is talk to other systems. Their figure puts the center of gravity not at the base but in the middle: integration tests are the widest band, with thin layers above and below. The reasoning is the same one we just walked through: when most of your behavior is interaction with external systems, the tests with the highest value per test are the ones that exercise those interactions.

One honest caveat about the source. Spotify’s literal figure is a three-band hexagon, drawn for microservices. What this course teaches is a four-band adaptation of it for a Next.js SaaS, unit, integration, component, and E2E, because a SaaS has a real UI layer that a backend microservice doesn’t. So when we say “the honeycomb,” read it as “the honeycomb shape, adapted for this stack,” not a claim that Spotify drew four bands. The borrowed idea is the center of gravity, not the exact silhouette.

Why does a 2026 Next.js SaaS fit the mold so cleanly? Run down the stack: Server Components and Server Actions are orchestrated by the framework, not by you. The database is external, Postgres reached through Drizzle. Stripe is external. Resend is external. Auth is a library sitting at a boundary. Almost every interesting thing your application does is an interaction with one of those, which means the tests worth writing live at the seams where those interactions happen. The shape isn’t a preference; it falls straight out of where the system’s behavior actually is.

It helps to place the honeycomb next to its two neighbors, because you’ll hear all three names and you should be able to tell them apart.

The pyramid you already know: right for a deep, framework-independent logic core, wrong here.

The testing trophy, from Kent C. Dodds, comes with a memorable slogan, borrowed from a Guillermo Rauch tweet: “Write tests. Not too many. Mostly integration.” Notice that last word: the trophy agrees with the honeycomb that integration is the center of gravity. A common misreading is that the trophy is “the one with a fat component layer.” It isn’t; its emphasis is integration too. What distinguishes it is a visible static base, your TypeScript types and your linter, counted as the first line of defense under the unit layer, plus a framing aimed at the JavaScript front-end. The course pins to the honeycomb name rather than the trophy, not because the trophy is wrong, but because this SaaS’s logic and seams live on the server, where the honeycomb’s microservice heritage fits more snugly than the trophy’s client-app framing. Two shapes, the same instinct about integration, slightly different home turf.

The honeycomb, then, is the senior pick for this stack: integration-centered, because the bugs are boundary-centered.

[Figure: the test pyramid (wrong fit) shown against the honeycomb (this SaaS), with bands labeled E2E, Integration, and Unit. Pyramid caption: optimizes for a deep logic core, where most bugs sit inside pure functions.]

One guardrail before we make this concrete, because it’s the single most common way people misread the honeycomb. The shape names where tests live, not how many of each. It’s a location heuristic, not a quota. A year-one SaaS might genuinely ship two hundred unit tests, eighty integration tests, zero component tests, and four end-to-end tests, and still be a perfect honeycomb, because the band weights follow the codebase rather than a target on a chart. A year-three version of the same app might see the integration count overtake the unit count as the seam surface grows. The shape tells you which layer earns a given test. How many tests each layer ends up with is a separate question, and it’s the one the next lesson takes apart under the name “coverage.”

What lives in each band

The shape is decided. Now make it actionable: for each band, which artifacts in your codebase belong there. This is the part you’ll come back to when you’re staring at a file wondering what to write.

Unit, the wide base. Every file in /lib ships a test: your Zod schemas, the RFC 9457 error-code mapper, the Temporal codecs, the redactor, every pure data transform. Add the type-level tests for the moves from the start of the course: a narrowing that has to hold, a branded ID that mustn’t accept a raw string, a discriminated union that must stay exhaustive. These tests are cheap to write, cheap to run, and need no fixtures, no database, no mocks. They’re the base for a reason: pure logic is where unit tests earn the most per line. The depth of this band, factories, determinism, and the unhappy path, is the next chapter’s job.
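As a rough sketch of what this band looks like in practice, assume a hypothetical redactor and a hypothetical branded InvoiceId. The names, the key list, the "[REDACTED]" placeholder, and the "inv_" prefix are all assumptions for illustration, not the course's actual /lib code. The checks use plain throws so the block runs without a test runner.

```typescript
// Hypothetical /lib redactor: masks the values of sensitive keys before logging.
const SENSITIVE_KEYS = ["password", "token", "authorization"];

function redactSecrets(entry: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(entry)) {
    out[key] = SENSITIVE_KEYS.includes(key.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}

// Hypothetical branded ID: a raw string is not assignable without the parser.
type InvoiceId = string & { readonly __brand: "InvoiceId" };

function parseInvoiceId(raw: string): InvoiceId {
  if (!/^inv_[a-z0-9]+$/.test(raw)) {
    throw new Error(`not an invoice id: ${raw}`);
  }
  return raw as InvoiceId;
}

// Type-level check: uncommenting the next line should fail to compile.
// const bad: InvoiceId = "inv_123";

// Table-driven unit checks: same input, same output, no fixtures, no mocks.
const redactCases: Array<[Record<string, unknown>, Record<string, unknown>]> = [
  [{ userId: "u_1" }, { userId: "u_1" }], // happy path: untouched
  [{ Token: "sk_live_abc" }, { Token: "[REDACTED]" }], // casing must not matter
];
for (const [input, expected] of redactCases) {
  if (JSON.stringify(redactSecrets(input)) !== JSON.stringify(expected)) {
    throw new Error(`redactSecrets failed for ${JSON.stringify(input)}`);
  }
}
```

The commented-out assignment is the type-level half: it costs nothing at runtime and fails the build if the brand ever stops rejecting raw strings.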

Integration, the center of gravity. This is where the honeycomb spends its weight, and you already have its catalog. In the chapter on fail-closed error discipline you walked six seams where your code meets the outside world. That list of seams is your integration-test list:

authedAction, the Server Action wrapper.
authedRoute, the route-handler wrapper.
requireOrgUser, the page-level access gate.
The webhook receiver.
safeLimit, the rate limiter.
The error.tsx boundaries.

Each seam earns coverage on the two branches that actually decide its behavior: its fail-closed branch (does it correctly refuse when it should?) and its message-split branch (does it return the right thing on each side of the decision?). On top of the six, every Drizzle query helper gets tested against a real test Postgres, with each test’s writes rolled back so the database stays clean, and any outbound HTTP call gets stubbed at the network boundary. The tools for that real-DB lifecycle and the network stubbing arrive in the chapter on integration testing; here, just register that this is where the weight goes.
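The fail-closed branch is easier to picture with a sketch. What follows is a simplified, synchronous stand-in for the wrapper, named authedActionSketch to make clear it is not the course's authedAction: the real wrapper is async and reads the session from the request, while here the session lookup is injected so the branch can run without a server. The Session and Result shapes and the messages are assumptions.

```typescript
type Session = { userId: string; orgId: string; role: "member" | "admin" };

type Result<T> =
  | { ok: true; value: T }
  | { ok: false; status: 401 | 403; message: string };

// Hypothetical stand-in: refuses unless a session exists and the role is high enough.
function authedActionSketch<I, O>(
  getSession: () => Session | null,
  minRole: Session["role"],
  handler: (ctx: { orgId: string; input: I }) => O,
): (input: I) => Result<O> {
  return (input) => {
    let session: Session | null;
    try {
      session = getSession();
    } catch {
      // Fail closed: a broken session lookup is treated as "not signed in".
      session = null;
    }
    if (session === null) {
      return { ok: false, status: 401, message: "Sign in to continue." };
    }
    if (minRole === "admin" && session.role !== "admin") {
      return { ok: false, status: 403, message: "You do not have access." };
    }
    return { ok: true, value: handler({ orgId: session.orgId, input }) };
  };
}

// The seam check: a member calls an admin-only action.
let handlerCalls = 0;
const archiveAsMember = authedActionSketch<{ id: string }, string>(
  () => ({ userId: "u_1", orgId: "org_a", role: "member" }),
  "admin",
  ({ input }) => {
    handlerCalls += 1;
    return input.id;
  },
);
const memberResult = archiveAsMember({ id: "inv_1" });

console.log(memberResult.ok); // false: the wrapper refused
console.log(handlerCalls); // 0: the handler never ran
```

The second assertion matters as much as the first: a wrapper that returns 403 after running the handler has still performed the write.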

One claim in this band trips up almost everyone, so let’s make it concrete:

export const archiveInvoice = authedAction(async ({ orgId, input }) => {
  const { id } = archiveInvoiceSchema.parse(input);
  const invoice = await db.transaction(async (tx) => {
    const row = await archiveInvoiceRow(tx, { id, orgId });
    await logAudit(tx, { event: 'invoice.archived', invoiceId: id });
    return row;
  });
  return ok(invoice);
});

Three lines in that action each cross a different boundary: the Zod .parse validates untrusted input, the Drizzle helper (archiveInvoiceRow) writes to Postgres, and the audit insert (logAudit) records the event. A Server Action reads the session, parses the input, calls Drizzle, writes the audit log, and returns a Result. None of that is pure, and none of it is unit-testable in isolation, because there’s no standalone pure function to call. So testing it is testing the seam, by definition: the test has to read a session, parse the input, hit the database, and check the audit row, exercising the whole path. (If a non-trivial chunk of logic hides inside the action, extracting it into a pure function and unit-testing that is good practice, but the action itself still needs its seam test.) Webhook receivers are the same story: integration tests, never unit tests.

Component, thin. A component earns a test only when a named trigger is met, and that trigger, plus React Testing Library, is the subject of its own chapter. For now, treat this as a thin band: conditional, never the default. We’ll name the triggers precisely in a moment.

End-to-end, thinner. This band covers the handful of paths where failure costs real money: sign-in, Checkout, invitation accept, your primary value loop. Playwright and the trigger for reaching for it come later. The course holds one firm convention here, worth stating now: by year one you ship zero or four end-to-end tests, nothing in between. A half-built E2E suite flakes, and a flaky suite teaches the team to ignore red, which destroys the signal you built it for. Either commit to covering the money paths properly or don’t start.

Now drill the core skill of this lesson. Below are concrete pieces of the SaaS. Sort each into the layer that earns its test, and notice the bucket most people forget exists.

The buckets:

Unit: pure logic in /lib
Integration: the seams
Component: only if a trigger is met
E2E: money paths
No test: the framework's job, or behaviourless UI

The pieces to sort:

A Zod schema’s .refine() rule

The RFC 9457 error-code mapper

A Temporal Instant → string codec

authedAction returning 403 when the role is below admin

The webhook receiver rejecting an unsigned body

A Drizzle query helper that must filter by orgId

safeLimit failing open on a Redis-auth error

A shared, complex stateful date-range picker

The Stripe Checkout flow

Sign-in

A presentational <Card> with no state

A page that just calls requireOrgUser() and renders a list

That “No test” bucket is the hardest lesson in this section, and we’ll come back to give it its own home. First, let’s make the integration band’s dominance feel earned rather than asserted.

The bug-density argument

You now know the shape and what fills each band. What you might not yet feel is why the integration band is so wide. Asserting “the bugs are at the seams” is one thing; recognizing the specific bugs is another. So here are the canonical seam bugs of a Next.js SaaS, every one drawn from material you’ve already worked through, and every one a real production incident waiting to happen:

The cross-tenant query that forgot its orgId filter, serving one customer’s data to another.
The Server Action that skipped authedAction entirely, with no session check, no role check, wide open.
The webhook receiver that parsed the body as JSON before verifying the HMAC signature, trusting input it hadn’t authenticated.
The cache tag that didn’t match its read tag, so a write left stale data on the screen.
The rate limiter that swallowed a Redis throw and let the request proceed when it should have failed closed.

Read that list back and notice the pattern: not one of them shows up in a unit test of a pure function. There is no pure function to call. Each one only surfaces when you run the real code path against a real test database with a real auth fixture, which is to say, in an integration test. The honeycomb’s wide middle isn’t a stylistic choice. It’s the suite positioned directly over the place the bugs land.
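Taking the webhook bug from that list, a minimal sketch of the correct ordering might look like the following. It uses Node's crypto module; the receiver shape, the hex-encoded signature, and the shared-secret scheme are assumptions for illustration, since real providers define their own signing formats.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type WebhookResult =
  | { ok: true; event: unknown }
  | { ok: false; status: 400 | 401 };

// Hypothetical helper used by the sender side and by tests to sign a body.
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Hypothetical receiver core: verify the signature first, parse second.
function receiveWebhook(
  rawBody: string,
  signatureHex: string,
  secret: string,
): WebhookResult {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (
    received.length !== expected.length ||
    !timingSafeEqual(received, expected)
  ) {
    return { ok: false, status: 401 };
  }

  // Only an authenticated body is ever handed to JSON.parse.
  try {
    return { ok: true, event: JSON.parse(rawBody) };
  } catch {
    return { ok: false, status: 400 };
  }
}

// An unsigned, malformed body must be refused as unauthenticated (401),
// not as unparseable (400). That difference is what proves the ordering.
const unsignedJunk = receiveWebhook("{not json", "", "whsec_test");
console.log(unsignedJunk.ok); // false
```

A receiver that parses before verifying would answer the same request with a parse error, or worse, act on the payload. The test that distinguishes the two has to send a real request body through the real receiver, which is what makes it an integration test.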

Which gives us the rule in its final, durable form, the one sentence to carry out of this lesson: shape follows the bug, so put the suite’s weight where this system’s bugs actually land.

There’s a second axis worth naming, because it’s what turns “shape follows the bug” from a slogan into an optimization: cost. Each band costs a different amount to write and to run, and the honeycomb is the shape that catches the most bugs per unit of effort for this codebase’s bug distribution.

| Band | Cost per test | Bugs caught here |
| --- | --- | --- |
| Unit | very low | modest |
| Integration | moderate | highest |
| Component | moderate-high | modest |
| E2E | very high | few, but costly |

Value per test, by band. Integration sits at the sweet spot for this codebase's bug density: moderate cost, the highest bug yield.

To put numbers on the intuition: a unit test costs a minute or less to write and milliseconds to run, with no fixtures. An integration test costs minutes to write, since there are fixtures, a database to set up, and network calls to stub, then tens of milliseconds to run against the real DB. A component test costs minutes to write (DOM queries, async events) and hundreds of milliseconds to run under the jsdom overhead. An end-to-end test costs tens of minutes to write, seconds to run, carries real browser overhead, and brings flake risk. The honeycomb puts the bulk of its weight on the band whose cost is moderate but whose bug yield is highest, and keeps the expensive end-to-end band thin, reserved for the paths where a missed bug is measured in lost revenue rather than a stack trace.

When component and E2E tests earn their weight

The two thin bands need a rule of their own, because “thin” doesn’t mean “occasionally, on a hunch.” It means conditional: off by default, switched on only by a named trigger. This is the discipline you’ve already internalized: in the TanStack Query and Zustand chapter, a tool stays off until a threshold pushes past what the platform gives you for free; in the Row-Level Security chapter, the database-level control waits for a real reason. Same pattern, now at the suite level. The default for any piece of UI or any flow is “the unit or integration test already covers the logic underneath it.” Component and E2E tests have to earn their place against that default.

For component tests, three triggers justify the cost, and the chapter on component testing owns the depth of each: a piece from your shared component library that many callers depend on; a component with genuinely complex internal state; or a critical UX path where a silent break is unacceptable. Without one of those, the behavior under the component is already covered at the seam or the unit level, and a component test would only re-test what you’ve tested or re-test the framework.

For end-to-end tests, the trigger is a single sharp question: does failure cost money? Sign-in, Checkout, invitation accept, the primary value loop. The bar is not “is this user-facing”, because almost everything is user-facing. The bar is “would a silent break here lose revenue or lock users out.” That’s why some legitimate 2026 SaaS ship no end-to-end tests in year one and are right to: nothing in their early surface clears that bar yet.

The fastest way to internalize this is to walk the questions in the order a senior actually asks them. Work through the decision below for a piece of UI or a flow you have in mind. The questions cut from the most expensive verdict downward, which is the order that keeps you from over-testing.

[Interactive decision walk: Which test layer does this earn?]

Notice what the walk trains: you ask about money first, triggers second, and only then about where the behavior lives. That ordering is the transferable skill, far more durable than memorizing which band any single artifact lands in.
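That ordering can be written down as a small function. This is a sketch under stated assumptions: the Piece fields and the layerFor name are invented for illustration, and the function returns only the most expensive layer a piece earns, on the understanding that the unit and integration tests underneath it still cover its logic.

```typescript
type Layer = "e2e" | "component" | "integration" | "unit" | "none";

// Hypothetical description of the piece under consideration.
type Piece = {
  failureCostsMoney: boolean; // sign-in, Checkout, invitation accept, primary value loop
  isSharedLibraryComponent: boolean; // many callers depend on it
  hasComplexInternalState: boolean;
  isCriticalUxPath: boolean; // a silent break is unacceptable
  behaviorLivesIn: "lib" | "seam" | "framework-or-none";
};

function layerFor(piece: Piece): Layer {
  // 1. Money first: the most expensive verdict is asked before anything else.
  if (piece.failureCostsMoney) return "e2e";

  // 2. Triggers second: any one of the three component triggers is enough.
  if (
    piece.isSharedLibraryComponent ||
    piece.hasComplexInternalState ||
    piece.isCriticalUxPath
  ) {
    return "component";
  }

  // 3. Only then: where does the behavior actually live?
  switch (piece.behaviorLivesIn) {
    case "seam":
      return "integration";
    case "lib":
      return "unit";
    case "framework-or-none":
      return "none";
  }
}

const statelessCard: Piece = {
  failureCostsMoney: false,
  isSharedLibraryComponent: false,
  hasComplexInternalState: false,
  isCriticalUxPath: false,
  behaviorLivesIn: "framework-or-none",
};

console.log(layerFor(statelessCard)); // "none"
console.log(layerFor({ ...statelessCard, failureCostsMoney: true })); // "e2e"
```

Reordering the three checks changes the answers, which is the point: asking "where does the behavior live" first would never reach the no-test verdict for a stateless card that happens to sit on a busy page.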

What does not get a test

The walk ended on a verdict that deserves its own section, because it’s the one beginners get wrong most often: no test. Over-testing the wrong things is as much a failure as under-testing the right ones: it burns time, slows the suite, and produces tests that break on every refactor without ever catching a bug. Two categories earn no automated test, and naming them is as important as naming what does.

The first is the framework’s surface. Take a page that calls requireOrgUser() and renders a list. The routing, the server rendering, and the caching are Next.js’s job, and Vercel ships the Next.js test suite so you don’t have to. You test the data-fetching helper (unit or integration) and, if a trigger fires, the contract of what it renders (a component test), but you never write a test against <Link>, <Image>, redirect(), notFound(), or App Router segment behavior. Your tests stop at the framework boundary. Crossing it means re-testing code you didn’t write and can’t fix.

The second is UI plumbing with no behavior. A presentational component with no state, a <Card> that takes props and renders them, earns no test, because there’s no behavior to assert and nothing to break that a glance at the page wouldn’t catch. The narrow exception is a snapshot test, which pays off only when the snapshot captures a contract a caller genuinely depends on: the HTML of an email template, or the exact shape of an RFC 9457 response body. Snapshot every <Card> and you get a suite that demands a new snapshot every other PR, which means it’s testing implementation rather than behavior, and the team will start updating snapshots blind. (The depth of snapshots belongs to later chapters; the principle is what matters here.)
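For the contract case, a snapshot can be as plain as an exact-string comparison. The sketch below assumes a hypothetical error-code mapper producing RFC 9457 problem details; the error codes, the example.com type URIs, and the function name are illustrative, not the course's mapper. The members used (type, title, status, detail) are ones the RFC defines.

```typescript
// Subset of RFC 9457 problem details members used in this sketch.
type ProblemDetails = {
  type: string;
  title: string;
  status: number;
  detail?: string;
};

// Hypothetical error codes and type URIs.
type ErrorCode = "not_found" | "forbidden" | "rate_limited";

const PROBLEMS: Record<ErrorCode, ProblemDetails> = {
  not_found: {
    type: "https://example.com/problems/not-found",
    title: "Not Found",
    status: 404,
  },
  forbidden: {
    type: "https://example.com/problems/forbidden",
    title: "Forbidden",
    status: 403,
  },
  rate_limited: {
    type: "https://example.com/problems/rate-limited",
    title: "Too Many Requests",
    status: 429,
  },
};

function toProblem(code: ErrorCode): ProblemDetails {
  // Copy so callers cannot mutate the shared table.
  return { ...PROBLEMS[code] };
}

// The "snapshot": the exact body a client of this API is allowed to rely on.
const FORBIDDEN_SNAPSHOT =
  '{"type":"https://example.com/problems/forbidden","title":"Forbidden","status":403}';

if (JSON.stringify(toProblem("forbidden")) !== FORBIDDEN_SNAPSHOT) {
  throw new Error("forbidden problem body changed: a caller-facing contract broke");
}
```

This kind of snapshot changes only when the contract changes, so a failure is a signal worth reading. A snapshot of a presentational component changes whenever the markup does, which is why it gets updated without being read.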

Which lands the bar for the whole lesson, stated plainly: “we have tests” is not the bar. “Do the tests fail on the bugs that ship” is the bar. A green suite that misses every seam bug isn’t safety; it’s theatre, and worse than no suite, because it manufactures confidence you haven’t earned. That same thread, a passing suite that proves nothing, is where the next lesson picks up, under the name of coverage.

A few statements to consolidate the model before we close:

Each claim is about where tests belong in a Next.js SaaS. Mark each statement True or False.

A presentational <Card> with no state should get a snapshot test.

False. There’s no behaviour to assert. Snapshots pay off only on a contract a caller depends on — an email template, an RFC 9457 response body — not on every presentational component. Snapshot every <Card> and the suite demands a new snapshot every other PR, testing implementation instead of behaviour.

A Server Action is an integration test, not a unit test.

True. It reads the session, parses the input, hits Drizzle, and writes the audit log — it isn’t pure, so testing it means exercising the whole seam. (Extracting a non-trivial inner pure function and unit-testing that is fine, but the action itself still needs its seam test.)

100 passing unit tests mean the seams are safe.

False. Unit tests exercise pure functions only. The seams — cross-tenant queries, webhook signature verification, fail-closed branches — surface only in integration tests run against a real test database with a real auth fixture.

Testing that <Link> navigates correctly is the app’s responsibility.

False. Routing is the framework’s surface. Vercel ships the Next.js test suite, so your tests stop at the framework boundary — never <Link>, <Image>, redirect(), or notFound().

Where this leaves us

The shape is decided, and the rest of the testing chapters fill it in. The next chapter builds the wide unit base over /lib: factories, determinism, type-level tests, the unhappy path. The chapter after that builds the integration center of gravity seam by seam, with the real test database, transaction rollback, and network stubbing that make those tests trustworthy. Then component tests arrive with their trigger, and end-to-end tests arrive with theirs, both conditional, both earning their place against the default. The whole stretch closes on a project: a layered test suite for the Stripe Checkout money path, unit through end-to-end.

The very next lesson sharpens the thread this one ended on. We said “we have tests” isn’t the bar, and that the suite has to fail on the bugs that ship. Coverage is the instrument people reach for to check that, and it’s the most misread number in testing. Next we’ll read it the way an experienced engineer does: as a diagnostic, not a target.

External resources

On the Diverse And Fantastical Shapes of Testing

martinfowler.com

Martin Fowler's catalog of the pyramid, the honeycomb, and the trophy: the canonical reference for the shapes compared in this lesson.

Testing of Microservices

engineering.atspotify.com

Spotify Engineering's original honeycomb post: the source this lesson adapts, and the clearest argument for an integration-heavy center of gravity.

The Testing Trophy and Testing Classifications

kentcdodds.com

Kent C. Dodds on the trophy: the neighbour shape that also centres integration, plus his definitions of unit vs integration vs E2E.

Pyramid or Crab? Find a testing strategy that fits

web.dev

web.dev compares pyramid, diamond, honeycomb, and trophy side by side and argues the shape should follow your architecture: the same thesis as this lesson.