Chapter 86, Lesson 4

Arrange, act, assert one behavior

Your introduction to writing Vitest unit tests, structuring each one as arrange, act, assert and asserting on observable behavior rather than implementation.

You have written this test before, and it passes. Tomorrow a teammate renames an internal helper, or inlines it into the function that called it. No behavior changed, because the same inputs still produce the same outputs. But the test goes red. The author opens it, finds it asserting on the renamed helper’s call arguments, patches the assertion, reruns, green. Three weeks later, the same function forces the same dance. By the fourth time, the team draws the only conclusion the evidence supports: this test lies. They stop reading it, and they trust the ones next to it a little less too.

That test had a bug, and the bug wasn’t in the assertion you kept patching. The test was coupled to how the code did its work instead of what it did for the caller. This lesson gives you the shape that avoids that coupling. By the end you’ll write a test as Arrange / Act / Assert, covering one behavior, under a name that reads as that behavior, asserting only what the caller can observe. Earlier you saw that tests go where the bugs are; the rule here is sharper. A test is only worth writing if it fails when the bug ships and stays green every other time. Everything below answers one question: how do you stop writing the test that cried wolf?

Arrange, act, assert: the three-part shape

Start with the mechanics, because they’re concrete and a reviewer can grade them at a glance. Every test does three things in the same order, every time.

Arrange builds the inputs and fixtures the test needs. Act invokes the unit under test, once. Assert verifies what came back. The blank line between each section is not decoration: it’s the convention that lets a reviewer read the three beats without parsing the code. Three paragraphs, three jobs, always in that order.

The cleanest place to see the rhythm is a pure function, because a pure function has the simplest observable surface there is. It takes inputs and returns a value, or throws, and nothing else happens. Here is a test for mapError, the dispatch you wrote earlier in lib/error-mapping.ts that takes any thrown error and returns the Result your wrappers hand back, keyed by a stable code so a ZodError always becomes a 'validation' failure.

import { describe, it, expect } from 'vitest';
import { ZodError } from 'zod';
import { mapError } from './error-mapping';

describe('mapError', () => {
  it('maps a ZodError to a validation failure', () => {
    const error = new ZodError([]);

    const result = mapError(error);

    expect(result).toMatchObject({
      ok: false,
      error: { code: 'validation' },
    });
  });
});

The skeleton. The import line pulls describe / it / expect from 'vitest' explicitly, because the project runs with globals off: there are no ambient test functions, so every file names what it uses. describe groups the tests for one unit, and it holds one specific behavior.

Arrange. Build the one input this behavior needs: a ZodError, the schema-failure case. Nothing is asserted here yet; this section only sets the stage.

Act. Call the unit under test exactly once and capture the result. There is one Act per test: if you find yourself calling the function twice, you’re testing two things.

Assert. Check the observable outcome: the Result the caller branches on, and its code. toMatchObject asserts the fields that matter without pinning the userMessage or fieldErrors the object also carries.

Once you’ve internalized the three beats, the malformed shapes become easy to spot. An expect call sitting in the Arrange section, like expect(error.issues).toHaveLength(0) before the Act, means you’re testing your own fixture rather than the function. Two Act-and-Assert pairs in one it means two behaviors crammed into one test, so split them. And a test with no Arrange at all, where the Act reaches straight for some global, is usually hiding a missing fixture. None of these are style nitpicks. Each one is the shape telling you something about the test is off.

One behavior per test, named for the behavior

The shape says one behavior per test, which raises the question it depends on: what counts as one behavior? You can’t answer that without also answering what makes a good test name, because a good name is just the behavior stated in words. So we’ll do both at once.

A behavior is one thing the caller observes: one return shape, one branch taken, one side effect, or one thrown error. The key point is that one behavior can need several assertions. When createInvoice returns the row it just created, checking its id, its status, and its total is three expect calls describing a single behavior, “it returns the created invoice.” That’s one test. But when a function returns data and writes an audit-log row and throws on bad input, those are three different things the caller observes, and bundling them under one it gives you three tests wearing one name. To tell the cases apart, ask a question of the name: if this test fails, can a reader name the single broken behavior from the name alone? If the honest answer is “it depends which assertion failed,” you have more than one behavior.
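The distinction can be sketched in plain TypeScript. This is only an illustration under stated assumptions: createInvoice here is a hypothetical stand-in (the real one is not shown in this lesson), and a small record of named checks plays the role of Vitest’s it blocks so the snippet runs without a test runner. The first entry makes three checks that all describe one outcome; the second is a different outcome and gets its own name.

```typescript
type Invoice = { id: string; status: 'draft' | 'paid'; total: number };

// Hypothetical stand-in for createInvoice; the real one is not shown in this lesson.
function createInvoice(input: { total: number }): Invoice {
  if (input.total < 0) throw new RangeError('total must not be negative');
  return { id: 'inv_1', status: 'draft', total: input.total };
}

// Each key plays the role of an it(...) name; each function throws when its behavior is broken.
const behaviors: Record<string, () => void> = {
  // One behavior, three checks: every check describes "the created invoice".
  'returns the created invoice when the input is valid': () => {
    const input = { total: 1200 };

    const invoice = createInvoice(input);

    if (invoice.id !== 'inv_1') throw new Error('id');
    if (invoice.status !== 'draft') throw new Error('status');
    if (invoice.total !== 1200) throw new Error('total');
  },

  // A different observable outcome, so it gets its own name.
  'throws a RangeError when the total is negative': () => {
    const input = { total: -1 };

    let thrown: unknown;
    try {
      createInvoice(input);
    } catch (error) {
      thrown = error;
    }

    if (!(thrown instanceof RangeError)) throw new Error('expected a RangeError');
  },
};

// Returns only the names that failed, which is all a reader sees in a report.
function failingBehaviors(): string[] {
  return Object.entries(behaviors)
    .filter(([, check]) => {
      try {
        check();
        return false;
      } catch {
        return true;
      }
    })
    .map(([name]) => name);
}
```

If either entry fails, the name returned by failingBehaviors is enough to say which behavior broke. Had both checks lived under one name, the honest answer would have been “it depends which assertion failed.”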

Now sort these. Each item describes what a single it block asserts. Decide whether it’s one behavior, which can keep its several assertions in one test, or several behaviors that should each get their own test.

The two buckets:

One behavior: keep as a single test.
Several behaviors: split into separate tests.

The items to sort:

Asserts the returned invoice’s id, status, and total
Asserts a 403 is returned when the role is below admin
Asserts the rejected error’s code and message on a duplicate email
Asserts the function returns data and that it wrote an audit-log row
Asserts both the success result and the error thrown on bad input
Asserts the response status, then re-calls with a bad body and asserts a 400

Now the name. The course’s pattern, the same one in the code conventions, is it('<observable outcome> when <conditions>'), in the present tense. The describe carries the unit, and the it carries the one behavior. Read them together out loud and you should get a sentence a teammate can parse on the pull request without opening the source:

describe('safeLimit', () => {
  it('allows the request through when Redis is unreachable', () => { /* ... */ });
  it('blocks the request when the quota is exhausted', () => { /* ... */ });
});

“safeLimit allows the request through when Redis is unreachable” is the fail-open carve-out from the rate limiter you built, stated as a sentence. Compare it against names that say nothing: 'works', 'works correctly', 'returns the right value', 'handles the case', 'test 1'. Each of those names a test that exists, not a behavior that’s guaranteed. When one fails, the report tells you 'works' is broken, which helps no one. If you catch one of these in your own diff, the fix is mechanical: replace it with the outcome and the condition.

Test the behavior, not the implementation

Everything so far is the easy part: visible, gradable, hard to get wrong once you’ve seen it. This section is the part the opening story was really about, and it’s where tests quietly rot.

The rule, stated plainly, is to assert on what the caller observes: the return value, the thrown error, the database row that got written, the HTTP status and body. Do not assert on which private helper got called, the internal data structure mid-flight, the order of the queries the function runs, or the exact pattern a regex happens to use. A test that checks “the function called _buildQuery(args)” is bolted to the function’s internal structure, so inlining _buildQuery breaks the test even though every input still produces the identical output. A test that checks “returns rows ordered by createdAt descending” is bolted to the contract, and that same refactor leaves it green, because the contract didn’t move.

Here is the reflex that decides it for you, and it’s worth holding onto because the rest of this lesson keeps coming back to it. The black-box thought experiment: swap the implementation for a different one that satisfies the same contract, and ask whether the test still passes. If yes, you’re testing behavior. If no, you’re testing implementation. Run this in your head while deciding what to assert, not as an afterthought.

The trap is that the implementation-coupled test rarely looks wrong. It looks like something a competent developer writes on autopilot. Watch the same scenario tested two ways. The function is safeLimit, which wraps limiter.limit(key) and has to fail open when Redis is unreachable so an outage doesn’t lock every user out.

Couples to implementation

import { describe, it, expect, vi } from 'vitest';
import { safeLimit, signInLimiter } from '@/lib/rate-limit';

describe('safeLimit', () => {
  it('allows the request through when Redis is unreachable', async () => {
    const spy = vi
      .spyOn(signInLimiter, 'limit')
      .mockRejectedValue(new Error('ECONNREFUSED'));

    await safeLimit(signInLimiter, 'ip:1.2.3.4');

    expect(spy).toHaveBeenCalledWith('ip:1.2.3.4');
  });
});

This breaks the moment the function is refactored to call its collaborator differently, even when the output is byte-for-byte identical. It asserts the wiring, that limiter.limit was called with that key, not the decision the caller acts on. Route through a cache first, or change how the key is passed, and it goes red on a refactor that changed no behavior.

Asserts the contract

import { describe, it, expect, vi } from 'vitest';
import { safeLimit, signInLimiter } from '@/lib/rate-limit';

describe('safeLimit', () => {
  it('allows the request through when Redis is unreachable', async () => {
    vi.spyOn(signInLimiter, 'limit').mockRejectedValue(new Error('ECONNREFUSED'));

    const result = await safeLimit(signInLimiter, 'ip:1.2.3.4');

    expect(result).toMatchObject({ success: true });
  });
});

This survives any refactor that still produces the same decision. It asserts what the caller acts on: the request is allowed through, with success: true as the fail-open verdict. The spy still sets up the failure, which is legitimate Arrange, but the assertion is on the observable result, never on the spy.

Notice what changed and what didn’t. Both tests use a spy to arrange the failure, which is fine: you need a way to make the limiter throw on demand. The difference is the assertion. The first asserts that limiter.limit was called with a particular key, which is the function’s internal plumbing. The second asserts that the request came back allowed, which is the thing the caller depends on. Run the black-box test on each by replacing safeLimit with a rewrite that consults a local cache before touching Redis. The contract test passes, still allowed when Redis is unreachable. The implementation test fails, because limit wasn’t called the way it expected. The behavior is identical, yet only one test noticed, and it noticed the wrong thing.
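That comparison can be run as a small experiment. The sketch below is an assumption-heavy illustration, not the project’s code: the lesson doesn’t show safeLimit’s source, so both implementations here are hypothetical, and a hand-rolled recording fake stands in for vi.spyOn(...).mockRejectedValue(...) so the snippet runs without a test runner. The second implementation changes only how the key is passed to the limiter, which is one of the behavior-preserving refactors mentioned above.

```typescript
// Hypothetical shapes: the lesson does not show safeLimit's source.
type Verdict = { success: boolean };
type Limiter = { limit(key: string): Promise<Verdict> };
type SafeLimit = (limiter: Limiter, key: string) => Promise<Verdict>;

// Sketch A: pass the key straight through, fail open on any error.
const safeLimitDirect: SafeLimit = async (limiter, key) => {
  try {
    return await limiter.limit(key);
  } catch {
    return { success: true }; // fail open: an outage must not lock users out
  }
};

// Sketch B: same contract, but the key is namespaced before it reaches the limiter.
const safeLimitNamespaced: SafeLimit = async (limiter, key) => {
  try {
    return await limiter.limit(`rl:${key}`);
  } catch {
    return { success: true };
  }
};

// Hand-rolled stand-in for a rejecting spy: records every key, always rejects.
function unreachableLimiter(): Limiter & { calls: string[] } {
  const calls: string[] = [];
  return {
    calls,
    limit(key: string) {
      calls.push(key);
      return Promise.reject(new Error('ECONNREFUSED'));
    },
  };
}

// Runs both styles of assertion against one implementation.
async function compare(safeLimit: SafeLimit) {
  const limiter = unreachableLimiter();

  const result = await safeLimit(limiter, 'ip:1.2.3.4');

  return {
    // Vitest equivalent: expect(result).toMatchObject({ success: true })
    contractHolds: result.success === true,
    // Vitest equivalent: expect(spy).toHaveBeenCalledWith('ip:1.2.3.4')
    wiringMatches: limiter.calls[0] === 'ip:1.2.3.4',
  };
}
```

Under these assumptions, compare(safeLimitDirect) resolves to { contractHolds: true, wiringMatches: true } and compare(safeLimitNamespaced) resolves to { contractHolds: true, wiringMatches: false }. The caller sees the same verdict from both; only the wiring check changes its answer.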

This is the real shape of the spy smell, and it’s worth naming precisely so you don’t overcorrect. The problem is not that mocks are bad. Mocks are how you arrange a dependency you can’t trigger for real. The problem is asserting that your mock was called with the arguments you fed it: toHaveBeenCalledWith on a value you set up two lines earlier. That assertion verifies your own test setup, then reports it as if it had tested the function.

The same logic governs how you stub external dependencies, and it has a name: mock the network, not the function. When a test needs a third-party call stubbed, say the function fetches an invoice from an external service, the stub belongs at the seam where your code meets the wire, not at the function calling it. Mocking the function (“the code called fetchInvoice”) couples to a name, so renaming it to loadInvoice breaks the test. Mocking the network (“a GET /invoices/:id returns this body”) couples to the contract, and survives any rename on your side. It’s the black-box experiment applied to dependencies. The machinery for stubbing the wire, MSW, is built two chapters on, in the integration-testing chapter; here you only need the principle.
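As a rough sketch of the principle only, and not the MSW setup the later chapter builds, the stub below sits at the fetch seam. Everything named here is an assumption made for illustration: loadInvoice, the FetchLike type, the billing.example host, and passing the fetch function in as a parameter so the snippet runs on its own.

```typescript
type Invoice = { id: string; total: number };

// Minimal slice of the fetch contract this sketch relies on (an assumption, not the full Fetch API).
type FetchLike = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

// Hypothetical unit under test. Renaming it changes nothing the stub depends on.
async function loadInvoice(fetchImpl: FetchLike, id: string): Promise<Invoice> {
  const response = await fetchImpl(`https://billing.example/invoices/${id}`);
  if (response.status !== 200) throw new Error(`invoice ${id} not available`);
  return (await response.json()) as Invoice;
}

// The stub describes the wire: "GET /invoices/inv_42 returns this body".
// It knows nothing about which function on our side makes the request.
const stubFetch: FetchLike = async (url) => {
  if (url.endsWith('/invoices/inv_42')) {
    return { status: 200, json: async () => ({ id: 'inv_42', total: 1200 }) };
  }
  return { status: 404, json: async () => ({}) };
};
```

A test built on this stub asserts on what loadInvoice(stubFetch, 'inv_42') resolves to, or on the rejection for an unknown id. Rename loadInvoice to anything else and the stub is untouched, because it is keyed to the request rather than to the caller.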

What “observable” means at each layer

You now have the shape and the rule on a pure function, where “observable” means the return value. The rule doesn’t change as you move up the stack, but the shape of “observable” does. The seams you built across the app each have their own observable surface, and the whole skill is pointing the assertion at that surface and never below it.

The following table maps each layer to what a test should assert against it. Read each row as: for this kind of code, this is the surface the caller depends on, so assert there.

Layer: what the test asserts (the observable surface)

Pure function: return value; thrown error.
Async function: resolved value; rejected error; side effects on injected deps.

Integration layer
Server Action: Result; row written to the test DB; audit-log entry; error.tsx.
Route handler: HTTP status; body (RFC 9457 shape); headers.
Webhook receiver: HTTP status; processed_events; side effects on business tables.

Covered later
Component: rendered text; controls present; the effect of a click.

Every layer exposes a surface its caller depends on. Point the assertion there, never at the machinery below it.

The middle of that table covers Server Actions, route handlers, webhook receivers, plus safeLimit and error.tsx. This is the integration layer, where the bugs in a SaaS like this actually cluster, and the table tells you where to point the assertion for each of those seams. In practice each seam earns two tests, not one: the observable success path, and its fail-closed branch, which is the 403, the rejected body, or the error the framework catches. Hold to that two-tests-per-seam habit; the depth of unhappy-path testing comes later, in the next chapter.
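Here is a minimal sketch of the two-tests-per-seam habit at the route-handler row. The handler, its role check, and the plain HttpResult object standing in for a real Response are all hypothetical, chosen so the snippet runs on its own; the problem body uses the type, title, status, and detail members that RFC 9457 defines.

```typescript
type Role = 'member' | 'admin';

// Plain-object stand-in for a Response, to keep the sketch free of framework types.
type HttpResult = {
  status: number;
  headers: Record<string, string>;
  body: Record<string, unknown>;
};

// Hypothetical handler: only admins may delete an invoice.
function deleteInvoiceHandler(role: Role, invoiceId: string): HttpResult {
  if (role !== 'admin') {
    // Fail closed with an RFC 9457 problem-details body.
    return {
      status: 403,
      headers: { 'content-type': 'application/problem+json' },
      body: {
        type: 'about:blank',
        title: 'Forbidden',
        status: 403,
        detail: 'Deleting an invoice requires the admin role.',
      },
    };
  }
  return {
    status: 200,
    headers: { 'content-type': 'application/json' },
    body: { deleted: invoiceId },
  };
}

// Test 1, the observable success path.
// Vitest equivalent: expect(allowed).toMatchObject({ status: 200, body: { deleted: 'inv_1' } })
const allowed = deleteInvoiceHandler('admin', 'inv_1');

// Test 2, the fail-closed branch.
// Vitest equivalent: expect(denied).toMatchObject({ status: 403, body: { title: 'Forbidden' } })
const denied = deleteInvoiceHandler('member', 'inv_1');
```

Both checks read only the status, headers, and body, which is everything the HTTP caller can see. Neither one knows how the role check is implemented.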

One boundary on the rule deserves its own note, because it’s the most common way to overshoot: don’t test the framework. A unit test that drives Next.js’s render pipeline for a Server Component is testing Vercel’s code, not yours, and it’ll break when the framework changes under you for reasons that have nothing to do with your behavior. Your tests stop at the framework boundary. Assert the Server Action body, the data-fetching helper, and the validator, not <Link>, redirect(), notFound(), or whether page.tsx rendered. Those are the framework’s contracts to keep, and the framework’s own tests keep them.

Choosing the matcher: assertion failures are documentation

This is a small section with an outsized payoff. The test name tells you which behavior broke. The assertion failure should tell you how it broke, and that depends entirely on which matcher you reach for.

Compare two failures. The first asserts only the discriminant flag, and it fails with expected false to be true. That message is correct but useless: it tells you the flag flipped, but nothing about why.

expect(result.ok).toBe(true);

The second asserts the shape, so when it fails because status came back 'draft', the diff names that exact field.

expect(result).toMatchObject({
  ok: true,
  data: { id: expect.any(String), status: 'paid' },
});

You learn what broke from the failure alone, with no opening the source and no reaching for the debugger. That’s the whole reason to pick the matcher deliberately rather than reaching for toBe by reflex.
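To see why a shape assertion can name the field, here is a toy partial-shape comparison. It is not Vitest’s implementation of toMatchObject, only a sketch of the idea under simple assumptions: mismatches is a hypothetical helper that walks just the keys named in the expected shape, compares leaf values with ===, and reports the path of anything that differs.

```typescript
type Shape = { [key: string]: unknown };

function isShape(value: unknown): value is Shape {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Walks only the keys named in `expected`, so extra fields on `actual` are ignored.
function mismatches(actual: unknown, expected: Shape, path = ''): string[] {
  const found: string[] = [];
  for (const key of Object.keys(expected)) {
    const here = path ? `${path}.${key}` : key;
    const want = expected[key];
    const got = isShape(actual) ? actual[key] : undefined;
    if (isShape(want)) {
      // Nested shape: recurse and keep building the dotted path.
      found.push(...mismatches(got, want, here));
    } else if (got !== want) {
      found.push(`${here}: expected ${JSON.stringify(want)}, received ${JSON.stringify(got)}`);
    }
  }
  return found;
}

// A result whose status came back 'draft' instead of 'paid'.
const result = { ok: true, data: { id: 'inv_1', status: 'draft', total: 1200 } };

const report = mismatches(result, { ok: true, data: { status: 'paid' } });
// report → ['data.status: expected "paid", received "draft"']
```

The report names data.status and both values, and it ignores id and total because the expected shape never mentioned them. A check of result.ok alone would have had nothing to say about the wrong status.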

The course leans on a small, deliberate set:

toBe for primitives and identity, as in expect(total).toBe(0).
toEqual for deep value equality across a whole structure.
toMatchObject for partial-shape matching, the workhorse for Result values and database rows. Assert the fields the caller depends on, not every field the row happens to carry.
toContainEqual for “this item is somewhere in the array.”
toThrow for the error path.
expect.any(String) and expect.objectContaining(...) for fields that legitimately vary, like generated IDs and timestamps, so the test doesn’t pin a value that changes every run.

That last one is a stopgap, not the real fix. When an ID or a timestamp varies, expect.any(String) keeps the test green, but the durable answer is to make the value deterministic by pinning the clock and the ID generator, which the next chapter covers as its own topic. For now, match the varying field loosely and assert the fields that don’t.

Snapshots deserve one note, because they’re the matcher most easily misused. A snapshot is a behavior assertion only when it captures a contract the caller actually depends on, such as a rendered email template or an RFC 9457 body shape. It becomes an implementation assertion the moment it captures whatever the function happened to return today. The tell is churn: if a snapshot needs updating every other pull request, it’s pinned to implementation, and every update quietly retrains the team to hit “approve” without reading. The course uses snapshots for email-template output and RFC 9457 shapes, and essentially nowhere else.

Read the test, not the source

Here’s where all of this pays off, and the way you’ll actually use it day to day: not while writing your own tests, but while reviewing someone else’s.

The reflex is to read only the test file, source closed, and ask what the unit does, what its behaviors are, and whether you could reimplement it from these tests alone. If you can, the tests are documentation, anchored to behavior and readable by the next engineer who’s never seen the code. If the test file reads like a transcript of the implementation, naming private helpers, asserting call order, and mirroring the source line for line, the rule has slipped and you flag it. It’s the same lens you’ve been applying to coverage: ask what would have to change for this test to fail meaningfully. If the only answer is “you’d have to delete the test,” it’s not testing anything.

This is also why test names earn their keep. Vitest’s reporter lists every name. Run vitest run --reporter=verbose and the output is a behavior catalog of the unit, the thing a new engineer reads first when something breaks, before opening a single source file. Names that read as behaviors turn that report into documentation; names like 'works' turn it into noise.

So review this one. It’s a pull request adding a test file for safeLimit. Read it the way you’d review it for a teammate, source closed, asking only whether each test describes a behavior or an implementation. Leave an inline comment on every line where the shape has slipped.

src/lib/rate-limit.test.ts

import { describe, it, expect, vi } from 'vitest';
import { safeLimit, signInLimiter } from '@/lib/rate-limit';

describe('safeLimit', () => {
  it('allows the request through when Redis is unreachable', async () => {
    const spy = vi
      .spyOn(signInLimiter, 'limit')
      .mockRejectedValue(new Error('ECONNREFUSED'));

    await safeLimit(signInLimiter, 'ip:1.2.3.4');

    expect(spy).toHaveBeenCalledWith('ip:1.2.3.4');
  });

  it('works', async () => {
    const allowed = await safeLimit(signInLimiter, 'ip:9.9.9.9');
    expect(allowed.success).toBe(true);

    vi.spyOn(signInLimiter, 'limit').mockResolvedValue({ success: false } as never);
    const blocked = await safeLimit(signInLimiter, 'ip:9.9.9.9');
    expect(blocked.success).toBe(false);
  });
});

Four plants, four labels, one defect. The spy assertion couples to plumbing; the empty name describes nothing; the toBe(true) reports a flipped bit instead of a diff; the bundled it hides two behaviors under one outcome. Underneath, every one of them is a test written around the implementation instead of the behavior, a test that can’t tell a reviewer, from the file alone, what the unit promises and when it would break. Learn to see that shape once and you’ll see it everywhere, in your own diffs and everyone else’s.

External resources

The writeups that put behavior-over-implementation on the map are worth reading once, and the Vitest matcher reference is the page you’ll keep open while writing assertions.

Vitest: Expect / matchers (vitest.dev). The full matcher surface, and the page to keep open while choosing how an assertion should fail.

Testing Implementation Details (kentcdodds.com). Kent C. Dodds on why implementation-coupled tests give false negatives on refactors and false positives on real bugs.

Martin Fowler: UnitTest (martinfowler.com). The durable reference on what a unit test is: solitary vs sociable, and asserting on observable behavior.