Chapter 100, Lesson 6

Rollback rehearsal and the schema caveat

The cadence is finished. Production sits on the target schema — subtotal and tax, no total — and across three reviewed PRs you never once let the running app and the live database disagree. There is one move left, and it builds nothing: you are going to break production on purpose, roll it back, watch the rollback fail to fix the real problem, then write down exactly what you saw so the engineer who inherits this app at 2 AM does not have to discover it the hard way.

That last clause is the whole lesson. An alias re-point swaps the running code in seconds, but the database stays on whatever schema your forward-only migrations already applied. Roll the code back against a schema it no longer fits and you have traded one outage for another. The only way to make that land — the only way it ever lands — is to do it once, with your own production, while nothing is at stake. So that is what you’ll do.

The only file that changes on disk is docs/runbooks/rollback.md. Everything else is a gesture on the Vercel dashboard and the verification that it did what you expected.

You are not shipping anything new to production this time; you are rehearsing how to recover it, so the dashboard is not unfamiliar territory the first time an incident actually demands it. You will rehearse against the contract deployment deliberately, and the choice is the point: contract is the one move in the entire cadence whose schema change a rollback cannot reverse, which makes it the sharpest possible demonstration of the caveat. Promoting the previous deployment (the one that shipped right after PR 2) restores code that reads total through the dual-read coalesce fall-through. But total is gone. For the few seconds before you re-promote, production raises a column total does not exist error, and that is not a mistake to avoid; it is the proof you came for. It doubles as a live test of your observability: Sentry must catch it, exactly as the launch checklist’s Sentry row promised it would.

This casualness is affordable only because the project has no live users. In a system that had them, the same rehearsal would run inside a scheduled maintenance window against a throwaway deployment, never against the alias real traffic depends on. Keep that real-world version in mind, even though you are not exercising it here. The runbook you write is not for you; it is for the on-call engineer who arrives at 2 AM with none of today’s context, and it must hand them the one distinction that decides everything. Either they are looking at an application bug, recovered by an alias re-point plus a code-only git revert with the schema untouched, or they are looking at a schema mistake, which an alias re-point cannot undo and which is recovered only by a forward-fix migration (for instance re-adding total as a GENERATED ALWAYS AS (subtotal + tax) STORED column, which the runbook names but you never run here).
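
To make the forward-fix concrete, here is a minimal sketch that only builds the SQL text for such a migration and prints it; nothing is executed against a database. The helper name `forwardFixSql` is hypothetical, and the table name `invoices` and the `numeric` column type are assumptions taken from how the lesson refers to the columns.

```typescript
// Sketch only: builds the text of a forward-fix migration; it never runs it.
// Assumptions: the table is named `invoices` and the column type is `numeric`.
function forwardFixSql(table: string, column: string, expression: string): string {
  // GENERATED ALWAYS ... STORED asks Postgres to compute the column from the
  // other columns and persist the result on every write.
  return (
    `ALTER TABLE ${table} ADD COLUMN ${column} numeric ` +
    `GENERATED ALWAYS AS (${expression}) STORED;`
  );
}

const forwardFix = forwardFixSql("invoices", "total", "subtotal + tax");
console.log(forwardFix);
```

The identifiers are interpolated without quoting, which is acceptable for a fixed illustration but not for untrusted input.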

- docs/runbooks/rollback.md carries the four-step alias re-point gesture, the git revert follow-up section, the re-enable-auto-assignment section, and the bolded “an alias re-point does not undo a migration” caveat. (tested)
- The runbook names both recovery paths: the code-only git revert path for an application bug, and the forward-fix migration (the GENERATED ALWAYS re-add of total) for a schema mistake. (tested)
- Promoting the previous (post-PR-2) production deployment flips the alias in seconds, confirmed by curl -sI returning the PR-2 x-vercel-id and the inspector’s build-source panel showing the PR-2 commit SHA. (untested)
- With the older code live against the contract schema, production raises a column total does not exist error and Sentry receives it. (untested)
- Auto-assignment is off after the promote, so the next merge to main will not silently re-ship the contract code until it is re-enabled. (untested)
- Re-promoting the contract (PR-3) deployment restores production to the target schema and code, the inspector showing the target shape and Sentry quiet after a refresh window. (untested)
- The launch checklist’s eight rows remain green at the URL. (untested)

The test for this lesson can only reach the one artifact that lives on disk: it reads docs/runbooks/rollback.md as text and asserts its load-bearing structure. It cannot promote a deployment, read a Sentry event, or query Neon. Every live gesture is therefore yours to confirm by hand in the “Moment of truth” checks at the end of the lesson.

Run the rehearsal against the brief, then write the runbook and restore production to the target state.

Reference walkthrough
  1. Open the Vercel dashboard for the project and go to Deployments. The current production deployment is the PR-3 (contract) merge.

  2. Find the previous production deployment — the one that shipped when PR 2 merged. Open its menu and choose Promote to Production.

  3. Watch the alias swap. It completes in under thirty seconds; no rebuild runs, because the deployment already exists. From a terminal, curl -sI https://<APP_URL> and read the x-vercel-id header, then open the inspector’s deployment panel — both now point at the PR-2 commit SHA, not PR-3’s.

  4. Hit /invoices. The page errors: the PR-2 code reads through the dual-read coalesce(invoices.subtotal, invoices.total) fall-through, the Drizzle query reaches for a total column that the contract migration dropped, and Postgres answers column total does not exist. Open Sentry — the error is there, captured within seconds.

  5. Confirm auto-assignment flipped off. When you promote an older deployment by hand, Vercel stops auto-assigning the production alias to new builds so a fresh merge cannot silently overwrite your manual choice. Check it under Settings → Domains.

  6. Re-promote the PR-3 (contract) deployment the same way. The alias swaps back in seconds; /invoices recovers; the inspector’s schema-state panel shows the target shape again (subtotal/tax NOT NULL, no total); Sentry goes quiet after a refresh window. Re-enable auto-assignment once you’ve smoke-tested the restored deployment.
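
The failure in step 4 comes down to a set difference: the columns a deployment's code reads versus the columns the live schema still has. The sketch below models only that idea and queries nothing. The column lists are assumptions reconstructed from the walkthrough (PR-2 code reading subtotal and total through the coalesce, contract code reading subtotal and tax), and the `id` column is hypothetical.

```typescript
// Sketch only: models "does this code fit this schema?" as a set difference.
// It does not talk to Postgres; the column lists below are assumptions.
function missingColumns(
  columnsRead: readonly string[],
  schemaColumns: ReadonlySet<string>,
): string[] {
  return columnsRead.filter((column) => !schemaColumns.has(column));
}

// Schema after the contract migration: subtotal and tax, no total.
// The `id` column is a hypothetical placeholder.
const contractSchema = new Set(["id", "subtotal", "tax"]);

// PR-2 code still reads total through the dual-read coalesce.
const pr2Reads = ["subtotal", "total"];
// PR-3 (contract) code reads only the target columns.
const pr3Reads = ["subtotal", "tax"];

console.log(missingColumns(pr2Reads, contractSchema)); // the dropped column
console.log(missingColumns(pr3Reads, contractSchema)); // nothing missing
```

A non-empty result for the deployment you are about to promote is the signal that an alias re-point will trade one outage for another.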

The command that makes the alias swap concrete is the response-header read:

curl -sI https://<APP_URL>
# read x-vercel-id — it identifies the deployment the alias currently serves
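
If you want to record the header value in incident notes rather than read it by eye, a small parser over the text that `curl -sI` prints is enough. This is a sketch under assumptions: the function name `parseHeaders` is hypothetical, and the sample response, including the x-vercel-id value, is made up for illustration.

```typescript
// Sketch only: parses the raw text printed by `curl -sI` into a header map.
// Header names are lowercased because HTTP header names are case-insensitive.
function parseHeaders(raw: string): Map<string, string> {
  const headers = new Map<string, string>();
  for (const line of raw.split(/\r?\n/)) {
    const separator = line.indexOf(":");
    if (separator <= 0) continue; // skips the status line and blank lines
    const name = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();
    headers.set(name, value);
  }
  return headers;
}

// Made-up sample output; a real response carries more headers and a real value.
const sampleResponse = [
  "HTTP/2 200",
  "content-type: text/html; charset=utf-8",
  "x-vercel-id: iad1::example-request-id",
  "",
].join("\r\n");

console.log(parseHeaders(sampleResponse).get("x-vercel-id"));
```

Splitting on the first colon only keeps values that themselves contain colons intact.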

If you prefer the CLI to the dashboard, vercel ls --prod lists production deployments and vercel promote <deployment-url> flips the alias — the same gesture, scriptable.

docs/runbooks/rollback.md ships as a stub: the bolded caveat is already written, and three section headers sit empty under it. Your job is to fill those three sections with the gesture you just rehearsed and append the discriminator that tells an application bug apart from a schema mistake. The completed file:

docs/runbooks/rollback.md
# Rollback runbook
How to roll back a bad production deploy — and the one caveat that makes a schema
migration different from a code deploy.
## The caveat
**An alias re-point does NOT undo a forward-only migration.** Pointing the
production alias back at the previous deployment reverts the *code*, but the
database schema has already moved forward — the dropped column is gone. Rolling
back code without a compatible schema is its own outage.
## The four-step alias re-point
When a bad deploy reaches production, this restores the previous code in seconds.
1. **Identify the previous green production deployment.** Vercel dashboard →
Deployments, or `vercel ls --prod` from a terminal. You want the last build
that was healthy before the bad one.
2. **Promote it.** Open its menu → Promote to Production (UI), or
`vercel promote <deployment-url>` (CLI). The alias flips; no rebuild runs.
3. **Verify the swap.** `curl -sI https://<APP_URL>` and confirm the `x-vercel-id`
header matches the deployment you promoted; cross-check the inspector's
build-source panel (the commit SHA) and watch Sentry's error rate fall.
4. **Remember the caveat.** This restores code, not schema. If the incident was a
forward-only migration, the older code may now fail against the current schema
— plan a forward-fix migration (below) as the durable resolution.
## The git revert follow-up
The alias re-point is a stopgap: the bad commit is still the tip of `main`, and only
the paused auto-assignment keeps the next merge from re-shipping it. Make the
rollback durable by reverting the code:
1. Open a PR that reverts the bad commit with `git revert <bad-sha>` and let CI run.
2. Merge after green. Once auto-assignment is back on (next section), the next
production deploy ships the reverted code, and `main` once again matches what's live.
## Re-enabling auto-assignment
Promoting an older deployment by hand turns off automatic alias assignment so a
later merge can't silently overwrite your manual choice. Once the restored
deployment has passed a smoke test, re-enable auto-assignment (Settings → Domains)
so the next merge to `main` resumes shipping to production normally.
## Application bug vs. schema mistake
Before you reach for any of the above, decide which problem you have:
- **An application bug** — the schema is fine, the new code is wrong. Alias re-point
to roll back instantly, then a code-only `git revert` to make it durable. The
database is never touched.
- **A schema mistake** — a forward-only migration changed the shape and the change
itself was wrong. An alias re-point will NOT fix this; the old code fails against
the new schema. The durable fix is a forward-fix migration — for example, re-adding
a dropped `total` as `numeric GENERATED ALWAYS AS (subtotal + tax) STORED`. It is
expensive next to an alias re-point and cheap next to true data-loss recovery, and
it is warranted only when the contract itself was wrong.

A few decisions worth their one sentence each. The four-step gesture works from either the dashboard or the CLI (vercel ls --prod / vercel promote), because the 2 AM engineer might be on a phone, not a laptop. The verification step leans on three independent signals (x-vercel-id, the inspector’s commit SHA, Sentry’s error rate) rather than one, because a rollback you can’t confirm is a rollback you can’t trust. The git revert follow-up exists because the alias re-point alone leaves the bad commit as the tip of main, so the first merge after auto-assignment is re-enabled would re-ship it; the runbook closes that trap explicitly. And the discriminator is the section the engineer reads first under pressure: it routes them to the cheap instant fix or warns them off it before they roll code back onto a schema it no longer fits.
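
The discriminator can be written down as a tiny decision function. This is a sketch of the runbook's reasoning, not a tool the project ships: the `Incident` shape, its field names, and `chooseRecovery` are all hypothetical, and answering the two questions still takes human judgment.

```typescript
// Sketch only: encodes the runbook's two recovery paths as a decision.
type RecoveryPath = "alias-repoint-then-git-revert" | "forward-fix-migration";

interface Incident {
  // Did the bad deploy ship together with a forward-only migration?
  deployIncludedMigration: boolean;
  // Was the schema change itself wrong (as opposed to the code)?
  schemaChangeWasWrong: boolean;
}

interface Recommendation {
  path: RecoveryPath;
  warning?: string;
}

function chooseRecovery(incident: Incident): Recommendation {
  if (incident.schemaChangeWasWrong) {
    // An alias re-point cannot undo a migration, so only a forward fix helps.
    return { path: "forward-fix-migration" };
  }
  if (incident.deployIncludedMigration) {
    // The schema is fine, but older code may no longer fit it.
    return {
      path: "alias-repoint-then-git-revert",
      warning: "Older code may fail against the current schema; verify before promoting.",
    };
  }
  return { path: "alias-repoint-then-git-revert" };
}

console.log(chooseRecovery({ deployIncludedMigration: false, schemaChangeWasWrong: false }));
console.log(chooseRecovery({ deployIncludedMigration: true, schemaChangeWasWrong: true }));
```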

The two-layer rollback model — the instant Vercel alias re-point paired with the durable git revert on main, plus why auto-assignment flips off and why a rollback can’t undo a migration — is taught canonically in Two-layer rollback when prod breaks; this runbook is that lesson made specific to this project. For the git revert mechanics — opening the revert PR, why a merged commit gets reverted rather than rewritten — see Reflog, bisect & rescue.

Run the lesson’s test suite:

pnpm test:lesson 6

The suite reads docs/runbooks/rollback.md and passes when the runbook carries its load-bearing structure: the four-step alias re-point, the git revert follow-up, and the re-enable-auto-assignment sections each filled with guidance rather than a bare header; the bolded “does not undo a migration” caveat still present and still bold; and both recovery paths named — the code-only git revert for an application bug and the forward-fix GENERATED ALWAYS migration for a schema mistake. Pass or fail; that is the entire surface.
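
For intuition about what a structural check like this involves, here is a minimal sketch. It is not the lesson's actual test suite: the function names, the exact headings it looks for (copied from the completed runbook above), and the problem messages are assumptions.

```typescript
// Sketch only: a guess at the shape of a structural runbook check.
// The headings are copied from the completed runbook shown in this lesson.
const REQUIRED_SECTIONS = [
  "The four-step alias re-point",
  "The git revert follow-up",
  "Re-enabling auto-assignment",
];

// Collects the text under each second-level ("## ") heading.
function sectionBodies(markdown: string): Map<string, string> {
  const bodies = new Map<string, string>();
  let current: string | null = null;
  for (const line of markdown.split(/\r?\n/)) {
    const heading = /^##\s+(.*)$/.exec(line);
    if (heading) {
      current = (heading[1] ?? "").trim();
      bodies.set(current, "");
    } else if (current !== null) {
      bodies.set(current, (bodies.get(current) ?? "") + line + "\n");
    }
  }
  return bodies;
}

// Returns a list of problems; an empty list means the structure is present.
function checkRunbook(markdown: string): string[] {
  const problems: string[] = [];
  const bodies = sectionBodies(markdown);
  for (const title of REQUIRED_SECTIONS) {
    const body = bodies.get(title);
    if (body === undefined) problems.push(`missing section: ${title}`);
    else if (body.trim() === "") problems.push(`empty section: ${title}`);
  }
  // The caveat must still be wrapped in ** ... ** (bold).
  if (!/\*\*[^*]*does not undo[^*]*migration[^*]*\*\*/i.test(markdown)) {
    problems.push("caveat missing or not bold");
  }
  if (!markdown.includes("git revert")) problems.push("git revert path not named");
  if (!markdown.includes("GENERATED ALWAYS")) problems.push("forward-fix path not named");
  return problems;
}
```

In a real run you would pass it the file contents, for example from `fs.readFileSync("docs/runbooks/rollback.md", "utf8")`, and treat a non-empty result as a failure.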

Everything else in this lesson happened on the Vercel dashboard and against live Sentry and Neon, which no Node test can reach. Confirm those by hand:

- Promoting the post-PR-2 deployment flips the alias in seconds; curl -sI and the inspector’s build-source panel both confirm the PR-2 commit SHA is live. (untested)
- Hitting /invoices raises column total does not exist, and Sentry receives the error during the rehearsal window. (untested)
- Auto-assignment is off after the manual promote. (untested)
- Re-promoting the PR-3 deployment restores the target schema and code: the inspector’s schema-state panel shows subtotal/tax NOT NULL and no total, and Sentry goes quiet after a refresh. (untested)
- The launch checklist’s eight rows are all green at the URL; the rollback-rehearsal row that Lesson 2 deferred is now recorded, closing the checklist. (untested)

That last tick closes the project. You shipped a green repo to a live URL, evolved a destructive schema change across three reviewed PRs with the running app and the live database never once out of step, and rehearsed the recovery you hope never to need. The thing you carry out of here is not the invoices app — it’s the discipline: forward-only migrations in three deploys, every change behind a green PR, rollback as a recovery primitive you’ve already practiced, and a runbook waiting for whoever is on call when it counts.