Chapter 98, Lesson 7

Two-layer rollback when prod breaks

Recovering from a broken production deploy on Vercel by flipping the deployment alias and reverting the bad commit on main.

A merge to main went out ten minutes ago. Sentry’s error rate is climbing, and customers are hitting 500s on a page that worked an hour ago. Under that pressure, an experienced engineer asks three precise questions. What is the fastest path back to a known-good state? What does that path not undo? And how does the code side close the loop so the bug doesn’t ship again on the next deploy? You already have the foundation for the answer. Earlier in this chapter, in “The push-is-the-deploy model,” you learned that production is just an alias pointing at one immutable deployment, and that the previous deployments never went away. Rolling back means re-aiming that pointer, and as you’ll see, it takes two layers that both have to move. By the end of this lesson you’ll have an ordered runbook you could run at 2 AM, and a clear sense of where rollback stops being able to help you.

Production is a pointer you can re-aim

Everything in this lesson rests on one idea you already met, so let’s reload it deliberately. If you only half-remember it, rollback looks like magic, and you’ll reach for the wrong tool under pressure.

Every production deploy you’ve ever shipped left behind a permanent, still-running build on its own <hash>-<project>.vercel.app URL. None of them were overwritten. The build from last Tuesday is up right now, exactly as it was, and so is the one before it. The last-known-good deployment, the one that was serving traffic before the bad merge, is up right now too, on its own URL, fully alive.

That’s what makes the next move so cheap. “Deploy” means re-aiming the production alias at a new deployment. “Rollback” means re-aiming that same pointer backwards, at a deployment that already exists: the same operation, opposite direction. Rollback triggers no rebuild. Nothing compiles, nothing gets packaged, no artifact is produced, because the target is already built and already running. That’s why rollback is instant: you’re not making anything, you’re just moving a label.

[Figure: the production alias app.example.com drawn as a pointer over three deployments, left to right: v1 (older, "fix invoice total", v1-app.vercel.app), v2 (last-known-good, "add export button", v2-app.vercel.app), and v3 (BROKEN, "new pricing page", v3-app.vercel.app). Rollback slides the pointer left.]

Deploys add a box on the right and slide the pointer to it; rollback slides the pointer left. The boxes never move or disappear — every previous deployment stays live on its own URL, which is exactly what makes rollback instant.

This also tells you exactly which deployments you’re allowed to roll back to. The rule is one question: was this deployment ever aliased to production? Every build that served live traffic, even for one minute, is a future rollback target. Preview deployments never held the production alias, so they are never rollback targets, because production was never there to go back to. We’ll return to that boundary in a moment; for now just keep it in mind.
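If the pointer idea still feels abstract, a toy model can make it concrete. The sketch below is not how Vercel is implemented; it only mirrors the two rules above in plain bash: builds are never removed, and only a build that once held the production alias is a rollback target. The function names and the pr-42-app.vercel.app preview URL are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model only: deployments are immutable entries, production is an index.
deployments=()      # every build ever made, never removed
was_production=()   # "yes" if that build ever held the production alias
production=-1       # index the production alias currently points at

# A merge to main: build a new deployment and aim the alias at it.
deploy() {
  deployments+=("$1")
  was_production+=("yes")
  production=$(( ${#deployments[@]} - 1 ))
}

# A preview build: it exists on its own URL but never takes the alias.
preview() {
  deployments+=("$1")
  was_production+=("no")
}

# Rollback: re-aim the alias at an existing build. Nothing is rebuilt.
rollback_to() {
  local i
  for i in "${!deployments[@]}"; do
    if [ "${deployments[$i]}" = "$1" ]; then
      if [ "${was_production[$i]}" != "yes" ]; then
        echo "refused: $1 never served production" >&2
        return 1
      fi
      production=$i
      return 0
    fi
  done
  echo "refused: no such deployment $1" >&2
  return 1
}

deploy v1-app.vercel.app
deploy v2-app.vercel.app
preview pr-42-app.vercel.app   # hypothetical preview deployment
deploy v3-app.vercel.app
rollback_to v2-app.vercel.app
echo "production -> ${deployments[$production]} (${#deployments[@]} builds still exist)"
# prints: production -> v2-app.vercel.app (4 builds still exist)
```

Notice that rollback_to only changes one integer. Calling it with the preview URL is refused, which is the eligibility rule in miniature.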

Layer 1: flip the alias back

This is the fast, first move, the one you make while the site is on fire. Learn to do it on its own, and resist the urge to “fix it properly” in the same breath. Conflating the alias flip with the code fix is the single most common rollback mistake, and the rest of this lesson exists to prevent it. The alias flip buys you minutes; it is not the end of the incident.

The dashboard path is two clicks. On the production deployment’s tile, click Instant Rollback, or from the Deployments list, open the ⋮ menu on any row and choose Instant Rollback. Pick the last-known-good deployment, confirm, and traffic flips within seconds; give the edge cache under thirty seconds to settle. There is no rebuild. On Pro and above, “Choose another deployment” lists every eligible deployment in your history, not just the immediate predecessor. On the Hobby tier you’re limited to rolling back to the immediately previous one.

[Figure: the Production deployment tile, showing "new pricing page", commit a1f9c2e on main, status Ready, deployed 12m ago.]

The production deployment tile with the Instant Rollback action. After a rollback this same control becomes 'Undo Rollback' — the button you press once the fix is verified.

The dashboard is fine when you’re sitting in front of it. But an experienced engineer keeps the CLI within reach too: it’s what you script into an incident-response tool, and what you fall back on when the dashboard is slow at the worst possible moment. Three commands carry the whole job.

vercel ls                          # list deployments — find the target URL
vercel rollback                    # re-alias to the immediately previous prod deploy
vercel promote <deployment-url>    # re-alias to a specific deployment

The difference between the bottom two commands is the one that trips people up, so make it a rule you can say out loud. vercel rollback re-aliases to the immediately previous production deployment. That’s fast but blunt: it only does the right thing if the bad deploy is the latest one. vercel promote <url> re-aliases to a specific deployment you name, which is the precise, durable move when the known-good build you want isn’t the immediate predecessor. The decision rule: bad deploy is the latest → vercel rollback; you need a specific older good one → vercel promote <url>. Run vercel rollback when the bad commit isn’t the latest, and you’ll roll back to a different broken-or-stale build and think you’re done. So run vercel ls first, identify the exact known-good URL, then promote precisely.
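The decision rule is small enough to write down as a function, which is roughly what an incident-response script would encode. This is a sketch that takes the lesson's description of the two commands as given; layer1_command and its arguments are hypothetical names, and it only prints the command rather than running it. Confirm the exact behavior of vercel rollback against the CLI reference for your installed version before you script it.

```shell
# Prints the Layer 1 command the decision rule selects.
#   $1 = the immediately previous production deployment URL (from `vercel ls`)
#   $2 = the known-good deployment URL you want serving traffic again
layer1_command() {
  local previous="$1" good="$2"
  if [ "$good" = "$previous" ]; then
    # The bad deploy is the latest one, so "previous" is the build you want.
    echo "vercel rollback"
  else
    # The good build is further back: name it explicitly.
    echo "vercel promote $good"
  fi
}

layer1_command v2-app.vercel.app v2-app.vercel.app
# prints: vercel rollback
layer1_command v2-app.vercel.app v1-app.vercel.app
# prints: vercel promote v1-app.vercel.app
```

Printing instead of executing is deliberate: under pressure, a command you read before you run is one less surprise.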

Layer 2: undo the bad commit on main

The alias flip stopped the bleeding. But notice what it did not touch: main still has the bad commit sitting on top of it. The artifact you rolled back to is fine, but the source of truth your next deploy builds from still contains the bug. The moment anyone merges anything to main, say a teammate’s unrelated feature an hour from now, that merge builds from a main that still includes the broken code, and you’ve silently re-shipped the exact thing you just rolled back. The alias flip treats the symptom; the commit on main is the cause, and Layer 2 is where you address it.

The cure is the inverse commit. You met it back in the Git chapter: git revert <bad-sha> creates a new commit that applies the inverse of the bad one. Nothing is rewritten and no history is lost. It’s an honest, forward commit that happens to undo a previous one.

git revert <bad-sha>

Here’s the part worth sitting with, because incident pressure pushes hard against it. That revert does not get a special fast lane. It flows through the exact same gated process as any other change: you open a PR, the four-job CI runs its roughly four-minute build, it merges to main, and that merge produces a fresh production deployment with the fix baked in. Under fire, the gate feels like it’s in your way. Why am I waiting four minutes to ship a one-line revert when the site is broken?

Wait anyway, and don’t bypass the gate on reverts. A revert is code, and code can be wrong: it can conflict, it can accidentally drag along an unrelated change, it can even be a revert of a revert that re-introduces the original bug. The gate is cheap insurance against turning one incident into two, and “skip CI when it’s urgent” is exactly the habit that manufactures the next urgent moment. You can afford to wait precisely because you did Layer 1 first: the alias flip already bought you the minutes the gate costs. Production is already safe on the old build, so the four-minute build isn’t downtime. It’s just the clock running on a fix while users keep getting served the last-known-good.
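To see the revert as the ordinary forward commit it is, you can try it in a throwaway repository. The sketch below assumes nothing about your project: it builds a tiny repo in a temp directory, makes a "bad" commit, and reverts it on its own branch. The revert_on_branch helper, the branch naming, and the pricing.txt file are illustrative choices, not part of the lesson's project.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a revert branch for one bad commit. Nothing is pushed or merged here.
revert_on_branch() {
  local bad_sha="$1"
  local short
  short=$(git rev-parse --short "$bad_sha")
  git checkout -q -b "revert/$short"
  git revert --no-edit "$bad_sha"
}

# Demo in a throwaway repository so the sketch runs anywhere git is installed.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "dev@example.com"
git config user.name "Dev"

echo "total = subtotal + tax" > pricing.txt
git add pricing.txt
git commit -q -m "add export button"

echo "total = subtotal" > pricing.txt     # the bug
git commit -q -am "new pricing page"
bad_sha=$(git rev-parse HEAD)

revert_on_branch "$bad_sha" > /dev/null

cat pricing.txt                 # total = subtotal + tax
git log --format=%s -n 1        # Revert "new pricing page"

# In a real incident the next steps go through the gate, for example:
#   git push -u origin "revert/<short-sha>"
#   gh pr create --base main --fill   # assumes the GitHub CLI; any PR flow works
```

History now holds three commits, and the bad one is still among them. The revert sits on top as a new commit, which is exactly why it can travel through the same PR and CI as everything else.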

What rollback resets, and what it can’t touch

Here is the second idea that has to land, and it’s the one that produces the most painful surprises: rollback is not a time machine. It flips a code pointer, and that is the entire extent of its power. People model it as “undo the last ten minutes,” then stare at a database that’s still wrong and a customer who still got charged, wondering why the rollback “didn’t work.” It worked exactly as designed. It just does far less than the intuition assumes.

Draw the boundary cleanly. Rollback resets everything that was frozen into the old deployment’s build: your application code, the server function bundle, the static and pre-rendered HTML, and the values of any environment variables that were inlined at that deployment’s build time. Recall from the env-vars lesson that NEXT_PUBLIC_* and other build-time reads get soldered into the artifact. Snap back to the old build and you snap back to all of it together, because it’s all one frozen package.

Rollback cannot touch anything that lives outside that frozen package. The rows the bad deploy wrote or mutated in your database are still there. Every side effect that already went out the door stays out: emails sent, Stripe charges captured, webhooks dispatched, files written to R2. One subtle case is worth a sentence: rollback runs the old build, but it does not revert your project’s current environment-variable configuration in the dashboard. Those live values are unchanged; only what was baked into the old artifact comes back.

Rollback resets

Frozen into the old build

Application code
Server function bundle
Static / pre-rendered HTML
Build-time env values (NEXT_PUBLIC_* inlined into the artifact)

Rollback can't touch

Lives outside the build

Database rows the bad deploy wrote or changed
Emails already sent
Stripe charges already made
Webhooks dispatched / files written to R2
The project's current dashboard env config

Rollback flips a code pointer and nothing more. Everything in the right-hand column survives the rollback untouched — which is why 'I rolled back, why is the data still wrong?' is the most common surprise.

So fix this takeaway in your mind before it bites you in production: the data-state problem is a separate problem. Rolling back the code does not roll back the data. If the bad deploy corrupted rows or double-charged customers, the alias flip stops the bleeding but does nothing for the wound. That’s a forward fix you write deliberately, not something a pointer swap hands you for free. Test it on yourself.

Exercise: an alias rollback just flipped production to the previous deployment. Sort each item by whether the rollback reset it or left it exactly as the broken deploy left it.

The two buckets:

Rollback reset it (frozen into the old deployment's build)
Rollback can't touch it (lives outside the build)

The items to sort:

The buggy version of the checkout page component
A NEXT_PUBLIC_FEATURE_X value inlined at build time
The server function bundle
Rows the bad deploy inserted into the invoices table
A welcome email the bad deploy already sent
A Stripe charge the bad deploy captured
A webhook the bad deploy already dispatched
The current value of STRIPE_SECRET_KEY in the dashboard

The split tests the same boundary as the figure: frozen into the build versus living outside it. If you got the database row or the Stripe charge wrong, re-read the takeaway above. Those are exactly the surprises rollback can’t undo.

The migration trap rollback won’t save you from

There’s one interaction nasty enough to deserve its own warning, because it’s exactly where “just roll back” makes things worse: the bad deploy ran a destructive schema migration.

Walk through the sequence. The migration already changed your database, say it dropped a column the new code stopped using. Then you flip the alias back to the old code. But the old code was written for the old schema, and that column is gone. Now you’re running yesterday’s code against today’s mutated database, and the mismatch can break the app more thoroughly than the bug you were rolling back from. The alias flip restored the code; it has no power over the schema.

You can’t simply un-migrate your way out, either. Drizzle migrations are forward-only: there’s no “rewind” button. The fix for a bad schema change is another schema change, a new forward-fix migration you write to repair the state. You go forward to safety, never backward.

Why does this rarely bite a well-run project? Because of the cadence the next chapter teaches: expand-migrate-contract. It is designed so that each migration step is safe for the previous deploy’s code, which means a code rollback stays compatible with the new schema by construction. You don’t need to know how it works yet. Just know that “roll the code back and the schema is fine” is something you earn with disciplined migrations, not something the platform guarantees.
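One cheap habit that supports that discipline is to flag destructive statements before they merge, so a dropped column is always a deliberate step rather than a side effect. The sketch below is an assumption about how you might do that, not a Drizzle feature: find_destructive_migrations is a hypothetical helper, the patterns are a starting point rather than a complete list, and the demo files stand in for whatever directory holds your generated SQL.

```shell
# Lists migration files containing statements an older build may not survive.
find_destructive_migrations() {
  local dir="$1"
  grep -rliE 'drop[[:space:]]+(column|table)|rename[[:space:]]+(column|to)' \
    --include='*.sql' "$dir" || true   # no match is not an error
}

# Demo with two throwaway migration files.
demo=$(mktemp -d)
printf 'ALTER TABLE invoices ADD COLUMN total_cents integer;\n' > "$demo/0001_expand.sql"
printf 'ALTER TABLE invoices DROP COLUMN total;\n' > "$demo/0002_contract.sql"

find_destructive_migrations "$demo"
# prints only the path of 0002_contract.sql
```

A flagged file is not automatically wrong. It is a prompt to ask whether the code that is live right now, and the code you would roll back to, can both run without what the migration removes.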

Re-enabling auto-assignment without re-shipping the bug

This one bites after you think the incident is over, which is what makes it dangerous. It’s a silent failure mode, and it interacts directly with the two-layer model.

The instant you do a manual rollback, Vercel quietly turns off auto-assignment of the production alias to new pushes. This is a deliberate safety measure: it stops a teammate’s next push to main from instantly re-shipping the broken code on top of the rollback you just performed. That’s sensible, but it has a consequence beginners walk straight into.

While auto-assignment is off, pushes to main still build, they just don’t take the production alias. So your git revert merges, a new deployment builds, CI goes green, the dashboard shows a healthy “Ready” deployment, and production does not change. Nothing is broken and nothing is wrong, but your fix isn’t live. People stare at a successful deploy that simply isn’t serving anyone and lose ten confused minutes wondering why the bug is “still there.”

The fix is an ordering, the same shape as everything else in this lesson. Do the steps in the right sequence and there’s no trap at all.

1. Let the git revert land on main through the normal gated PR.
2. The new production deployment builds. Verify it on its own deployment URL: open the <hash>-<project>.vercel.app link and confirm the fix actually works there, before it touches live traffic.
3. Only then re-enable auto-assignment by promoting the verified fix. The dashboard surfaces this as the Undo Rollback button on the production tile, or vercel promote <url> from the CLI. Promoting is the act that turns auto-assignment back on.

The trap is re-enabling before you’ve verified. Flip auto-assignment back on too early and you’ve re-armed the auto-ship of whatever lands on main next, which, if your revert wasn’t actually the fix, is the next surprise in production. Verify first, then re-enable, always in that order.
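The verify-then-promote ordering can be enforced by a small wrapper, so the promote simply cannot run before the check passes. This is a minimal sketch: verify_then_promote is a hypothetical helper, and a bare HTTP success on a path is a weak stand-in for "the fix actually works", so point it at a page that exercises the fix and still look at it yourself.

```shell
# Promote a deployment only after its own URL answers a request successfully.
#   $1 = deployment host (the <hash>-<project>.vercel.app URL, without scheme)
#   $2 = path to check, defaults to /
verify_then_promote() {
  local url="$1" path="${2:-/}"
  if curl --fail --silent --show-error --output /dev/null "https://${url}${path}"; then
    vercel promote "$url"
  else
    echo "not promoting: https://${url}${path} did not return success" >&2
    return 1
  fi
}

# Usage during an incident (the host comes from `vercel ls`):
#   verify_then_promote <hash>-<project>.vercel.app /pricing
```

If the check fails, the function returns non-zero and production stays on the last-known-good build, which is the safe default.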

The incident-response runbook

Now put the pieces together. You’ve learned the individual moves; the skill an experienced engineer brings to an incident is the order under pressure. A written runbook is the thing you open at 2 AM instead of reasoning from scratch with adrenaline in your veins. Here is the canonical shape.

1. Detect. An alert fires: a Sentry error spike, or an uptime check going red. Something tells you before a customer has to.
2. Triage. Confirm it’s real, not a blip, and identify the bad deployment in the Deployments list.
3. Roll back the alias (Layer 1). Promote the last-known-good deployment and verify traffic actually flipped by hitting the production URL. This is the fastest move and it comes first, because it stops the bleeding.
4. Communicate. Status page, Slack, customer comms as appropriate. People knowing what’s happening is part of the response, not an afterthought.
5. Revert the code (Layer 2). git revert the bad commit, open the gated PR, let CI pass, merge to main.
6. Re-enable auto-assignment, but only after the reverted-main deployment has built and you’ve verified it on its own URL.
7. Postmortem. What got past CI, and what’s the structural fix so this class of bug can’t recur.

Steps 4 and 7, communication and postmortems, are real parts of incident response with their own depth that a later unit owns; they’re named here so the runbook is complete, not skipped. And the runbook itself isn’t lore in someone’s head: it lives as a markdown file in docs/runbooks/, and the launch checklist in the next lesson verifies that file actually exists. A runbook nobody wrote down is a runbook you don’t have at 2 AM.
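The alias-rollback step in the runbook says to verify that traffic actually flipped by hitting the production URL. A runbook file can carry a small helper for that, so nobody improvises the check at 2 AM. The sketch below polls for roughly thirty seconds, matching the settle time mentioned earlier; wait_until_healthy and its defaults are assumptions, and you would point it at the page that was failing rather than just the home page.

```shell
# Poll a URL until it answers successfully or the attempt budget runs out.
#   $1 = URL to check
#   $2 = max attempts (default 10)
#   $3 = seconds to wait between attempts (default 3)
wait_until_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}"
  local i
  for (( i = 1; i <= attempts; i++ )); do
    if curl --fail --silent --output /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Wait before retrying, but not after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still failing after $attempts attempts" >&2
  return 1
}

# Usage right after the alias flip:
#   wait_until_healthy https://app.example.com/pricing
```

A non-zero exit here means the rollback has not taken effect from where you are standing, and that is your cue to check which deployment holds the alias before moving on to the next step.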

The load-bearing thing about that list is the sequence, so test yourself on it directly. The following exercise takes the same steps, shuffled, and asks you to put them back in the order you’d run them in a live incident.

Exercise: production is down. Order the incident-response steps from the first alert to the cleanup.

Detect: an alert fires (Sentry error spike or uptime check)
Triage: confirm it’s real and find the bad deployment
Roll back the alias: promote the last-known-good, verify traffic flipped
Communicate: status page / Slack / customer comms
Revert the code: git revert, gated PR, merge to main
Re-enable auto-assignment: only after the reverted build is verified
Postmortem: what got past CI, and the structural fix

The two moves the shuffle most wants you to fumble are the load-bearing ones: putting revert the code before roll back the alias (the alias flip stops the bleeding first), and re-enabling auto-assignment before the reverted build is verified (you’d re-arm the auto-ship of whatever lands on main next). Alias first, revert second, and verify before you re-enable.

The escape hatch: flip the flag instead

One last move, saved for the end because it reframes everything above: the fastest rollback is the one that touches no deployment at all.

If the broken behavior is gated behind a feature flag, such as the PostHog flags you wired up back in the analytics chapter, then flipping that flag off beats every rollback in this lesson. No new deployment, no alias change, no DNS, no waiting on an edge cache. A feature flag is a config value read at runtime, so turning it off takes effect in seconds, everywhere, with nothing to build and nothing to promote.

That reorders the whole hierarchy of responses, and the ordering is the reflex worth building: flag flip (if the behavior is flagged) > alias rollback > forward-fix migration, cheapest and safest first. That points at the preventive version of the same idea, which is the real payoff: ship risky features behind a flag in the first place. Do that, and the rollback story for any new behavior stops being “scramble the alias under pressure” and becomes “flip the flag.” The flag is the strongest layer precisely because it’s the one you set up before the incident: preventive, not reactive.

That reframe marks the difference between treating rollback as a tool and treating it as a crutch.

One last gap: caches in front of Vercel

One short completeness note before you close the book on this. If you’ve put a CDN in front of Vercel, the Cloudflare-in-front pattern named earlier in this chapter, instant rollback at Vercel does not reach through and purge that cache. Vercel flips its alias, but Cloudflare keeps serving whatever it cached from the broken build until its TTL expires. So if that layer exists, a cache-purge step belongs in the runbook, right after the alias flip; alternatively, keep short TTLs on anything dynamic so the cache can’t outlive a rollback by long. The same goes for any CDN above Vercel. It’s a small gap, but it’s the kind that makes a “successful” rollback look like it didn’t take.
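If Cloudflare is the layer in front, the purge step can sit in the runbook as one function. The sketch below uses Cloudflare's zone purge_cache endpoint with purge_everything, which is the bluntest option; the environment variable names are assumptions, and the API token needs cache-purge permission on that zone.

```shell
# Purge the whole Cloudflare cache for one zone, right after the alias flip.
# CLOUDFLARE_ZONE_ID and CLOUDFLARE_API_TOKEN are assumed environment variables.
purge_cloudflare_cache() {
  curl --fail --silent --show-error \
    --request POST \
    --header "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    --header "Content-Type: application/json" \
    --data '{"purge_everything":true}' \
    "https://api.cloudflare.com/client/v4/zones/${CLOUDFLARE_ZONE_ID}/purge_cache"
}

# Usage, once production is back on the last-known-good build:
#   purge_cloudflare_cache
```

Purging everything costs you a cold cache for a few minutes. During an incident that is usually a fair trade for being sure nobody is still served the broken build.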

That’s the model and the runbook. Two layers, and both must move: the alias flip for speed and the git revert for durability, with a clear-eyed sense of what neither one can undo. Next is the lesson that ties off the whole chapter, the pre-launch checklist that confirms every safety net you’ve built, including the runbook you just wrote, is actually wired before the URL goes public.

External resources

Performing an Instant Rollback on a Deployment (vercel.com). Vercel's own walkthrough of Layer 1: eligible deployments, the auto-assignment shutoff, and Undo Rollback.

vercel rollback (CLI reference) (vercel.com). The exact CLI surface for the incident-response tooling: flags, status, and the promote-to-undo path.

Schema changes and the power of expand-contract (xata.io). The migration trap, solved: how expand-migrate-contract keeps a code rollback safe against a mutated schema.

Feature Toggles (aka Feature Flags) (martinfowler.com). The canonical reference behind the escape hatch: flags that decouple release from deploy.