Skip to content
Chapter 99Lesson 2

Which migrations need the cadence

A decision framework for routing any Postgres schema change to one deploy or the full expand-migrate-contract cadence, judged by overlap-window safety and lock cost.

The previous lesson taught the expand-migrate-contract cadence: three pull requests, three reviews, three deploys, and a week or two of calendar time to rename one column without an outage. That is real engineering cost, and you will not pay it for ALTER TABLE invoices ADD COLUMN notes text. The overwhelming majority of schema changes ship in a single deploy, with no choreography at all. So the skill you reach for daily isn’t running the cadence, it’s the call you make before you run it: does this change need three deploys, or one? This lesson gives you a way to tell them apart from a schema diff in seconds. The whole thing turns on one sentence you already met last lesson: a change is safe in a single deploy only if the schema keeps working for both the old code and the new code during the overlap window, the seconds to minutes after a Vercel deploy when both fleets are live against the same database.

The two questions hiding in every schema diff

Section titled “The two questions hiding in every schema diff”

Before any list of “these changes are fine, these aren’t,” you need to see the two questions that decide it. They are separate questions. Collapse them into one, “is this change dangerous?”, and you can’t explain why a perfectly additive index brought production down. An experienced engineer asks them one at a time, because a change can fail on either one independently of the other.

The first question is the one from last lesson. Call it the overlap-window axis: does this change alter a shape the running code reads or writes? Adding a nullable column, a new table, or a new index is invisible to the old code, because it has no compiled query that touches any of those. The old code can’t break on a shape it never names. But the moment you drop, rename, or change the meaning of something the old code reads, you’ve broken one of the two fleets that are live during the window. This axis is about code and schema disagreeing.

The second question is new, and it has nothing to do with your application code. Call it the lock axis: does the migration’s SQL grab a lock that freezes the table long enough to be an outage on its own, even if there were only one version of your code running? This is a property of the SQL statement, not of your deployment. Postgres takes a lock to run any schema change. The question is which lock, and for how long. You only need two named lock classes for the whole lesson.

The heavy one is ACCESS EXCLUSIVE , the lock that blocks everything on the table, including plain SELECT reads. A table rewrite takes it, and so does a non-concurrent index build. If a statement holds ACCESS EXCLUSIVE for two minutes on a table your app reads on every page, your app is down for those two minutes. The light one is SHARE UPDATE EXCLUSIVE : reads and writes keep flowing while it’s held. CREATE INDEX CONCURRENTLY takes this one, and so does the VALIDATE CONSTRAINT you’ll meet shortly. Same work, no outage.

The point to carry is that these two axes are independent. A change can break the old code while locking nothing: a rename is instant, but it desyncs the fleets. A change can lock hard while breaking nothing: a non-concurrent index on a ten-million-row table is invisible to your code and still freezes the table for minutes. A naive NOT NULL promotion does both at once, and ADD COLUMN ... NULL does neither. Two axes, four corners.

Needs a long ACCESS EXCLUSIVE lock?
lock: yes Redesign the SQL, then re-check Index without CONCURRENTLY
Both gates fail Naive SET NOT NULL on a big table
lock: no Ship it, one PR Add a nullable column · new table
Cadence — code is out of sync Rename a column · drop a read column
breaks code: no
breaks code: yes
Breaks the old or new code?
Two independent gates. A change can fail on either axis, both, or neither, and the fix differs by corner.

The two orange corners want different fixes. The bottom-right corner breaks code but locks nothing, and it genuinely needs the cadence: there’s no SQL trick that makes the old code understand a renamed column. The top-left corner locks hard but breaks nothing, and it doesn’t need the cadence at all: it needs you to rewrite the one statement into its lock-light form. That distinction drives the decision tree we’re about to walk, so hold onto it.

The decision tree: three questions, in this order

Section titled “The decision tree: three questions, in this order”

A list of verdicts is tempting, but it fails you. Memorize “rename equals three deploys, add-column equals one” and you’ll meet a change the list never named, say tightening a CHECK constraint, with nothing to fall back on. The durable tool isn’t the list, it’s a short sequence of questions that generalizes to changes no list anticipated. There are three questions, and getting their order right is the core of the lesson.

Q1, is the change additive only? Does the old code keep working completely untouched? If yes, you’re a candidate for one PR, but you’re not done, because additive doesn’t mean lock-free, so fall through to Q2. If no, the change mutates a live shape and you’re heading toward the cadence, though Q2 may still reshape individual statements on the way.

Q2, does any single statement hold ACCESS EXCLUSIVE long enough to matter? This is the question that’s easy to miss, and it is not a verdict, it’s a redesign gate. An index without CONCURRENTLY, a type change that rewrites the table, and a constraint validated without NOT VALID all grab the heavy lock. When the answer is yes, the experienced move is to rewrite the migration into its lock-light form, with CONCURRENTLY, or NOT VALID then VALIDATE, and then ask Q1 again from the top. Only a rewrite you genuinely can’t avoid escalates to the cadence on lock grounds alone. The classic example is intbigint, which physically rewrites every row.

Q3, does the running code read or write a shape that’s about to disappear or change meaning? This is the question that actually triggers the cadence. If yes, that’s three deploys, expand-migrate-contract. If no, that’s one PR, ship it. By the time you reach Q3, Q1 and Q2 have already filtered out the additive changes and reshaped the lock-heavy ones, so Q3 is asking the clean version of the real question.

It’s worth seeing why the questions can only run in this order. Ask Q3 first and you’ll wave through a slow additive index: it doesn’t break any code, so Q3 says “one PR,” and you ship an ACCESS EXCLUSIVE outage. Ask Q2 first and you’ll waste an afternoon lock-proofing a column rename that needs the full cadence no matter how light its locks are. Q1 sorts the additive from the mutating, Q2 reshapes what it can, and Q3 makes the final call on what’s left.

Walk the tree below. Each branch is a question an experienced engineer actually asks, in the order they ask it. Commit to a path one click at a time, the way you would reading a real diff.

Route the migration

The rest of the lesson is that tree’s worked output: the changes that land on “one PR,” the changes that land on “three deploys,” and, most usefully, the handful of changes that look like one but are really the other.

The one-deploy changes: additive and lock-light

Section titled “The one-deploy changes: additive and lock-light”

These ship in a single PR. For each one, look at how it passes Q1, and at how it clears the lock where it would otherwise trip Q2.

A new table is the easiest case. The old code was compiled against a schema that didn’t have this table, so it has no query that names it, and there is nothing to collide with. Zero risk on either axis.

A new nullable column the app doesn’t read yet is a pure ADD COLUMN. The old code is unaware of it. The new code writes to it when it’s ready. Nothing reads a value that might be missing, so there’s nothing to handle.

A new nullable column the new code does read is barely different. The old code is still unaware. The new code reads null for every historical row that predates the column and handles that, usually with a coalesce to a sensible default or a fall-through in the query helper. The column is nullable precisely so the old rows are legal.

A new index is the trap in the green list. It’s additive, since no code names it, so Q1 waves it through. But Q2 catches it: a plain CREATE INDEX takes ACCESS EXCLUSIVE and blocks every write to the table for the entire build, which on a large table is minutes of downtime. The fix is the lock-light form, CREATE INDEX CONCURRENTLY, which builds under SHARE UPDATE EXCLUSIVE while reads and writes keep flowing. You met the CONCURRENTLY hand-edit and its statement-breakpoint marker back in the migrations chapter. The only thing to remember here is that an index is one PR only in its concurrent form.

A new CHECK or foreign-key constraint ships in one PR via the NOT VALID then VALIDATE two-step, as long as the existing data already satisfies it. That mechanic gets its own section next, because it’s the one genuinely new piece of SQL in this lesson.

A new enum value via ALTER TYPE ... ADD VALUE is one PR on the current Postgres the course runs on Neon. It just needs the statement-breakpoint marker, the same as CONCURRENTLY, because it can’t run inside a transaction block.

Here’s what that green list looks like as actual migration SQL. Treat these as recognition targets: when you see this shape in a generated migration.sql, you’re looking at a one-deploy change.

migrations that ship in one PR
ALTER TABLE invoices ADD COLUMN notes text;
CREATE INDEX CONCURRENTLY idx_invoices_status ON invoices (status);
ALTER TYPE invoice_status ADD VALUE 'partially_paid';

The lock-light two-step: NOT VALID then VALIDATE

Section titled “The lock-light two-step: NOT VALID then VALIDATE”

This is the one new SQL mechanic in the lesson, and it’s the hinge that turns Q2 from a dead end into a redesign. It’s worth learning well, because it converts a whole category of would-be-expensive changes into one-deploy changes.

The problem it solves is that adding a constraint the obvious way is slow and blocking. When you write ALTER TABLE invoices ADD CONSTRAINT ... CHECK (...), Postgres has to prove the constraint holds for every existing row before it can accept it, so it scans the entire table and holds a lock the whole time. On a big table that’s minutes of blocked writes. The scan is necessary, but doing it all in one blocking statement is not.

Postgres lets you split that work into two statements that each take a cheap lock.

The first, ADD CONSTRAINT ... NOT VALID, registers the constraint but only enforces it on new rows from this moment forward. It doesn’t scan the existing rows at all, so it’s instant: a brief lock and you’re done. Your constraint is live for everything written from here on.

The second, VALIDATE CONSTRAINT, goes back and scans the existing rows to confirm they satisfy the constraint too. It still reads the whole table, but it does so under SHARE UPDATE EXCLUSIVE, the light lock, so reads and writes keep flowing the entire time. There’s no outage. Once it succeeds, the constraint is fully validated and behaves exactly like one added the blocking way.

See the two side by side. The contrast is the whole point: same constraint, same end state, wildly different lock cost.

ALTER TABLE invoices
ADD CONSTRAINT invoices_amount_nonneg CHECK (amount_cents >= 0);

One statement, full scan, blocked writes. Postgres validates every existing row before accepting the constraint, holding ACCESS EXCLUSIVE for the whole scan. On a large invoices table that’s minutes of no writes, an outage hiding inside a one-line diff.

One naming trap is worth clearing up now, because it catches people every time. NOT VALID is not NOT NULL. They share two words and a similar look, but they mean entirely different things. NOT VALID is the “validate this later” flag you attach to a constraint so it skips the upfront table scan. SET NOT NULL is a constraint promotion: it permanently forbids nulls in a column. You’re about to use both in the same sequence, so note the difference now. NOT VALID is about when validation happens; NOT NULL is about what is allowed.

The three-deploy changes: when the cadence earns its cost

Section titled “The three-deploy changes: when the cadence earns its cost”

These fail Q3, because the running code reads or writes a shape that’s about to change out from under it, so they get the full expand-migrate-contract treatment. For each one, look at why the old code can’t survive the change in a single deploy.

A column rename is the canonical case. Say you’re renaming invoices.customer_name to invoices.client_name, same type, just a clearer name. During the overlap window the old fleet is still selecting customer_name while the new fleet selects client_name, and a single deploy can satisfy exactly one of them. That’s the full three-step cadence, because there’s no SQL trick that teaches the old code a new column name. This is the simplest shape of last lesson’s example; swapping customer_name for a customer_id foreign key is the same problem plus a type and a constraint, which we classify in full below.

A type change that rewrites the table has the most nuance, because not every type change rewrites. varchar(50)text is metadata-only in Postgres: free, instant, one PR. But intbigint physically rewrites every row to widen the column, under ACCESS EXCLUSIVE, and textint can fail partway through on a row that doesn’t parse. Anything that rewrites gets the cadence: add a new column of the target type, dual-write, backfill, switch reads, drop the old. This is the one case the cadence catches from both axes: it breaks nothing in the code sense, but the rewrite lock alone would take you down.

Dropping a column the running code still reads teaches something subtle: the cadence is a shape, not always literally three PRs. Even if your new code has stopped reading the column, the old fleet in the window still reads it, so dropping it now would make the old fleet 500. The change collapses to a two-step read-then-drop: deploy one ships code that no longer reads the column, and deploy two drops it once the old fleet is fully gone. Expand is empty here, because there’s no new shape to add; migrate is just removing the read; and contract is the drop. Same cadence, two deploys instead of three.

Adding NOT NULL to an existing column is the showcase, and it gets its own section below. It’s the one change that fails both axes at once, and its safe form is the single most valuable production pattern in this lesson.

Adding a foreign-key column that becomes required is the full version of the running example, now classified end to end. Expand adds customer_id uuid as nullable, with the FK constraint as NOT VALID. Migrate dual-writes customer_id alongside customer_name, backfills the historical rows, and VALIDATEs the FK. Contract drops customer_name and promotes customer_id to NOT NULL. This exact shape is what you’ll walk hands-on in the next chapter’s project, so recognize it here and build it there.

Here’s that FK-required-column case as three concrete diffs across three PRs, using last lesson’s --name convention so the migration files read as a sequence.

one change, three PRs
-- PR 1 — expand (drizzle-kit generate --name expand_invoices_customer_id)
ALTER TABLE invoices ADD COLUMN customer_id uuid REFERENCES customers (id);
-- PR 2 — migrate (dual-write + backfill live in app code; no DDL)
-- PR 3 — contract (drizzle-kit generate --name contract_invoices_customer_id)
ALTER TABLE invoices DROP COLUMN customer_name;
ALTER TABLE invoices ALTER COLUMN customer_id SET NOT NULL;

One more is named here for recognition only: changing the primary key. It’s the most expensive class, and in practice it often needs a maintenance window or a read-replica swap, well beyond the three-deploy cadence. You won’t do this in the course, so just recognize it as its own tier when you see it in the wild.

Promoting a column to NOT NULL without locking the table

Section titled “Promoting a column to NOT NULL without locking the table”

This is the pattern to carry to real production. It touches both axes at once, its naive form is a classic outage, and its safe form is non-obvious enough that most people learn it the hard way. You won’t have to.

Here’s the trap. ALTER TABLE invoices ALTER COLUMN customer_id SET NOT NULL does two bad things in one statement. It fails outright if any existing row holds a null in that column. And even when every row is already non-null, it takes ACCESS EXCLUSIVE and scans the whole table to prove it, blocking all access, reads included, for the length of the scan. On a large hot table that’s an outage, and you can’t fix it by checking the data first, because the lock is the problem regardless of the nulls.

The safe version is five moves, and it maps cleanly onto the cadence. The payoff is in the last step, so watch where the expensive scan goes.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Deploy one, stop the new nulls. Make the application write customer_id on every insert and update path, the shared mutation path from last lesson’s dual-write discipline. This is application code, not SQL. No new nulls are created from here on, so the set of nulls left to fix is now frozen.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Backfill the existing nulls to a sensible value, batched and idempotent, the same discipline as last lesson. This is application code too. It clears the historical nulls that would make a bare SET NOT NULL fail outright.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Add a CHECK (customer_id IS NOT NULL) as NOT VALID. It’s instant: a brief lock, no scan. This guards the future against new nulls, though note it’s a CHECK constraint, not yet the column’s own NOT NULL.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Validate it. The scan runs under SHARE UPDATE EXCLUSIVE, so reads and writes keep flowing, and it proves no nulls remain. The expensive part, reading every row, now happens under a light lock instead of a blocking one.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Promote to NOT NULL. This is the payoff: Postgres sees the validated CHECK already proving no nulls and skips the table scan entirely, so the promotion is a near-instant metadata flip instead of a minutes-long ACCESS EXCLUSIVE outage.

ALTER TABLE invoices
ADD CONSTRAINT invoices_customer_id_not_null
CHECK (customer_id IS NOT NULL) NOT VALID;
ALTER TABLE invoices
VALIDATE CONSTRAINT invoices_customer_id_not_null;
ALTER TABLE invoices
ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE invoices
DROP CONSTRAINT invoices_customer_id_not_null;

Drop the scaffolding CHECK. The column’s own NOT NULL now carries the guarantee, so the CHECK is redundant, an optional tidy-up rather than something load-bearing.

1 / 1

It’s worth dwelling on the move that pays off. The expensive SET NOT NULL became a cheap metadata flip because the validated CHECK had already done the proving under a light lock. The cadence here isn’t bureaucracy: every step removes a different, specific risk. Deploy one stops new nulls, the backfill clears old nulls, the NOT VALID constraint guards the future, the VALIDATE proves the present without blocking, and only then does the promotion cost nothing.

Now the changes that defeat memorization outright, the ones whose answer is “it depends.” This is exactly why the tree beats any list. Memorized verdicts catch you on these; the three questions handle them routinely.

Renaming a table splits on Q3. ALTER TABLE ... RENAME TO is metadata-only and instant, so it sails through Q2. But the app-code window has the same problem as a column rename: the old fleet still queries the old table name. So it’s the cadence if the table takes live writes. For a write-rare lookup table, though, a brief maintenance-window cutover can be cheaper than three full PRs. That’s a judgment call the tree surfaces but doesn’t make for you.

Adding a default to an existing column is one PR on current Postgres. ALTER COLUMN ... SET DEFAULT is metadata-only: historical rows keep whatever they had, and future inserts get the default. The one thing worth checking is the Postgres version, because older versions rewrote the entire table to stamp the default onto every row, which would be a Q2 problem. The course’s Neon is current, so the fast path is your default.

Removing a default is metadata-only too. It’s always one PR.

Tightening a CHECK constraint ties straight back to the two-step. If every existing row already passes the stricter rule, it’s NOT VALID + VALIDATE in one PR. If some rows fail it, you must backfill the violators first, which is a migrate step, and then it becomes the cadence. Same change, two different answers, decided entirely by the data.

That’s the lesson of the middle list: the verdict isn’t a property of the change’s name, it’s a property of the change in context. Run the questions.

Try routing a mixed bag yourself. Drag each change into the deploy count it needs, and watch for the middle-list items, which state the condition that decides them.

Route each schema change to the number of deploys it needs. Read the middle-list items carefully — the stated condition decides the answer. Drag each item into the bucket it belongs to, then press Check.

One PR Additive and lock-light
Three deploys Expand, migrate, contract
Add a new table
Add an index with CREATE INDEX CONCURRENTLY
Change a column from varchar(50) to text
Add a CHECK constraint when every existing row already passes it
Add a default to a column (current Postgres)
Swap text customer_name for a customer_id FK column
Drop a column the running code still reads
Promote an existing column to NOT NULL
Change a column from int to bigint
Add a CHECK constraint when some existing rows violate it

In practice this decision doesn’t happen in the abstract. It happens while you’re staring at the SQL drizzle-kit generate just produced. The generated migration.sql is the diff, and it answers Q1 and Q2 directly. You can see ADD COLUMN versus DROP COLUMN. You can see whether CONCURRENTLY is present on an index. You can see a type change. The routing decision is mostly an act of reading.

The one case that needs a human eye is the rename. Drizzle Kit can’t read your mind: when you rename a column in the schema, it sees a column disappear and a similarly shaped one appear, and it emits a DROP COLUMN next to an ADD COLUMN. That pairing is your prompt to decide what actually happened: a true rename, which routes through the cadence, or a genuine remove-and-add, which routes by Q3 on each half. You handled this disambiguation back in the migrations chapter. Here it’s the single most important shape to catch in review, because a rename shipped as a one-PR DROP + ADD is the textbook one-deploy outage.

So review the generated SQL every time, and never push in production. The --name convention from the migrations chapter keeps the filenames legible as a sequence. For schema PRs, the reviewer’s whole job is to read the diff and route it.

Read the PR below the way you’d read a teammate’s. It’s a real generated migration.sql paired with the app-code change that prompted it. Leave a comment on anything you’d flag before approving.

You're reviewing a schema-change PR before it merges. Leave a comment on any line that shouldn't ship as one deploy. Click any line to leave a review comment, then press Submit review.

drizzle/0007_change_invoices.sql
ALTER TABLE invoices ADD COLUMN customer_id uuid REFERENCES customers (id);
ALTER TABLE invoices DROP COLUMN customer_name;
CREATE INDEX idx_invoices_customer_id ON invoices (customer_id);

Tooling that automates the cadence, and why we don’t reach for it

Section titled “Tooling that automates the cadence, and why we don’t reach for it”

You should know the alternative exists. There’s an open-source tool, pgroll, from the team at Xata, that automates the whole cadence. It runs both schema versions side by side using Postgres views, so the old code and the new code each see exactly the schema they expect, and it handles the expand/contract choreography and the lock-safe backfills for you. If a team is running many cadence-class migrations a quarter and wants that choreography off their plate, pgroll is a reasonable reach.

The experienced call: bias toward the cadence when unsure

Section titled “The experienced call: bias toward the cadence when unsure”

One last thing is the one to remember when the three questions don’t give you a clean answer: the cost of getting this call wrong is wildly lopsided.

Get it wrong toward caution by running the cadence on a change that didn’t need it, and you’ve spent a week of calendar time and some review effort on three PRs where one would have done. That’s annoying, but it’s bounded: no production damage, no pager, no 3 a.m. emergency.

Get it wrong toward speed by shipping a rename or a drop in a single deploy, and you’ve taken production down during the cutover, in front of users, with the failure clearing on its own afterward so there’s nothing obvious to revert. That downside has no ceiling.

When the questions don’t give you a confident answer, that asymmetry makes the choice for you: treat the change as cadence-class. The cost of being wrong toward caution is a known quantity; the cost of being wrong toward speed is not.

And run both gates every time. The two gates are independent: one asks whether the change breaks the old code, the other asks whether it holds a long lock. A change that’s clearly additive can still need CONCURRENTLY, so passing Q1 and Q3 tells you nothing about Q2. The reviewer who only asks “does this break the old code?” ships the ten-million-row index that freezes the table for four minutes.

Finally, distrust the sentence “no one will hit this table during the deploy.” That’s the famous-last-words version of the speed-side error. The cutover window is exactly when your cron jobs fire, your ops scripts run, your staging tools poke production, and a stray health check reads the row you just dropped. There is no quiet table during a deploy, only the table whose traffic you didn’t think about.