Chapter 98, Lesson 8

The launch checklist

A nine-row pre-launch checklist that verifies every safety net you already shipped is actually live in production, turning a URL that merely renders into one that is defensible.

Your production URL has been live since earlier in this chapter, when the repo first went up on Vercel. The homepage renders. The custom domain resolves over HTTPS, the function sits next to the database, previews get their own Neon branch, and a git push to main is a deploy. Every platform knob you’d touch on day one is tuned. So here is the question an experienced engineer asks before telling anyone the URL: the homepage renders, but is the product launched?

It isn’t, and the gap between those two words is what this lesson covers. “Live” means the server answers. “Launched” means something stricter: the URL is defensible. On a long enough timeline something always breaks, and when it does, a defensible app degrades gracefully instead of falling over. It doesn’t leak data while it’s failing, it doesn’t collapse under a load spike, and it has a human who finds out the failure happened. A live URL makes none of those promises. A launched one makes all of them.

You have already built almost every one of those safety nets. Across the units behind you, you wired environment validation, error monitoring, rate limits, audit logs, security headers, connection pooling, and database restore. None of that is new work. What’s missing is the act of confirming each net is actually live in production, because a net you wired in development and never verified in production is, for safety purposes, a net you don’t have.

So this lesson’s deliverable is a checklist: nine rows, each one a sixty-second check you run against your own deploy. Eight of them verify machinery you already shipped. Exactly one is new code, a small health endpoint you’ll write here, because the one net the rest can’t provide is something for an uptime monitor to ping. Run the list, and you’ll know whether your URL is merely live or genuinely launched.

Launch is a posture, not a URL

Before the rows themselves, it helps to set the mindset they assume. If you read the checklist as bureaucracy, you’ll tick boxes without believing them, and the whole point is lost.

Putting a building on a map is not the same as opening it to the public. A new building gets an occupancy inspection: someone confirms the smoke detectors fire, the exits open from the inside, and the locks actually lock. Nobody confuses “the lights turn on” with “people can safely be inside.” Your URL going live is turning the lights on. The checklist is the occupancy inspection. “It renders” tells you the lights work; it tells you nothing about what happens when there’s smoke.

Two principles run through every row. Naming them once here lets the rows refer back to them later.

The first is that a safety net nobody reads is not a safety net. Wiring an alert is necessary but not sufficient. An error monitor that no human watches, an audit log nobody queries, an uptime check that pages a dead phone: each is wired, and each is useless. Three of the nine rows are inert without a person on the other end, so the human side gets its own section later, because it’s the half that’s both easiest to skip and most expensive to skip.

The second is that the checklist is structural, not ceremonial. Every row maps to one concrete, observable check. An unchecked row means the app is not launched, regardless of how good the homepage looks. This isn’t a feeling that things are “probably fine.” It’s a list that is either green or it isn’t.

The next diagram shows that relationship in space. On the left, “URL is live”: the homepage answers, and that’s the whole story. On the right, “URL is defensible”: the same homepage, now ringed by the nine nets that catch it when it falls. The rows of this lesson are exactly what fills the gap between the two cards.

[Diagram: two cards side by side. Left card, “URL is live”: the server answers. Right card, “URL is defensible”: env validation, error monitoring, rate limits, audit logs, security headers, pooled DB, restore history, uptime monitor, runbooks.]

Live is a subset of launched. The left card is everything 'the homepage renders' gives you; the right card adds the nine verified safety nets that make the URL defensible. The checklist is what turns the left into the right.

The nine-row launch checklist

Here is the checklist itself. It’s interactive: tick each row as you verify it against your own production deploy. The ticks persist across reloads, so you can come back to it. Eight rows are hard requirements; the ninth is soft, and you’ll see why when you reach it. An untested chip on a row means exactly what it says, that you haven’t confirmed the row yet.

1. Env validation passed in the production build; SKIP_ENV_VALIDATION is not set. [untested]

2. Error monitoring receives a deliberate test exception within seconds, with a readable stack trace. [untested]

3. Auth endpoints return 429 after their rate-limit threshold. [untested]

4. A privileged action writes a fresh row to audit_logs. [untested]

5. curl -sI on the production URL returns all six security headers. [untested]

6. Drizzle uses Neon’s pooled connection string; the function region matches the DB region. [untested]

7. Restore history is set to an adequate window, and one test restore has been performed. [untested]

8. An external uptime monitor pings /api/health and pages a real human on failure. [untested]

9. docs/runbooks/ holds a one-pager for rollback, restore, and credential rotation. [untested]

The component holds each row to a single observable outcome, which is all a checkbox should carry. The teaching lives below it. Each row follows the same four-part shape: what it protects, how you verify it (a command or a step you run), where it was wired (so you can jump back if it’s failing), and what you lose if you skip it. Most of the verification is a curl, a SQL query, a dashboard glance, or one deliberate test failure. None of it, apart from the health endpoint, is something you build here.
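
The “green or it isn’t” rule can be written down as data. The sketch below is one reading of it, not the checklist component’s actual code: it assumes the soft row is reported but does not block launch, and the names ChecklistRow, LaunchStatus, and launchStatus are invented for the illustration.

```typescript
// One checklist row. `hard` and `verified` are names invented for this sketch.
interface ChecklistRow {
  id: number;
  hard: boolean;     // rows 1 to 8 are hard requirements, row 9 is soft
  verified: boolean; // false corresponds to the "untested" chip
}

interface LaunchStatus {
  launched: boolean;  // true only when every hard row is verified
  blocking: number[]; // ids of hard rows that are still untested
  softOpen: number[]; // ids of soft rows that are still untested (reported, not blocking)
}

function launchStatus(rows: readonly ChecklistRow[]): LaunchStatus {
  const open = rows.filter((row) => !row.verified);
  const blocking = open.filter((row) => row.hard).map((row) => row.id);
  const softOpen = open.filter((row) => !row.hard).map((row) => row.id);
  return { launched: blocking.length === 0, blocking, softOpen };
}
```

With rows 1 through 8 verified and row 9 open, this reports launched with softOpen containing 9. Flip any hard row back to untested and launched turns false, however good the homepage looks.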

Row 1: Env validation green in production

Protects: the app booting with every required secret actually present. Verify: open the production build log and confirm the env validator ran and passed, then confirm SKIP_ENV_VALIDATION is not set in the production environment. That flag is an escape hatch for local tooling, never for a real build.

Wired in: the env-var lesson earlier in this chapter set the production scope. The validator itself, @t3-oss/env-nextjs plus a Zod schema in env.ts, has been failing the build on a missing required var since you first set up the database. Skip it and: a missing variable stops being a caught build failure and becomes an opaque runtime crash on the first real request, with a stack trace that points at the symptom rather than the empty process.env.
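
To make the two conditions in this row concrete, here is a dependency-free sketch of the same idea. It is a stand-in for illustration, not the env.ts validator from the database unit: preflightEnv is a hypothetical helper, and the variable names you pass it are whatever your own schema requires.

```typescript
// Hypothetical stand-in for the real validator (@t3-oss/env-nextjs plus Zod in env.ts).
type Env = Record<string, string | undefined>;

function preflightEnv(env: Env, required: readonly string[]): string[] {
  const problems: string[] = [];

  // The escape hatch must not be set for a production build.
  const skip = env['SKIP_ENV_VALIDATION'];
  if (skip !== undefined && skip !== '') {
    problems.push('SKIP_ENV_VALIDATION is set');
  }

  // A required variable that is absent or blank should fail here, at build time,
  // rather than on the first real request.
  for (const name of required) {
    const value = env[name];
    if (value === undefined || value.trim() === '') {
      problems.push(`${name} is missing`);
    }
  }
  return problems;
}
```

An empty array is the green state. Anything else is a list you can print and fail the build on, which is the behavior the real validator already gives you.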

Row 2: Error monitoring wired and receiving

Protects: the team seeing the exceptions the app reports. Verify: confirm your error monitor (Sentry, wired up back in the observability unit) is initialized in instrumentation.ts, then throw a deliberate test error in production and watch it land in the dashboard within seconds. Confirm the source maps uploaded too, so the stack trace points at your TypeScript rather than minified code.

The smallest way to fire that test error is a throwaway route handler you deploy, hit once, and then delete:

export const GET = async () => {
  throw new Error('Sentry test error');
};

Deploy it, hit the URL once in the browser, and confirm the Sentry test error event lands in the dashboard within seconds. Then remove the route, because a deliberate-crash endpoint is never something you leave live in production.

Wired in: the observability unit. instrumentation.ts is Next.js 16’s server-startup hook, the file that runs once when the server boots, which is where the monitor gets initialized. It was named when you set up the request gate, but the wiring lives in that later unit, so this row only confirms it fires. Skip it and: exceptions become invisible. They don’t stop happening, they stop being seen, and they compound silently until a customer emails you about one.

Row 3: Rate limits live on the abuse surface

Protects: your auth endpoints from credential stuffing on day one. Verify: confirm sign-in, sign-up, password-reset, and magic-link all run through the Upstash safeLimit wrapper, then send a burst of requests at one of them and watch the 429s start after the threshold. The cleanest way to generate that load is oha, a Rust load generator with a live terminal dashboard:

# 50 requests, 5 concurrent, POSTed at the sign-in endpoint
oha -m POST \
  -H 'content-type: application/json' \
  -d '{"email":"x@example.com","password":"wrong-on-purpose"}' \
  -n 50 -c 5 https://app.example.com/api/auth/sign-in/email

The method and body matter. /api/auth/sign-in/email is a POST-only handler, so a bare oha <URL> would fire GETs and get back nothing but 405s, and you’d never reach the limiter. Posting a JSON credential body actually hits the sign-in handler. In the oha summary, watch the status-code distribution flip from 400/401 (the wrong-password rejections) to 429 once you cross the limiter’s threshold. That flip is the proof the limit is live in production, not merely configured in code. If you don’t have oha installed, hey -m POST or a plain curl loop generate the same load.
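
If you generate the burst with your own loop instead of oha, the pass condition for this row can be sketched as a small function over the status codes you collected. summarizeBurst and its field names are hypothetical, and the threshold in the usage note is made up; your limiter’s real threshold comes from the rate-limiting unit.

```typescript
// Summarize a burst of HTTP status codes, in the order the responses were recorded.
interface BurstSummary {
  counts: Record<number, number>; // status code -> number of responses with that code
  firstLimitedAt: number;         // index of the first 429, or -1 if there was none
  limiterIsLive: boolean;         // true when at least one response was a 429
  onlyMethodErrors: boolean;      // true when every response was 405 (wrong HTTP method)
}

function summarizeBurst(statuses: readonly number[]): BurstSummary {
  const counts: Record<number, number> = {};
  for (const status of statuses) {
    counts[status] = (counts[status] ?? 0) + 1;
  }
  const firstLimitedAt = statuses.indexOf(429);
  return {
    counts,
    firstLimitedAt,
    limiterIsLive: firstLimitedAt !== -1,
    onlyMethodErrors: statuses.length > 0 && statuses.every((s) => s === 405),
  };
}
```

Ten 401s followed by forty 429s gives firstLimitedAt of 10 and limiterIsLive of true. A burst where onlyMethodErrors is true means the requests never reached the sign-in handler, so the run proves nothing about the limiter. With concurrent requests the recorded order is only approximate, so treat firstLimitedAt as a rough marker rather than the exact threshold.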

Wired in: the rate-limiting unit, where safeLimit and the dual-key Upstash limiters were built. Skip it and: an auth endpoint with no rate limiting becomes a target the moment it’s public. Credential stuffing is automated and indiscriminate, and it finds new domains within hours.

Row 4: Audit logs writing

Protects: your ability to answer “who did this, and when?” for privileged actions. That’s the question compliance asks in an audit and the question you ask yourself during an incident. Verify: confirm every privileged action (organization membership and role changes, billing changes, data exports) writes a row, then query the table directly and look for recent entries:

select * from audit_logs order by created_at desc limit 10;

Run this against the production database, then perform one privileged action, such as flipping a teammate’s role, and re-run it. A fresh row for the change you just made should appear at the top. Seeing the row you caused is the proof the write path is live.

Wired in: the organizations-and-RBAC unit, where the audit_logs table and the logAudit writer were built. Writes go through logAudit inside the same transaction as the action they record, so the log row and the change it describes commit together or not at all. Skip it and: compliance and post-incident forensics fly blind. The day someone asks who changed a customer’s plan last month, “we don’t log that” is not an answer you want to give.

Row 5: Security headers set

Protects: the browser refusing a whole class of attacks (clickjacking, MIME-sniffing, protocol downgrade, and script injection) on your behalf, before your code even runs. Verify: curl the production URL with headers only and confirm all six are present:

curl -sI https://app.example.com

In the response headers you’re looking for all six: Strict-Transport-Security, Content-Security-Policy, X-Content-Type-Options: nosniff, Referrer-Policy, Permissions-Policy, and X-Frame-Options. As an optional second check, paste the URL into securityheaders.com for a letter grade.

Wired in: the security-baseline unit owns these end to end, both the five static headers in next.config.ts and the per-request nonce CSP in proxy.ts. This row verifies that the headers are present and points you back to that unit; it does not re-derive the header set or re-author the snippet, because that work is already done and shipped. Skip it and: a response that looks perfectly correct in the browser is still frameable, sniffable, and downgradable. The attacks these headers block don’t show up in normal use, only when someone goes looking. One thing worth stating plainly: Vercel adds none of these for you by default, and an empty next.config.ts ships zero security headers.
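
Reading six header names out of a wall of curl output by eye is easy to get wrong. As a sketch, the comparison can be done by a function that takes the raw text printed by curl -sI and returns whichever of the six are absent. missingSecurityHeaders is a hypothetical helper, and it only checks that each header is present, not that its value is sensible.

```typescript
// The six headers this row checks for, lowercased because header names are case-insensitive.
const REQUIRED_HEADERS = [
  'strict-transport-security',
  'content-security-policy',
  'x-content-type-options',
  'referrer-policy',
  'permissions-policy',
  'x-frame-options',
] as const;

// Takes the raw text printed by `curl -sI` and returns the required headers that are absent.
function missingSecurityHeaders(rawResponse: string): string[] {
  const present = new Set<string>();
  for (const line of rawResponse.split(/\r?\n/)) {
    const colon = line.indexOf(':');
    // Lines without a colon (the status line, blank lines) are not headers.
    if (colon > 0) {
      present.add(line.slice(0, colon).trim().toLowerCase());
    }
  }
  return REQUIRED_HEADERS.filter((name) => !present.has(name));
}
```

An empty result means the row is green. A non-empty one names exactly which header to chase back to the security-baseline unit.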

Row 6: Pooled DB connection with matching region

Protects: the database surviving real load, and your queries not paying a cross-country tax on every call. Verify: two things. First, confirm Drizzle connects through Neon’s pooled connection string, which you can spot by the -pooler segment in the hostname. Second, confirm the production function region matches the Neon database region.

Wired in: the region match is the one knob you set deliberately earlier in this chapter, in the region-and-runtime lesson. The pooled-versus-unpooled split lives in the db client from the Postgres-and-Drizzle unit, which exports a pooled connection as the default. Skip it and: two separate failures, both invisible in local dev. Unpooled connections exhaust Postgres’s connection limit under load, and the app starts refusing queries. A region mismatch, which the region lesson covered in depth, adds roughly 80 ms to every query, a tax your average might hide but your p95 won’t.
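
The first half of the verification, spotting -pooler in the hostname, can be sketched as a check on the connection string. The helper names hostOf and isPooledConnection are hypothetical, and the example URLs in the usage note are invented. The check looks only at the host portion, so a password or database name that happens to contain the same text does not produce a false green.

```typescript
// Returns the host portion of a Postgres connection string, or null if none is found.
function hostOf(connectionString: string): string | null {
  const match = /^postgres(?:ql)?:\/\/(?:[^@\/]*@)?([^:\/?]+)/.exec(connectionString);
  return match?.[1] ?? null;
}

// True when the host carries the "-pooler" segment that marks a pooled endpoint.
function isPooledConnection(connectionString: string): boolean {
  const host = hostOf(connectionString);
  return host !== null && host.includes('-pooler');
}
```

Called on something shaped like postgresql://user:secret@ep-example-123456-pooler.us-east-2.aws.neon.tech/app it returns true, and on the same string without the -pooler segment it returns false. It says nothing about the region match, which you still confirm by comparing the two dashboards.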

Row 7: Restore history on and a test restore performed

Protects: your ability to recover from data loss: a bad migration, a fat-fingered delete, a corrupted batch job. Verify: confirm Neon’s instant-restore history window is set to an adequate retention. The default is one day on paid plans; for production, raise it toward seven days or more. Then perform at least one test restore to a Neon branch and confirm the restored data is intact. That test restore is the part that actually matters.

Wired in: the Postgres-and-Drizzle unit provisioned the database on Neon, and the branching you’ll use for the test restore is the same mechanism the preview-branch lesson covered earlier in this chapter. A note on terminology: Neon’s recovery model is instant restore, also called point-in-time restore. It’s not a nightly dump you reload. For a true off-platform copy as a belt-and-suspenders backup, pg_dump to your own storage is the classic move. Skip it and: when data loss happens, and the cause is usually your own code rather than Neon’s failure, you’ll discover whether restore works at the worst possible moment.

Row 8: External uptime monitor that pages a human

Protects: catching the app being down entirely, the one failure your error monitor structurally cannot report, because the app has to be running to report anything. Verify: confirm an external monitor pings /api/health (the endpoint you’ll build in the next section) every minute or so and pages on failure, then confirm that page actually reaches a real human. The current default for this is Better Stack, which bundles uptime checks, on-call scheduling, and escalation in one product, so the “page a human” step is built in rather than something you bolt on. Pingdom, UptimeRobot, and OnlineOrNot are reasonable alternatives.

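This row makes a point worth stating outright, because it’s the one juniors most often miss: error monitoring and uptime monitoring are not interchangeable. The error monitor runs inside the app, so it can only report failures the app is alive to see. Only a check that comes from outside can tell you the app isn’t answering at all.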

Wired in: partly here, since the /api/health endpoint is the one piece of new code in this lesson, and partly an external SaaS you sign up for. Skip it and: your app can be hard-down, returning nothing to every customer, with zero alerts firing, because nothing inside it is alive to notice. You find out when someone tweets at you.

Row 9: Runbooks for the top three incidents

This is the one soft row. Protects: the person responding to an incident at 2 AM, who needs a checklist to follow rather than a memory test under stress. Verify: confirm docs/runbooks/ holds a short markdown file, under one page each, for the three incidents most likely to actually happen: production rollback, database restore, and credential rotation.

Wired in: the rollback runbook was named in the rollback lesson earlier in this chapter, and credential rotation in the security-baseline unit. The proper runbook templates are the documentation unit’s job, which is why this row is soft: it checks that the files exist, not that they’re polished. Skip it and: every incident becomes improvisation, performed by a stressed human at the worst hour, reconstructing steps from memory. A half-page runbook written calmly beats perfect recall under pressure every time.

The next drill makes the mapping stick. The diagnostic reflex this whole checklist builds is symptom to net: when something goes wrong, which row would have caught or prevented it? The exercise gives you a set of failure scenarios. Sort each one into the safety net that addresses it.

Each scenario below is a production failure. Drag each one into the safety net that would have caught or prevented it, then press Check.

Uptime monitor: external ping that pages a human

Error monitoring: catches exceptions the app reports

Rate limit: caps requests per client

Audit log: records who did what, when

A deploy fails to boot and every user gets nothing at 3 AM — nobody notices for two hours.

The app process is hung: the database is up, but requests never return.

A null-pointer bug throws on the checkout page for 2% of users; the page still loads for everyone else.

A third-party API your app calls starts returning 500s, and your code surfaces a generic error.

An attacker scripts 10,000 sign-in attempts against one account overnight.

A customer disputes that anyone changed their plan, and you need to prove who did and when.

You can’t tell which admin removed a user from an organization last month.

The health endpoint the monitor pings

Row 8’s uptime monitor needs something to ping. The obvious candidate, “does the homepage return 200?”, is a weak signal, and understanding why is the point here. A Next.js page can render perfectly while the database behind it is unreachable: the static shell streams, the 200 goes out, and your monitor sees green while every data-driven action is quietly failing. A homepage 200 proves the web server is alive. It says nothing about whether the app can do its job.

So you ship a dedicated endpoint whose whole purpose is to answer one honest question, “is this app actually able to serve requests?”, by checking the one dependency it can’t function without: the database. It runs a trivial query, and if that query succeeds it returns 200; if it throws, it returns 503. It’s about ten lines, it takes no authentication, and it’s the only real code you write in this lesson.

import { sql } from 'drizzle-orm';
import { NextResponse } from 'next/server';
import { db } from '@/db';

export const GET = async () => {
  try {
    await db.execute(sql`select 1`);
    return NextResponse.json({ status: 'ok' });
  } catch {
    return NextResponse.json({ status: 'degraded' }, { status: 503 });
  }
};

A route handler, not a Server Action, because the caller is a non-browser client. The uptime monitor pinging this endpoint is exactly the first trigger in our route-handler conventions for reaching past a Server Action. The handler is a named GET export.

The liveness probe. A select 1 is the cheapest possible “is Postgres answering?” query: it touches no tables and returns instantly. The try/catch is the detail that does the work here. An unreachable database throws, and a check that throws is treated as failure rather than letting the exception escape.

The status split. A healthy database returns 200 { status: 'ok' }; a caught failure returns 503 { status: 'degraded' }. A health check that only confirms the process is alive, an always-200, is strictly weaker than one that confirms its critical dependency. The 503 is what trips the monitor.

Two deliberate restraints in that handler. First, the response body is tiny and says nothing specific: degraded, not the connection string or the error message, because this endpoint is public and unauthenticated, and a public endpoint must never leak how it’s wired. Second, it stays cheap, because the monitor hits it every minute, forever. A select 1 is free, while a health check that runs five real queries is a self-inflicted load.
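
Seen from the monitor’s side, every ping ends one of three ways: a 200, a non-200 such as the 503 from the catch block, or no HTTP response at all. The sketch below spells out that decision. It is an illustration of the logic only, not Better Stack’s or any other monitor’s API, and ProbeResult, interpretProbe, and shouldPage are names invented for it.

```typescript
// What one probe of /api/health observed. `null` means no HTTP response at all:
// connection refused, DNS failure, or the probe's own timeout.
type ProbeResult = { status: number } | null;

type Verdict = 'healthy' | 'degraded' | 'down';

function interpretProbe(result: ProbeResult): Verdict {
  if (result === null) return 'down';          // nothing answered: the hard-down case
  if (result.status === 200) return 'healthy'; // the handler's select 1 succeeded
  return 'degraded';                           // 503 from the catch block, or any other non-200
}

// Anything other than a healthy verdict should reach a human.
function shouldPage(result: ProbeResult): boolean {
  return interpretProbe(result) !== 'healthy';
}
```

The null branch is the case the error monitor can never report, which is why the probe has to come from outside the app.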

Is anyone watching?

Three of the nine rows (error monitoring, audit logs, and uptime) share a property that’s easy to miss: each is completely inert without a human on the other end. This is the first principle from the top of the lesson, and it gets its own section because wiring the alert is only half the job. The alert reaching someone who acts on it is the other half, and it’s the half that doesn’t show up in any curl.

Start with routing. An alert has to land somewhere a human actually looks. In practice that’s two destinations, not one: a Slack channel that someone reads within the hour during business hours, and an on-call page that wakes someone outside them. The error monitor and the uptime monitor both feed these. This is why uptime tools like Better Stack are worth their price: they bundle the on-call scheduling and escalation, so “page a human” is a configured rotation rather than a hope that the right person happens to be looking at Slack at 3 AM.

Then comes escalation, which has to be explicit. Name who is on-call right now, and name what happens if they don’t acknowledge a page: it goes to the next person, then to the whole team. An alert with no escalation path dies silently when the one person it targets is asleep with their phone face-down.
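
An explicit escalation path is small enough to write down as data. The sketch below assumes a fixed acknowledgement window per person, which is an assumption of this example rather than something this lesson prescribes; in practice you configure the equivalent in your on-call tool. EscalationPolicy and whoGetsPaged are hypothetical names, and the people in the usage note are invented.

```typescript
interface EscalationPolicy {
  chain: readonly string[]; // on-call order, primary responder first
  team: readonly string[];  // everyone, paged as the last resort
  ackWindowMinutes: number; // how long each person has to acknowledge (assumed fixed)
}

// Returns who should be paged, given how long the alert has gone unacknowledged.
function whoGetsPaged(policy: EscalationPolicy, minutesUnacknowledged: number): readonly string[] {
  const elapsed = Math.max(0, minutesUnacknowledged);
  const step = Math.floor(elapsed / policy.ackWindowMinutes);
  const person = policy.chain[step];
  // Past the end of the chain (or with an empty chain), the whole team is paged.
  return person !== undefined ? [person] : policy.team;
}
```

With a chain of ana then ben and a ten-minute window, ana is paged for the first ten minutes, ben for the next ten, and after that the whole team.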

Finally, the first week is different. Most launch problems surface in the first seventy-two hours, when real traffic first hits paths your tests never exercised. So through that first week, and above all in those first three days, budget a few minutes daily to actively watch the dashboards rather than waiting for an alert: the new-error count in your error monitor, the audit log growing as expected, the rate-limit dashboard for unusual spikes, and the function error rate. Treat this as a short, deliberate spend, not a permanent burden. After the first week, the alerts you tuned take over and you go back to being paged only when it matters.

Re-run it, don’t frame it

One last thing about the checklist, so you don’t misread a green run: it is not a launch-day trophy you hang on the wall. It’s a recurring inspection. Re-run the whole list quarterly, and watch the dashboards daily through the first week. Treat any row that was green but isn’t anymore as a regression: a security header silently dropped by a config change, or a rate limit that stopped firing after a refactor. A net that quietly came down is more dangerous than one you knew was never up.
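
Spotting a regression means comparing two runs of the list, so it helps to keep the result of each run. A minimal sketch, assuming you record every run as a map from row name to whether it was green; RunResult and findRegressions are hypothetical names, and the row names in the usage note are only examples.

```typescript
// One run of the checklist: row name -> whether the row was green.
type RunResult = Record<string, boolean>;

// Rows that were green in the previous run and are not green now.
// A row missing from the current run counts as not green.
function findRegressions(previous: RunResult, current: RunResult): string[] {
  return Object.keys(previous).filter(
    (row) => previous[row] === true && current[row] !== true,
  );
}
```

If last quarter’s run had security headers and rate limits both green and this quarter’s has rate limits red, the function returns just the rate-limits row. A row that was never green is not a regression, it is simply still open.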

That closes this chapter’s arc. You’ve taken a green CI gate, shipped it to a real production URL, configured the platform an experienced engineer actually configures on day one, learned how to roll back when a deploy goes wrong, and now you can run the launch question against your own deploy and get an honest yes or no.

There’s one thing this checklist has been quietly treating as a black box: the database schema. The list verifies that your database is pooled, region-matched, and restorable, but it says nothing about how you change its shape once real customer data is sitting in it. Adding a column, renaming one, or dropping one against a live database, with traffic flowing and without an outage, is its own discipline. That’s the next chapter: the expand-migrate-contract cadence that lets a production schema change safely while the app keeps serving. For now, you have what this chapter set out to give you, the ability to look at your own live URL and say, with a checklist to back it up, whether it’s launched.

Vercel: Production Checklist (vercel.com). Vercel’s own pre-launch checklist for performance, reliability, and security.

Better Stack: Uptime monitoring docs (betterstack.com). Setting up the external monitor, on-call rotations, and escalation policies.

OWASP: Application Security Verification Standard (owasp.org). The reference checklist for what “secure enough to launch” actually means.