Skip to content
Chapter 94Lesson 7

Indexes and N+1 in production

Diagnosing and fixing the two classes of Postgres slowness, the missing index and the N+1 query, from production traces, and building the checklist and weekly review that keep them away.

The app has been live for months. It was fast on launch day, and then, slowly, it wasn’t. Nothing broke: there was no incident, no alert, no 500. The pages just started taking a beat longer than they used to, and one day a customer mentioned it. You open the traces and find two suspects.

The first is the org invoice list. It is consistently slow, about 280ms on every load for one big tenant. One page, one number, reliable. The second is the audit-log page, slow in a completely different shape: not one slow query but fifty tiny ones where there should have been one. Two slow pages, two different signatures, two different fixes.

You already know how to fix both. Back in Indexes you learned to design the right index, in Spotting N+1 you learned to kill an N+1, and last lesson you untangled RSC waterfalls. What you have not done is operate any of it in production. This lesson is about that gap. You’ll look at a real production trace, recognize which of the two SQL failure classes it is showing you, confirm it, fix it so the fix holds past launch, and then set up the recurring checks that keep both from coming back.

This is the chapter’s closer, the database half of the same vigilance discipline you’ve been building all chapter. The model is small enough to keep in your head: production hands you a signature, the signature names the class, the class names the fix, you confirm the fix, and you schedule the next look. By the end you’ll have two tools for running that discipline: a pre-launch DB checklist and a weekly review habit.

Almost all SQL slowness at production scale is one of two things. That claim is useful, because it means most of the diagnostic work comes down to a single yes-or-no question. The key is that you tell the two apart by how they look in a trace, not by reading the SQL. You don’t open a query plan first. You look at the shape of the spans, and the shape tells you which plan to go read.

A quick refresher on the vocabulary, since you met it last lesson. A span is a single timed operation, so one query is one span. A trace is the whole tree of spans for one request. The two failure classes draw two unmistakable pictures on that tree.

A missing or wrong index looks like one fat span: a single query that takes 100ms or more, sitting there alone, because it is scanning far more rows than it returns. It reads the whole table to hand you twenty rows. There is nothing wrong with the number of queries. There is one query, and it is doing too much work.

An N+1 looks like a staircase of thin spans: many queries, each one fast, from sub-millisecond to a few milliseconds, all the same shape with a different bound parameter. One query loads a list, then one more query runs per row in that list. No single span is slow. The slowness is in the count: fifty round trips where one would do. You saw this exact staircase in the query log back in Spotting N+1; now you’re seeing it in a trace.

Look at the two side by side. They could not be more different.

GET /invoices — 285ms
select … from invoices where organization_id = $1 order by created_at desc limit 20 280ms
0ms
100ms
200ms
300ms
One query — but it scans 50,000 rows to return 20. It is one bar, and it is fat: the slowness is all in this single span.

Now collapse that into the artifact you’ll actually carry out the door. This little table is the spine of the whole lesson, so we’ll call it the two-class table and refer back to it by name.

| | Missing / wrong index | N+1 | | --- | --- | --- | | Production signature | One fat span (100ms+) | A staircase of thin spans | | Confirming tool | explain (analyze, buffers): Seq Scan, Rows Removed by Filter ≫ rows returned | Count statements-per-render, or read the span staircase | | Structural fix | The right index, usually the composite (organization_id, …) | Collapse to one statement: Drizzle with or a join | | Where you learned it | Indexes · Reading EXPLAIN ANALYZE | Spotting N+1 |

Two things to underline before we use it. First, both signatures live in the trace, so you don’t need any special tooling beyond the traces you already turned on when you wired up Sentry. Second, both have fixes that are structural, which means they hold as the app grows. That sets them apart from a cache, which you met last lesson and the one before: a cache hides a slow query behind a saved copy, but the cold request still pays the full cost, and the underlying query is exactly as slow as it ever was. An index or a collapsed query makes the work itself cheaper. That is the difference between vigilance and papering over.

Before going further, prove to yourself you can read a signature and route it to its class. The new skill this lesson is built around is that recognition step, not writing the index or the join, both of which you’ve already done.

Each chip is a production signature — what you'd actually see in a trace, a plan, or a query log. Drop each one into the failure class it points to. Recognizing the class is the whole skill: the class names the fix. Drag each item into the bucket it belongs to, then press Check.

Missing index One fat span → add the right index
N+1 A staircase of thin spans → collapse to one statement
One 240ms span; the plan shows Rows Removed by Filter: 49,900
48 near-identical 2ms spans, same SQL with $1$48
A findMany to load the list, then a findFirst per row to fetch each row’s parent org
A tenant list query scanning the whole table because nothing indexes (organization_id, created_at)
Statements-per-render grows with the number of rows on the page
One slow span that gets worse the deeper a user pages into the list

With the table in hand, let’s run each class through the full loop on a real production page.

The slow invoice list: a missing composite index

Section titled “The slow invoice list: a missing composite index”

Start with the first suspect: the invoice list, one fat span, 280ms. Take it through the loop you already know, which is diagnose, fix, confirm. The only thing new here is the framing. You ran this loop to learn it; now you’re running it as a routine production operation, on a page real users are hitting.

Diagnose. The trace points at one query. Grab the exact SQL, which you already log because you turned on Drizzle’s query logging back in Spotting N+1 and never turned it off, and paste it into the Neon SQL editor. Prefix it with explain (analyze, buffers), the same portable habit from Reading EXPLAIN ANALYZE, and read the plan.

Limit (cost=… rows=20) (actual time=271.4..271.5 rows=20)
-> Sort (actual time=271.4..271.4 rows=20)
Sort Key: created_at DESC
-> Seq Scan on invoices (actual time=0.0..248.9 rows=20 loops=1)
Filter: (organization_id = '…')
Rows Removed by Filter: 49980
Execution Time: 280.1 ms

There it is: Seq Scan on invoices with Rows Removed by Filter: 49980. The query asked for one tenant’s twenty most recent invoices, and Postgres read all fifty thousand rows in the table, threw away the 49,980 that belonged to other tenants, sorted what was left, and handed you twenty. The cause is one line: the query filters on organization_id and orders by created_at, but no index serves that shape, so the only plan available is to scan everything.

This is why it never showed up in development. At fifty rows in your seed database, a full scan of the table is instant: there’s nothing to scan. The cost only appears at production scale, and specifically at multi-tenant scale, where any one tenant’s rows are a thin sliver of a huge table. A sequential scan that’s free at 50 rows is a 280ms tax at 50,000. The symptom stays invisible until the exact conditions that produce it, many tenants and many rows, actually exist, which is to say until production.

The fix is the composite you already know: one composite index on (organization_id, created_at desc, id desc). The ordering follows the leftmost-prefix rule from Indexes: the equality column (organization_id) first, the sort key (created_at) next, the tiebreaker (id) last. There’s no need to re-derive that ordering; you already did. What’s worth saying here, in the production framing, is why this one index matters more than any other in a multi-tenant SaaS: every tenant-scoped list query has this exact shape. Filter by org, sort by recency, page. The invoice list, the customer list, the activity feed all share the same skeleton. One index pattern carries them all, and skipping it is the single most common reason a multi-tenant app that felt fast in staging falls over in production.

src/db/schema.ts
export const invoices = pgTable('invoices', {
// …columns from the schema chapter
}, (t) => [
index('idx_invoices_org_created_at_id').on(
t.organizationId,
t.createdAt.desc(),
t.id.desc(),
),
]);

Confirm. Re-run explain (analyze, buffers). The plan flips from Seq Scan to Index Scan using idx_invoices_org_created_at_id, and the execution time drops from 280ms to about 4ms, the “after” tab above. This is the discipline beat, the reflex Reading EXPLAIN ANALYZE drilled into you: you do not ship an index on faith. An index the planner decides not to use is dead weight, slowing every write while helping no read. So you confirm the plan actually flipped before you call it done.

One more thread to close, back to the start of the chapter. This is a server-side fix, so it lands in TTFB immediately: the moment the index ships, the server returns the page faster. And TTFB feeds straight into LCP, the chain you drew in The Core Web Vitals. So the page didn’t just measure slow, it felt slow to users, because every other paint was stuck behind a 280ms server wait. Fix the query, and the whole front end gets faster for free.

The audit log’s staircase: N+1 at production scale

Section titled “The audit log’s staircase: N+1 at production scale”

Now the second suspect: the audit-log page, the staircase. Same loop, second class.

Diagnose from the trace. The audit-log page shows about fifty thin spans. The first lists the audit entries: select … from audit_logs where organization_id = $1. Then, for every entry returned, one more query fetches the actor’s name: select … from users where id = $2, then $3, then $4, all the way down. One parent query, N child queries. You recognize this immediately as the many-to-one disguise, shape number four from Spotting N+1: load a list, then do a per-row lookup to resolve each row’s parent. The tell is the same one you learned there: statements-per-render grows with the number of rows on the page. Ten audit entries, eleven queries. Fifty entries, fifty-one queries. That linear growth is the fingerprint.

The fix is the collapse you already know. Pull the actor in the same statement, using Drizzle’s relational with narrowed to just the columns you render, and the fifty-one queries become one.

src/db/queries/audit.ts
const entries = await db.query.auditLogs.findMany({
where: eq(auditLogs.organizationId, orgId),
orderBy: desc(auditLogs.createdAt),
limit: 50,
with: {
actor: { columns: { id: true, name: true } },
},
});

The decision between with and a join is the one-liner you already know: a nested tree of objects (each entry carrying its actor) wants relational with, while a flat or aggregated result wants a join. There’s no need to re-teach that call; you’ve made it. What’s worth adding are two production shapes those exercises didn’t drill, because they’re the ones that bite real apps:

  • List-with-children. The invoice list where each row needs a count of its line items. The naive version fetches the list, then runs a count per invoice, which is the same staircase. The fix is relational with plus a columns projection, or a left join with a group by for the aggregate. Same class, same cure.
  • Check-per-row. This is the sneaky one. Each row needs a permission or scope check that hits the database, asking “can this user see this record?”, and the check fires once per row. It doesn’t look like a data-loading N+1 because it’s hiding inside authorization logic, but it draws the identical staircase. The fix is the same: lift the check to one query per scope, not one per row. It’s worth naming because it’s an N+1 that disguises itself as security code, and it slips past review for exactly that reason.

Confirm. The query log is your proof, the same one you used before: it drops from fifty-one statements to one. And to head off the obvious question, why not just cache the actor lookups: because a cache hides, it doesn’t collapse. The first request, and every cold one after the cache expires, still fires all fifty-one queries. Caching the symptom leaves the N+1 fully intact underneath, while collapsing it makes the N+1 cease to exist. That’s the through-line from the two-class table: a structural fix beats a papered-over symptom.

When parallel queries become the new bottleneck: the connection pool

Section titled “When parallel queries become the new bottleneck: the connection pool”

Last lesson left a debt and pointed at this section to pay it. Time to pay it in full.

The fix for RSC waterfalls was Promise.all: fire independent reads in parallel instead of stacking them in a staircase. It’s the right fix. But adopting a fix without watching the new failure mode it opens is how you trade one problem for another, and this fix opens a real one. A wide Promise.all fan-out, or many concurrent requests each firing several queries, can saturate the connection pool . Every in-flight query holds one database connection, and a pool has a fixed number of connections. Run more queries at once than the pool has slots, and the extra queries don’t fail; they queue, waiting for a connection to free up. Your perfectly parallel page is now fast in isolation but slow under load, and the new bottleneck isn’t serialization anymore, it’s connection supply. The bottleneck moved, it didn’t disappear.

Before you go sizing pools, though, note that on this stack the common read path mostly designs the problem away. Here’s why.

HTTP — the default
Serverless function holds no persistent connection
Neon SQL-API proxy holds its own pre-warmed pool
Postgres compute
WebSocket
App server / function holds N connections
Postgres compute pool sizing matters here
Two transports to the same database. Over **HTTP**, Neon's proxy owns the pool, so your function holds nothing. Over the **WebSocket Pool**, your app holds the connections — and a wide fan-out can exhaust them.

Neon’s serverless driver, the default transport you set up back in The serverless driver and the pooled URL, sends each query over HTTP to Neon’s SQL-API proxy. That proxy keeps its own pre-warmed pool of connections to the compute. So your serverless function never holds a persistent Postgres connection: it makes an HTTP request, the proxy borrows one of its connections to run the query, and hands the result back. The classic “every serverless instance grabs a connection and exhausts the database” problem is sidestepped entirely for the one-shot query path, which is the vast majority of your reads. That’s the modern default, and it’s why this whole problem is mostly designed away on the common path.

Mostly. The HTTP path is for single queries and non-interactive work. An interactive transaction , where a BEGIN, several dependent statements, and a COMMIT all need to hold the same connection open across round trips, can’t go over stateless HTTP. That needs the WebSocket Pool path (the node-postgres-compatible one), and that path is where you genuinely hold app-side connections, and where pool sizing comes back to bite.

So if you’re on the WebSocket pool (or classic pg behind PgBouncer), a small pool plus a wide fan-out still saturates. There are two levers here, named rather than built, because the skill is knowing the failure mode exists and which two levers move it:

  1. Size the pool against expected concurrency. Set the pool large enough for the queries you actually fire at peak. The honest way to find that ceiling is to load-test against a Neon branch: spin up a copy of production, push traffic, and watch where queries start queueing.
  2. Bound the fan-out. Cap how many queries a single request fires at once with a bounded-concurrency helper like pMap set to a concurrency limit, so one request can never grab the entire pool and starve every other request. This is the escape hatch last lesson explicitly deferred to here: when Promise.all over a large list is too wide, swap it for a bounded map.

Here’s a production shape the index lessons didn’t cover, and it’s worth a short section because it masquerades as a missing index when it isn’t quite one.

limit / offset pagination re-scans every row it skips, on every page. Page one with a page size of ten reads ten rows. Page one hundred reads about 1,010 rows to return ten, because it has to walk past the thousand it’s skipping to find the ten you asked for. The cost is cheap at the front and degrades linearly with depth. In a trace this looks like a missing-index slow span with one tell-tale twist: it gets worse the deeper users page. That “worse with depth” curve is its distinctive signature. The catch is that even with the right index, offset still pays the skip: the index finds the starting point fast, but offset then throws away everything before your page anyway.

The fix is keyset pagination, also called cursor pagination, and you’ve already built it. Instead of “skip 1,000 rows,” you say “give me the rows after this cursor”: where created_at < $cursor order by created_at desc, id desc limit 10. With the composite (organization_id, created_at, id) index from the first section, that’s O(limit) regardless of depth: page one and page one thousand cost the same, because there’s no skipping, just a seek to the cursor and a read of ten. The cursor mechanics are owned upstream, since you learned the shape in Cursor pagination and the policy in Cursor by default, offset when small, so there’s no need to re-teach them. The production-vigilance point is the one worth carrying: the composite index from the first section is the same index that makes keyset pagination fast. One index, two payoffs.

select * from invoices
where organization_id = $1
order by created_at desc
limit 10 offset 1000;

Linear-degrading by depth. To return 10 rows on page 100, Postgres first walks past the 1,000 it’s skipping, about 1,010 rows read. Cheap on page 1. Even with the right index, the skip is still paid.

Watching for the next slow query: the monitoring surface

Section titled “Watching for the next slow query: the monitoring surface”

Everything so far has been reactive: a trace told you a page was slow, and you went and fixed it. That works, but it means a user felt the pain first. The senior move is to go looking before anyone reports anything, to find the next slow query while it’s still merely the slowest, not yet slow enough to notice.

The surface for that is Neon’s Monitoring page, specifically its query-performance view. It’s powered by pg_stat_statements , a Postgres extension that records execution stats for every statement: how often each ran, total time, and mean time. The monitoring page ranks your queries by those numbers, so the slowest and most frequent ones float to the top. (On plain Postgres without Neon, you query pg_stat_statements directly: same data, no dashboard.) This is where the next slow query announces itself.

This brings the chapter’s recurring-vigilance thread to the database. Last lesson’s habit was “open one slow trace a week.” The database version: once a week, eyeball the top three slowest queries on the monitoring page. If one is a regression, either newly slow or climbing, run explain (analyze, buffers) on it, then route it through the two-class table. Is it a fat span? Index it. Is it a staircase? Collapse it. The discipline isn’t the one-time fix, it’s the weekly look. That’s the database analogue of the CI gate and the weekly trace review you’ve built across this whole chapter.

When you spot a slow one on that page, the first move is automatic.

Your weekly review flags a query that’s just climbed into the top three by mean execution time on the monitoring page. You don’t yet know whether it’s a fat span or a staircase. What’s your first move?

Run explain (analyze, buffers) on it and read the plan.
Add an index on the column in its where clause and move on.
Wrap it in 'use cache' so warm requests skip the latency.
Bump the connection pool size.
Neon — Monitor query performance
neon.com

The monitoring surface this lesson's weekly review uses, powered by pg_stat_statements.

You now have the recurring half of the discipline, the weekly review. The other half is the one-off deep pass you run before you ship, the database twin of the pre-launch Lighthouse audit and the bundle pass from earlier in the chapter. This is the chapter’s closing artifact, and along with the two-class table, it’s the thing to carry out of this lesson.

Run this before launch. There are six checks: one profiling pass over your five most-frequent queries, then five structural guarantees, and every one uses a primitive you already own.

Run explain (analyze, buffers) on the five most-frequent queries (primary-entity list, detail page, dashboard aggregation, search, and login/auth lookup) against a production-shaped branch, never the dev seed. Dev-scale stats lie, exactly as the invoice list showed.
Confirm the composite (organization_id, …) index, the load-bearing multi-tenant index, exists on every org-scoped table.
Confirm every foreign-key column is indexed. Postgres does not index foreign keys automatically, the day-one trap from Indexes.
Confirm every large list paginates by keyset, not deep offset.
Confirm the connection transport and pool sizing match expected concurrency: HTTP for one-shot reads, a sized WebSocket pool only where interactive transactions need it.
Bookmark the Neon monitoring page (and set a slow-query alert if available) so the weekly review has a home.

That closes the chapter, so let it land on the discipline rather than a new API. Two threads from the chapter thesis tie it together. First, defaults before audits: the structural moves, the composite index, the keyset shape, and the right transport, are what actually make the database fast. The monitoring page doesn’t speed anything up; it just shows you where a default leaked. Second, vigilance is recurring: the checklist runs once, at launch, while the weekly review runs forever. Production hands you a signature, the signature names the class, the class names the fix, you confirm the fix, and then you schedule the next look. That loop is the whole job.