Chapter 38Lesson 8

Full-text search in Postgres

Build a word-aware, ranked search box directly in Postgres with tsvector, websearch_to_tsquery, and ts_rank, and learn the threshold where you'd reach for a dedicated search engine instead.

The invoices list works. Now product wants a search box above it: type a few words, get the matching invoices back, ranked by relevance. Your first instinct is the tool you already have. You learned ilike in the CRUD lesson, so you reach for it:

db.select()
  .from(invoices)
  .where(ilike(invoices.description, `%${term}%`));

Run that and it sort of works, until it doesn’t. There are four problems, and each one is something a real user will do in the first five minutes:

They search invoices and the row that says “invoice” never shows up. %invoices% is a literal substring; it has no idea that “invoice” and “invoices” are the same word.
They search overdue payment and the row that says “payment overdue” is missing. A substring match can’t reorder words.
The leading % means Postgres can’t use a normal index, so it scans every row and the box gets slower as the table grows.
They search the and every single invoice comes back, because %the% matches “the”, “theme”, “other”, and “weather”.

That last problem isn’t something you can patch; it’s a sign you’re using the wrong tool. ilike matches characters, but search needs to match words, with all their messiness: plurals, tenses, word order, and noise words like “the”. Matching words is what Postgres full-text search is built for, and it ships in the box.

That leads to the question this lesson is really about. A SaaS app needs search, so is that a Postgres feature or a separate service like Algolia or Meilisearch? It’s a build-versus-buy call, and for the great majority of apps in 2026 the answer is to reach for Postgres first. Its full-text search comfortably covers a corpus into the low millions of documents at modest query rates, which describes almost every SaaS that already runs Postgres. An external search engine is a real power tool, but the threshold where you need it sits further out than you’d think. By the end of this lesson you’ll have a ranked, highlighted, injection-safe search query running against Postgres, and you’ll know the line where you’d leave it for something else.

Lexemes, vectors, and queries: the mental model

Before any Drizzle, fix the one idea everything else builds on. Full-text search does not compare strings. It compares normalized words, and it normalizes both sides, the stored text and the search term, into the same shape before comparing them. Once you have this picture, every surprising behavior later stops being surprising.

The whole engine is three terms and one operator.

A lexeme is a normalized word stem. The words running, ran, and runs all reduce to the lexeme run. This normalization (lowercasing, stripping plurals and verb tenses, dropping noise words) is the whole trick. It’s what lets a search match meaning instead of characters, which is exactly the thing ilike couldn’t do.

A tsvector is a document that’s already been broken down into its lexemes. Run to_tsvector('english', 'The cats are running') and Postgres gives you back 'cat':2 'run':4: the and are are dropped as noise, cats becomes cat, running becomes run, and each surviving lexeme remembers what position it held. The original sentence is gone; what’s left is its meaning, in a form built for matching.

A tsquery is a search expression living in that same lexeme space. The query 'run' matches the vector above. The query 'cat & run' matches it too (both lexemes are present). The query 'dog' doesn’t match (that lexeme isn’t there).

And @@ is the match operator that connects them. tsvector @@ tsquery returns a boolean: does this document contain these search lexemes? That single line is the engine, and everything else in this lesson is plumbing around it.

The diagram below shows the whole flow. Watch how both sides, the stored document and the search term, pass through their own normalization and arrive as lexemes, and only then meet at @@. This one move, normalizing both sides into the same space before comparing, explains every later behavior at once: why a plural matches a singular, why “the” disappears, and why an exact-substring search will sometimes surprise you.

stored document “The cats are running”

normalizer to_tsvector('english', …)

cat _:2 run _:4

stop words dropped, words stemmed

search term “running cat”

normalizer websearch_to_tsquery('english', …)

catrun

user input, same normalization

@@ match

Both the stored text and the search term are normalized into lexemes, then compared with `@@`. The match works on meaning, not characters.

The 'english' argument in those calls does real work. It picks the text search configuration , which sets the stemming rules and the stop word list. 'english' knows English plurals, tenses, and noise words. There’s also a 'simple' config that does no stemming and keeps every word. That one is useful when you want near-exact tokens, like searching SKUs or code identifiers where “running” should not collapse into “run”. For a single-language SaaS, hardcode 'english' and move on. For a multilingual one, you’d store the language per row and pass it in dynamically, which is worth knowing about but not worth drilling here.

A first search, computed at query time

Now make it runnable. We’ll write the simplest version that works first, deliberately the unoptimized one, because the optimization later only makes sense once you’ve seen what it fixes.

The query builds the vector right there in the where clause, from the two columns you want to search, description and customerName:

db.select()
  .from(invoices)
  .where(
    sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, '')) @@ websearch_to_tsquery('english', ${term})`,
  );

The left side of the match builds a tsvector on the fly, concatenating the two columns with || (SQL string concat). coalesce(description, '') swaps a null description for an empty string so the concat doesn’t collapse the whole expression to null. This is the same nullability care from the schema chapter, since description is nullable.

db.select()
  .from(invoices)
  .where(
    sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, '')) @@ websearch_to_tsquery('english', ${term})`,
  );

The match operator sits in the middle, asking “does this row’s text contain the search lexemes?” and returning the boolean Drizzle drops straight into the where.

db.select()
  .from(invoices)
  .where(
    sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, '')) @@ websearch_to_tsquery('english', ${term})`,
  );

The right side turns the user’s raw search string into a tsquery. ${term} is interpolated, but inside a sql template it binds as a $1 parameter, the same injection-safe path as every eq(col, value) you’ve already written. The user’s text never becomes SQL structure.

db.select()
  .from(invoices)
  .where(
    sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, '')) @@ websearch_to_tsquery('english', ${term})`,
  );

The whole expression lives inside an ordinary Drizzle where. The builder owns the query’s structure; the sql template fills in this one predicate. It’s the embedding pattern from the subqueries lesson, pointed at a new function.

1 / 1

This runs, and for a demo table it’s fine. But look at what it asks Postgres to do: to_tsvector(...) executes once per row, on every single search. Every time someone types in the box, Postgres re-reads and re-tokenizes the text of every invoice in the table before it can compare anything. There’s no index it can use, because the thing being matched doesn’t exist until the query runs. At a few dozen rows you won’t notice. At a few thousand it’s a sequential scan that drags, and at real volume it stops responding. That’s the exact problem the next section fixes. The naive shape is fine to write first, but you don’t want to ship it.

One more decision in that query is worth pausing on: which function turns user input into a tsquery. There are several, and only one is safe to point at a raw search box.

to_tsquery (raw input → throws)
websearch_to_tsquery (raw input → safe)

-- user typed: overdue payment
to_tsquery('english', 'overdue payment')
-- ERROR: syntax error in tsquery: "overdue payment"

to_tsquery expects operator syntax, not human text. A bare space between words is a syntax error; it wanted overdue & payment. It also exposes raw operator characters (&, |, !, :) to whatever you pass it. So pointing it at a user’s search box throws on the first multi-word query and leaks query-language syntax. Use it only for queries you build yourself from trusted parts.

-- user typed: overdue payment
websearch_to_tsquery('english', 'overdue payment')
-- → 'overdue' & 'payment'   (parsed, no error)

websearch_to_tsquery parses Google-style input safely. Bare words get AND-ed, "quoted phrases" stay together, or does what you’d expect, and -word excludes. It never throws on arbitrary input and exposes no operator syntax. This is the only function you point at a raw search box.

The rule to carry out of here: websearch_to_tsquery is the function you point at user input, every time. to_tsquery is for server-constructed queries assembled from parts you control. Mix them up and your search box throws an error the first time someone types two words. The coalesce guarding the nullable column is just coalesce , a one-line guard rather than a topic of its own.

Moving the vector off the hot path: the generated column

The fix for “re-tokenizes every row on every query” is to tokenize each row once, when it’s written, and store the result. You already know the mechanic: it’s a STORED generated column from the schema chapter, the kind that’s computed at write time and stays in sync automatically. We just point it at full-text search: a column whose value is the tsvector for that row, recomputed by Postgres on every insert and update, never by the query.

There’s one Drizzle-specific snag that trips everyone, so let’s clear it up front. Drizzle ships no tsvector column builder. There’s no tsvector('search_vector') the way there’s text(...) or integer(...), so you define your own column type once with customType:

import { customType } from 'drizzle-orm/pg-core';

const tsvector = customType<{ data: string }>({
  dataType() {
    return 'tsvector';
  },
});

That’s the whole helper. customType takes the shape of the data on the TypeScript side ({ data: string }, since a tsvector reads as a string in JS) and a dataType() returning the literal SQL type name. Now tsvector(...) behaves like any other column builder. With that in hand, the column on the table reads almost exactly like the inline expression from before, just attached to the schema instead of the query:

const tsvector = customType<{ data: string }>({
  dataType() {
    return 'tsvector';
  },
});

export const invoices = pgTable('invoices', {
  // …existing columns
  searchVector: tsvector('search_vector')
    .notNull()
    .generatedAlwaysAs(
      (): SQL =>
        sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, ''))`,
    ),
});

The custom column type, defined once. { data: string } is how the value looks in TypeScript, since a tsvector reads as a string in JS, and dataType() returns the literal SQL type name tsvector. This is the import you add: customType from drizzle-orm/pg-core.

const tsvector = customType<{ data: string }>({
  dataType() {
    return 'tsvector';
  },
});

export const invoices = pgTable('invoices', {
  // …existing columns
  searchVector: tsvector('search_vector')
    .notNull()
    .generatedAlwaysAs(
      (): SQL =>
        sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, ''))`,
    ),
});

generatedAlwaysAs takes a callback returning a SQL expression, the same vector-building expression from the query-time version, now computed at write time. Note the explicit : SQL return annotation: sql comes from drizzle-orm, and the type from import type { SQL } from 'drizzle-orm' (the project’s verbatimModuleSyntax requires the import type form for the type-only import).

const tsvector = customType<{ data: string }>({
  dataType() {
    return 'tsvector';
  },
});

export const invoices = pgTable('invoices', {
  // …existing columns
  searchVector: tsvector('search_vector')
    .notNull()
    .generatedAlwaysAs(
      (): SQL =>
        sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, ''))`,
    ),
});

The same nullable guard as before: description can be null, so coalesce(col, '') keeps the concatenation from collapsing to null. The logic didn’t change; only when it runs did.

const tsvector = customType<{ data: string }>({
  dataType() {
    return 'tsvector';
  },
});

export const invoices = pgTable('invoices', {
  // …existing columns
  searchVector: tsvector('search_vector')
    .notNull()
    .generatedAlwaysAs(
      (): SQL =>
        sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, ''))`,
    ),
});

The vector is always derivable, since the coalesce guarantees a value even when both source columns are empty, so the column is NOT NULL. Every row has a searchable vector.

1 / 1

Why generated rather than something you maintain yourself? Because the column can never drift. Every insert and every update recomputes the vector straight from description and customerName, so there’s no trigger to write and no application code that has to remember to rebuild the vector after an edit. If you’ve read older Postgres codebases you’ve seen the pre-generated-columns version of this: a BEFORE INSERT OR UPDATE trigger calling to_tsvector by hand. That’s the pattern generated columns replaced. Recognize it in legacy code, but don’t write it.

One dependency to name now, even though you won’t act on it here. A tsvector column is only fast to query if it has a GIN index behind it. Without one, Postgres still has to look at every row’s vector: you’ve moved the tokenization off the hot path but not the scan. The index has this shape, following the project’s naming convention:

(t) => [index('idx_invoices_search_vector_gin').using('gin', t.searchVector)];

Recognize it, but don’t worry about it here. Declaring and tuning indexes (what GIN actually does, what it costs on writes, how to verify it’s used) is the job of the first lesson of the next chapter, on performance and integrity. This lesson gets the query correct; that one makes it fast. The two ship together in practice: a generated tsvector column and its GIN index are a pair, and you’d add both in the same migration. We’re only drawing the line at where this lesson stops.

Here’s the payoff. With the vector precomputed and stored, the search query’s left side collapses from that whole to_tsvector(...) expression down to a bare column reference:

Query-time vector (re-tokenizes every row)
Generated column (@@ hits the GIN index)

db.select()
  .from(invoices)
  .where(sql`to_tsvector('english', coalesce(${invoices.description}, '') || ' ' || coalesce(${invoices.customerName}, '')) @@ websearch_to_tsquery('english', ${term})`);

The left side is rebuilt for every row on every query. No index can apply, because the matched value doesn’t exist until the query runs. It’s fine for a demo, but it becomes a sequential scan at volume.

db.select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ websearch_to_tsquery('english', ${term})`);

The left side is a stored, pre-tokenized column. @@ now matches against an indexed value, so the GIN index can answer it without touching every row. Same query, but the predicate shrank to a column reference, and that’s the performance win.

Ranking results with ts_rank and a tiebreaker

A search that returns matches in arbitrary order is only half-built. When ten invoices match consulting, the one whose description is about consulting should sit above the one that merely mentions it in passing. That’s ranking, and Postgres gives it to you: ts_rank scores a vector against a query, where higher means more relevant, factoring in how often the terms appear and how the document is weighted. (There’s also ts_rank_cd , the cover-density variant, which rewards matched terms that sit close together; reach for it when proximity matters.) You order by it descending.

But sorting by rank alone reintroduces a bug you’ve already met twice in this chapter. ts_rank produces ties: plenty of rows score identically, especially short ones. A list ordered only by a column with ties has no defined order within a tie, so the database is free to return those rows in a different sequence each time the query runs. Paginate that and rows reshuffle between pages, the same silent bug the CRUD and cursor-pagination lessons covered. The fix is the same one: pair the sort with the primary key to force a total order.

db.select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ websearch_to_tsquery('english', ${term})`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, websearch_to_tsquery('english', ${term}))`),
    asc(invoices.id),
  );

There’s a smell in that version, though: websearch_to_tsquery('english', ${term}) now appears twice, once in the where and once inside ts_rank in the orderBy. Two copies of the same expression means two places to keep in sync when the term changes. Extract it to a const, exactly the way the aggregations lesson extracted a repeated sql fragment, and reference that const in both places. (The builder has no positional ORDER BY 1 shortcut, so the const is the way to share the expression.) Here’s the shape an experienced engineer would actually ship:

const queryExpr = sql`websearch_to_tsquery('english', ${term})`;

const results = await db
  .select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ ${queryExpr}`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, ${queryExpr})`),
    asc(invoices.id),
  );

The query expression, extracted once. ${term} still binds as a $1 parameter; pulling the expression into a const changes nothing about parameterization, it just removes the duplication.

const queryExpr = sql`websearch_to_tsquery('english', ${term})`;

const results = await db
  .select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ ${queryExpr}`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, ${queryExpr})`),
    asc(invoices.id),
  );

First use. The match predicate in the where references the const instead of inlining the websearch_to_tsquery(...) call.

const queryExpr = sql`websearch_to_tsquery('english', ${term})`;

const results = await db
  .select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ ${queryExpr}`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, ${queryExpr})`),
    asc(invoices.id),
  );

Second use. ts_rank scores each matching row against that same query, and desc(...) puts the most relevant first: one source of truth for the expression, two call sites.

const queryExpr = sql`websearch_to_tsquery('english', ${term})`;

const results = await db
  .select()
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ ${queryExpr}`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, ${queryExpr})`),
    asc(invoices.id),
  );

The tiebreaker. When two rows score equal rank, the primary key decides, giving a total, stable order that paginates deterministically. Same reflex as the CRUD and cursor lessons.

1 / 1

One typing note, the same caveat as every sql<T> you’ve met. ts_rank returns a float, and if you project it into your result you can claim its type with sql<number>`ts_rank(...)`. That’s a TypeScript-side claim, not a runtime check: you’re telling the compiler what shape to expect, and it trusts you. The full story on sql<T> is the raw-SQL lesson; one line is all you need here.

Highlighting matches with ts_headline

A good search UI shows the user why a row matched: the snippet of text with their terms emphasized, the way every search engine you’ve used does. Postgres has this too. ts_headline takes the config, the original text, and the query, and returns an excerpt with the matched words wrapped, by default in <b>…</b> and configurable through options like StartSel/StopSel and MaxWords. Add it to the projection:

const results = await db
  .select({
    id: invoices.id,
    customerName: invoices.customerName,
    snippet: sql<string>`ts_headline('english', ${invoices.description}, ${queryExpr})`,
  })
  .from(invoices)
  .where(sql`${invoices.searchVector} @@ ${queryExpr}`)
  .orderBy(
    desc(sql`ts_rank(${invoices.searchVector}, ${queryExpr})`),
    asc(invoices.id),
  );

The sql<string> claim on snippet is right, since ts_headline returns text. Two things to know so it doesn’t bite you:

ts_headline works on the raw text column (description), not the tsvector. The vector threw away the original words, so it has nothing to excerpt. ts_headline re-reads the source column to build its snippet, which means it isn’t free. Let it ride on the already-filtered, already-ranked result set (a handful of rows), and never compute it across the whole table.
The <b> tags it returns are real markup your UI has to render on purpose. In a React surface that means controlled rendering of trusted server markup, not dropping raw HTML into the page. That’s a render-layer concern for a later UI lesson; the point here is just to know it’s markup so you don’t blindly inject it. This lesson is the data layer.

When Postgres is enough, and when to leave it

Now that you know what Postgres full-text search actually does, return to the question we opened with. The decision is a threshold, not a verdict, and the threshold sits further out than newcomers expect.

Postgres full-text search is the default, and it stays the answer when your corpus is up to the low millions of documents, query rate is modest (a handful per second, not search-as-you-type at scale), relevance needs are “rank by match” rather than tuned boosting, and, the big one, the data already lives in Postgres. That last point is doing most of the work: no separate service to run, no second copy of your data to keep in sync, no consistency gap between “what’s in the database” and “what’s in the search index,” and transactional correctness for free. For the great majority of SaaS apps, that’s the entire story. You ship search, and you don’t take on a whole extra system to do it.

An external engine (Algolia, Meilisearch, Typesense, OpenSearch, Elastic) earns its weight when the corpus runs into many millions with high query throughput, when typo tolerance, synonyms, or “did you mean” are an actual product requirement, when you need faceted search across many fields, when relevance needs per-tenant boost rules or learning-to-rank, or when instant search-as-you-type latency is a hard budget. What you take on in exchange is real: a second datastore that every write has to fan out to, eventual consistency between Postgres and that index, and a new operational surface to run and pay for. That’s the trade: power in exchange for a sync pipeline and a second source of truth.

There’s an in-database step before you leave, though, and it’s worth knowing. pg_trgm does trigram similarity matching (similarity(a, b), the % operator, <-> distance), which handles typos and partial or substring matching, the exact thing full-text search’s lexeme model can’t do. If typo tolerance is your only gap, pg_trgm is the next thing to try, and it lives right next to full-text search in the same database. (Semantic or vector search over embeddings, finding documents by meaning rather than words, is a different tool again, pgvector, and it’s out of scope here; the AI unit near the end of the course covers it.)

Walk the decision yourself. Answer the questions in order, the way you’d ask them on a real project: scale first, then the kind of matching the product needs. Each path lands on what you’d actually reach for.

Search: Postgres, pg_trgm, or an external engine?

Practice: write a ranked search query

Time to write it yourself. The table below already has the generated search_vector column and a few invoices seeded for one organization. Your job is the search query: return the invoices for org 1 that match the term consulting, most relevant first.

You’ll need three things, all from this lesson: match the stored vector against websearch_to_tsquery('english', 'consulting') with @@, order by ts_rank(...) descending with id as the tiebreaker, and scope the whole thing to organization_id = 1 (the tenant where reflex you carry by hand on every query).

The seed is built to expose one thing: this is where you feel the difference between full-text search and the ILIKE '%consulting%' you might still be tempted to reach for. One invoice’s description says “We consult monthly”, which stems to consult, so the lexeme search matches it but ILIKE '%consulting%' would walk right past it. Another row says “Consultancy agreement”, a different word that neither approach should match. Get the stemming-aware query right and the correct rows fall out; reach for the substring habit and you’ll miss one.

Return invoices for organization 1 matching the search term 'consulting', most relevant first. Match against search_vector with websearch_to_tsquery, rank with ts_rank, tiebreak by id, and scope to organization_id = 1.

View schema & data

CREATE TABLE invoices (
  id int PRIMARY KEY,
  organization_id int NOT NULL,
  description text,
  customer_name text,
  search_vector tsvector GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(description, '') || ' ' || coalesce(customer_name, ''))
  ) STORED
);
INSERT INTO invoices (id, organization_id, description, customer_name) VALUES
  (1, 1, 'Overdue invoice for consulting services', 'Northwind Traders'),
  (2, 1, 'Consulting retainer, paid in full', 'Acme Corp'),
  (3, 1, 'We consult monthly on infrastructure', 'Globex'),
  (4, 1, 'Hardware purchase, no services', 'Initech'),
  (5, 1, 'Consultancy agreement, draft', 'Umbrella'),
  (6, 1, NULL, 'Consulting Partners LLC');

The expected set is rows 1, 2, 3, and 6: the two descriptions that say “consulting” outright, the one that stems from “consult”, and the customer named “Consulting Partners LLC” (row 6 matches on customer_name even though its description is null, which is exactly why the coalesce in the generated column matters). Row 4 has no consulting anywhere, and row 5’s “Consultancy” stems to a different lexeme, so both are correctly excluded. The ranking and the id tiebreaker make the order deterministic, so the checklist matches exactly.

External resources

These cover the canonical patterns, the reference material, and the build-vs-buy comparison if you want to go deeper than this lesson.

Drizzle — PostgreSQL full-text search

orm.drizzle.team

Drizzle's official guide to wiring to_tsvector, @@, and a GIN index through the query builder.

PostgreSQL — Controlling Text Search

postgresql.org

Reference for websearch_to_tsquery, ts_rank, and ts_headline and all their options.

pg_trgm — trigram matching

postgresql.org

The in-database fuzzy-match escape hatch for typo tolerance, when lexeme search isn't enough.

Postgres full-text search vs the rest

supabase.com

Supabase benchmarks Postgres FTS against Elasticsearch, Typesense, and Meilisearch — the build-vs-buy line this lesson closes on.