Chapter 107, Lesson 3

Embeddings and pgvector RAG

Ground AI answers in your own data with retrieval-augmented generation, using embeddings stored in Postgres via pgvector.

The chat surface you built with tools can answer “what’s the total on invoice #INV-203?” because a tool fetches that one row from the database and hands it back. The model never knew the answer; it knew how to ask for it. But now a user types “what’s our refund policy?” or “summarize the themes across my four thousand support tickets,” and the same machinery falls apart. There’s no single row to fetch. The answer lives in a body of text the model never trained on, such as your internal handbook or the customer’s own ticket archive, and that text is far too big to paste into the prompt on every turn.

The move is to pull only the relevant pieces of that body of text and feed them to the model as context for this one question. That is retrieval-augmented generation (RAG), and it’s the subject of this lesson. Before any of the machinery, though, the more valuable thing to learn is when to reach for it, because the most common way this goes wrong is building the whole retrieval apparatus for a body of text that would have fit in a single system prompt. So we’ll lead with the decision, then build the one path that follows from it.

None of this is a new stack. RAG rides the same streamText route handler you already wrote, reads from the same Postgres the app already runs, and obeys the same session.orgId discipline as every other query. It is not a new system bolted onto your app; it is a query that enriches the system prompt before the model runs.

When retrieval earns its weight

Here is the threshold an experienced engineer applies. Two conditions must both hold before RAG pays for itself.

The first condition is that the corpus is internal, so the model can’t have trained on it. Your company handbook, a customer’s private knowledge base, the user’s own uploaded documents, last week’s call transcripts: none of that was in the model’s training data, so the model genuinely cannot know it. Public, well-trodden information is the opposite case. Ask any modern model what HTTP status code means “not found” and it answers correctly with no help from you. If the model already knows it, retrieval is wasted work.

The second condition is about size and freshness, and one of two things has to be true. Either the corpus is too large to fit the context window, like a three-hundred-page handbook; even where it technically does fit, you’d be paying to send all three hundred pages on every single message. Or the corpus is small but changes fast enough that recency matters, like a pricing table that updates weekly, which can’t be frozen into a system prompt you deployed last month.

Stay underneath that threshold and the disciplined move is the boring one: paste the text straight into the system prompt and don’t build anything else. Take a two-page returns policy that rarely changes. Put it in the prompt. Building a vector store for it is over-engineering, because you’ve added a database table, an indexing job, and a query path to solve a problem that a few hundred tokens of prompt already solved. Cross the threshold, with a large handbook, ten thousand tickets, or a knowledge base that grows daily, and now retrieval earns its weight.

The threshold sits where it does because of cost, the same discipline you met when capping token usage. Stuffing the same N tokens of context into every request bills you N tokens per call, forever, scaling linearly with traffic. Retrieval pays the embedding cost once when you index the text, then sends only a handful of relevant passages per question, a bounded and small amount of context no matter how big the corpus grows. RAG is a cost-and-fit decision, not an “is this an AI feature” decision.
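The cost argument above can be made concrete with a little arithmetic. The following is a minimal sketch with made-up numbers (a 150,000-token corpus, 10,000 calls a month, 5 retrieved passages of about 300 tokens each); the function names are hypothetical, and the sketch deliberately ignores the one-time indexing cost and the small per-question embedding call.

```typescript
// Hypothetical figures for illustration only; substitute your own corpus and traffic.
interface ContextCost {
  tokensPerCall: number;
  tokensPerMonth: number;
}

// Prompt stuffing: the whole corpus rides along on every call.
function stuffingCost(corpusTokens: number, callsPerMonth: number): ContextCost {
  return {
    tokensPerCall: corpusTokens,
    tokensPerMonth: corpusTokens * callsPerMonth,
  };
}

// Retrieval: only the top-K passages ride along, however large the corpus grows.
function retrievalCost(topK: number, tokensPerChunk: number, callsPerMonth: number): ContextCost {
  const tokensPerCall = topK * tokensPerChunk;
  return { tokensPerCall, tokensPerMonth: tokensPerCall * callsPerMonth };
}

const stuffed = stuffingCost(150_000, 10_000);
const retrieved = retrievalCost(5, 300, 10_000);

console.log(stuffed.tokensPerMonth); // 1500000000 context tokens per month
console.log(retrieved.tokensPerMonth); // 15000000 context tokens per month
```

Under these assumed numbers the stuffed prompt sends a hundred times more context per month, and the gap widens as the corpus grows, because only the stuffed figure depends on corpus size.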

One alternative deserves a name and nothing more: a long-context-window model can swallow a mid-size corpus whole, sidestepping retrieval. But the economics still favor retrieval as you scale, because the per-call cost compounds, context windows have a ceiling, and a model’s recall degrades when you bury the relevant sentence inside a very long prompt. For a real corpus at real traffic, retrieval wins.

That’s a lot of conditions to hold in your head at once, so walk them in the order an experienced engineer actually asks them. The following decision tool runs you through that order one question at a time.

Does this need RAG?

Notice what the walker rules out. RAG is not for the two things it’s most often confused with: a small corpus, which you paste into the prompt, and an exact-key lookup, which you handle with a tool. It earns its place only at the bottom-right leaf, a fuzzy question over a large or fast-moving body of internal text. Hold that picture, because everything mechanical from here serves it.
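The same decision can be written down as a small function. This is only a sketch of one plausible question order, built from the conditions described in this section; the type names, field names, and result labels are hypothetical and are not part of any library.

```typescript
type Recommendation =
  | 'no-extra-context'
  | 'use-a-tool'
  | 'paste-into-system-prompt'
  | 'use-rag';

// Hypothetical profile of the question and the corpus behind it.
interface CorpusProfile {
  modelAlreadyKnowsIt: boolean; // public, well-trodden information
  exactKeyLookup: boolean; // e.g. "what's the total on invoice #INV-203?"
  fitsComfortablyInPrompt: boolean; // a couple of pages, not a handbook
  changesFasterThanYouDeploy: boolean; // e.g. a pricing table that updates weekly
}

function recommend(profile: CorpusProfile): Recommendation {
  // Condition one: internal text only. Public knowledge needs no retrieval.
  if (profile.modelAlreadyKnowsIt) return 'no-extra-context';
  // An exact-key lookup is a tool call, not a similarity search.
  if (profile.exactKeyLookup) return 'use-a-tool';
  // Small and stable: the boring move is the right one.
  if (profile.fitsComfortablyInPrompt && !profile.changesFasterThanYouDeploy) {
    return 'paste-into-system-prompt';
  }
  // Large, or small but fast-moving: retrieval earns its weight.
  return 'use-rag';
}

const returnsPolicy = recommend({
  modelAlreadyKnowsIt: false,
  exactKeyLookup: false,
  fitsComfortablyInPrompt: true,
  changesFasterThanYouDeploy: false,
});

const ticketArchive = recommend({
  modelAlreadyKnowsIt: false,
  exactKeyLookup: false,
  fitsComfortablyInPrompt: false,
  changesFasterThanYouDeploy: false,
});
```

Here `returnsPolicy` comes out as `'paste-into-system-prompt'` and `ticketArchive` as `'use-rag'`, matching the two-page policy and ten-thousand-ticket examples above.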

What an embedding is

To find the relevant passages, you need a way to measure meaning. That’s what an embedding gives you, and you can understand the whole idea without a line of math.

An embedding is a fixed-length array of floating-point numbers, a vector, produced by an embedding model. That’s a different kind of model from the chat models you’ve been calling: a chat model takes text and writes text back, while an embedding model takes text and hands back numbers. Feed it the same text twice and you get the identical vector both times.

One property makes those numbers useful: text with similar meaning maps to nearby vectors. “Invoice,” “bill,” and “receipt” land close together. “Office cat” lands far away. The model learned this during training, so “nearby” tracks meaning, not spelling, which is why “unpaid invoices” sits near “outstanding bills” even though they share no words.

“Nearby” needs a precise measure, and the standard one is cosine similarity: a single number where higher means more similar in meaning. There’s one inversion worth flagging now, because it will trip you up in the query later. pgvector doesn’t compute similarity directly; its cosine operator returns the complement, cosine distance, which is simply 1 − similarity. So higher similarity is lower distance, which means closer vectors have smaller distance. When you write the query, you’ll order by distance ascending to get the most similar passages first. Keep that flip in mind.
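For readers who do want to see the arithmetic, cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is one minus that. The following is a minimal sketch in plain TypeScript; the helper names are illustrative and are not the pgvector or Drizzle functions used later.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i]! * b[i]!;
  }
  return sum;
}

// Higher means more similar; 1 means the vectors point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// The flipped form: lower means more similar.
function cosineDistanceBetween(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

const sameDirection = cosineSimilarity([1, 0], [2, 0]); // same direction, different length
const unrelated = cosineSimilarity([1, 0], [0, 1]); // perpendicular
const opposite = cosineDistanceBetween([1, 0], [-1, 0]); // pointing opposite ways
```

Note that `sameDirection` is 1 even though the second vector is twice as long: cosine compares direction only, which is why it suits embeddings whose length carries no meaning.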

So semantic search, end to end, is three steps: embed every passage in your corpus once, embed the user’s question when it arrives, then find the corpus vectors nearest to the question’s vector. Those nearest passages are your relevant context. That’s the whole idea.
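Those three steps can be sketched entirely in memory. The vectors below are three-dimensional and invented by hand purely for illustration; they are not the output of any embedding model, and the `nearest` helper is a hypothetical brute-force scan rather than what the database does.

```typescript
interface StoredPassage {
  content: string;
  vector: number[];
}

// Cosine distance: 1 - (a . b) / (|a| * |b|). Lower means closer in meaning.
function distance(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i]! * b[i]!;
    normA += a[i]! * a[i]!;
    normB += b[i]! * b[i]!;
  }
  return 1 - dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Step 3: rank every stored vector by distance to the query and keep the closest k.
function nearest(query: number[], corpus: StoredPassage[], k: number): StoredPassage[] {
  return [...corpus]
    .sort((x, y) => distance(query, x.vector) - distance(query, y.vector))
    .slice(0, k);
}

// Step 1 (pretend): the corpus was embedded once, ahead of time.
const corpus: StoredPassage[] = [
  { content: 'invoice', vector: [0.9, 0.1, 0.0] },
  { content: 'bill', vector: [0.8, 0.2, 0.1] },
  { content: 'office cat', vector: [0.0, 0.1, 0.9] },
];

// Step 2 (pretend): the question "unpaid invoices" was embedded when it arrived.
const questionVector = [0.85, 0.15, 0.05];

const topTwo = nearest(questionVector, corpus, 2).map((passage) => passage.content);
```

With these toy vectors `topTwo` holds 'invoice' and 'bill', and 'office cat' is left out. A real system swaps the hand-written vectors for model output and the full scan for an index, but the ranking idea is the same.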

The following figure makes “similar meaning means nearby vectors” tangible. It can’t show you a real embedding, since those have far too many dimensions to draw, but the intuition survives the cartoon.

Figure: A 2D cartoon of a 1536-dimensional space. Points for stored text: invoice, bill, receipt, refund policy, office cat. Point for the question: unpaid invoices.

Similar meanings sit close together; unrelated text sits far apart. The query lands among its nearest neighbours, and those are what we retrieve. Real embeddings have 1536 dimensions and can't be drawn, so this is only an intuition pump.

One number from this section feeds the next one. The fixed length of the vector is its dimensions, and you budget storage around it: the model this lesson uses, OpenAI’s text-embedding-3-small, produces 1536-dimensional vectors. Every passage you store is 1536 floating-point numbers. That number is about to show up in the database schema, where it has to match exactly.
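To make the storage budget concrete, here is a rough estimate. It assumes pgvector's documented storage of 4 bytes per dimension plus 8 bytes of overhead per vector, and it counts the vectors only: passage text, row overhead, and the index are extra. The function names and the 10,000-row figure are hypothetical.

```typescript
// Assumption: pgvector stores single-precision floats, 4 bytes each, plus 8 bytes per vector.
const BYTES_PER_DIMENSION = 4;
const VECTOR_OVERHEAD_BYTES = 8;

function vectorBytes(dimensions: number): number {
  return dimensions * BYTES_PER_DIMENSION + VECTOR_OVERHEAD_BYTES;
}

// Vector storage only; excludes the content column, row headers, and the HNSW index.
function corpusVectorBytes(rowCount: number, dimensions: number): number {
  return rowCount * vectorBytes(dimensions);
}

const perPassage = vectorBytes(1536); // 6152 bytes
const tenThousandPassages = corpusVectorBytes(10_000, 1536); // 61520000 bytes
const inMebibytes = tenThousandPassages / (1024 * 1024); // a little under 59 MiB
```

At that scale the vectors are a rounding error for a Postgres instance, which is part of why the same database is a comfortable home for them.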

The two SDK primitives: `embed` and `embedMany`

The AI SDK exposes embeddings through two functions, both imported from 'ai'. They’re the same idea at two scales.

embed({ model, value }) takes one string and returns one vector, alongside a usage count. You reach for it at query time, embedding the single question the user just asked.

embedMany({ model, values }) takes an array of strings and returns an array of vectors in the same order, plus the aggregate usage. You reach for it at index time, handing it the whole corpus in one call. If the array is larger than the provider accepts at once, it quietly splits the work into batches under the hood, so you don’t size the request yourself. That is a separate concern from chunking the documents, which is a decision you do make, and one we’ll return to shortly.

Both functions take a model handle, and that handle flows through the same registry as your chat models. Just as your chat models live behind role-named exports there, the embedding model gets its own named export: one line, one place to change it.

export const embeddingModel = 'openai/text-embedding-3-small';

That’s the same plain 'provider/model' AI Gateway string-id form your chat handles already use: the string itself routes through the gateway, with no provider package imported and no factory called. Recall the earlier rule that an imported, called provider for every generation is a fingerprint of stale v4 material. The direct-provider escape hatch is the factory form, openai.embeddingModel('text-embedding-3-small'). Reach for it only when you’re wired straight to one vendor and need an option the gateway string doesn’t surface; it is not the default.

One deliberate divergence is worth noting. The model-swaps chapter first seeded this embeddingModel handle with text-embedding-3-large, but this RAG lesson uses text-embedding-3-small. That’s not an oversight. text-embedding-3-small produces 1536-dimensional vectors, and 1536 stays under pgvector’s HNSW index limit of 2000 dimensions, whereas text-embedding-3-large’s 3072 dimensions would overflow it, so the fast similarity index this lesson is built around could not be used. The retrieval corpus is embedded with text-embedding-3-small for exactly that reason.

There’s a trap buried in that one line, and it’s worth flagging now. Chat model handles are freely swappable: point the export at a different provider tomorrow and yesterday’s conversations still make sense. Embedding handles are not. A different embedding model produces vectors in a different space, so an old vector and a new vector are no longer comparable, which means swapping the embedding model forces you to re-embed your entire corpus. We’ll come back to the operational weight of that at the end. For now, treat the embedding model as a far stickier choice than the chat model.

The two functions show their difference best side by side. The following two snippets are the same operation, text in and vectors out, at one item and at many.

Single: embed

import { embed } from 'ai';
import { embeddingModel } from '@/lib/llm/models';

const { embedding } = await embed({
  model: embeddingModel,
  value: 'When do I get a refund?',
});

One string in, one vector out. This is the query-time call: you embed the user’s question to compare it against the stored corpus. embedding is a single number[].

Batch: embedMany

import { embedMany } from 'ai';
import { embeddingModel } from '@/lib/llm/models';

const { embeddings } = await embedMany({
  model: embeddingModel,
  values: chunks,
});

Many strings in, many vectors out, in input order. This is the index-time call: chunks is an array of passages, and embeddings[i] is the vector for chunks[i]. The SDK auto-batches under the provider’s per-call limit.

One thing to internalize about the batch call: it belongs to an indexing job, not a request handler. You run it when a document is uploaded, or as a one-time backfill over existing documents, in a plain async function or a script, never inside the per-request path. Embedding ten thousand passages takes time and costs money, so you pay that once, offline, and the live chat never touches it.

Storing vectors: pgvector and the Drizzle `vector` column

Now you have vectors. Where do they live? The reflexive 2026 answer for a SaaS app is the Postgres you already run.

pgvector is a Postgres extension. It adds a vector column type and the similarity operators to query it. The reason to reach for it first is purely operational: your app already runs Postgres, so putting vectors in the same database means one fewer service to run, one fewer credential to rotate, and one fewer thing that can be down at 3am. You reach for a dedicated vector database, such as Pinecone, Upstash Vector, or Qdrant, only when the corpus genuinely outgrows what pgvector handles at your team’s scale, roughly tens of millions of vectors, or when the workload truly needs a managed service. For the corpus a typical SaaS feature retrieves over, that day rarely comes.

The schema is a single table. The following walkthrough builds it column by column, and each one earns its place.

export const documentChunks = pgTable(
  'document_chunks',
  {
    id: uuid().primaryKey().$defaultFn(() => uuidv7()),
    orgId: uuid().notNull(),
    documentId: uuid()
      .notNull()
      .references(() => documents.id, { onDelete: 'cascade' }),
    content: text().notNull(),
    embedding: vector({ dimensions: 1536 }).notNull(),
    embeddingModel: text().notNull(),
  },
  (t) => [
    index('idx_document_chunks_embedding').using(
      'hnsw',
      t.embedding.op('vector_cosine_ops'),
    ),
  ],
);

A normal Drizzle pgTable, nothing exotic. The UUIDv7 primary key is the same convention as every other table in the app. The snake_case SQL names like org_id come from the client’s casing: 'snake_case'.

orgId is the tenancy column. Every chunk belongs to exactly one org. This is the column the retrieval query filters on, and getting that filter right is the most important thing in this lesson, so it gets its own section below.

content holds the raw passage text. This is what the query returns and injects into the prompt. The vector finds the row, but it’s this text the model actually reads.

embedding is the pgvector column. vector({ dimensions: 1536 }) declares a 1536-dimensional vector, matching text-embedding-3-small’s output exactly. If these two numbers disagree, the insert fails.

embeddingModel records which model produced this vector. It looks redundant today, when everything uses one model, but it’s what makes a future re-index possible by letting you find exactly the rows that need new vectors.

The HNSW index. It’s what makes similarity search fast once the table is large. We name it and move on, because the defaults are right for this course’s scale and tuning it is out of scope.

Two parts of that schema are doing quiet, load-bearing work. The embedding column’s 1536 must match the embedding model’s output dimension, and it’s the first thing that breaks if you point the embeddingModel handle at a different embedding model without re-embedding. The HNSW index uses the vector_cosine_ops operator class specifically: these embeddings are normalized, cosine is the right distance metric, and the operator class has to match the operator the query uses. An index built with a different class, such as vector_l2_ops, still migrates cleanly, but a cosine-distance query can’t use it and quietly falls back to a sequential scan. Those are the two places the schema can fail; everything else is ordinary Drizzle.
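One cheap way to catch a dimension mismatch before it reaches the database is to check the vectors in application code. The guard below is a hypothetical helper, not part of the AI SDK or Drizzle; it assumes the 1536 declared in the schema and would sit between the embedding call and the insert.

```typescript
// Must agree with vector({ dimensions: 1536 }) in the schema.
const EXPECTED_DIMENSIONS = 1536;

// Throws with a readable message instead of letting the insert fail inside Postgres.
function assertDimensions(vectors: number[][], expected: number = EXPECTED_DIMENSIONS): void {
  vectors.forEach((vector, index) => {
    if (vector.length !== expected) {
      throw new Error(
        `Embedding ${index} has ${vector.length} dimensions, expected ${expected}. ` +
          'Was the embedding model changed without updating the schema?',
      );
    }
  });
}

// A vector of the right size passes silently.
assertDimensions([new Array<number>(1536).fill(0)]);

// A 3072-dimensional vector, as text-embedding-3-large would produce, is rejected.
let rejected = false;
try {
  assertDimensions([new Array<number>(3072).fill(0)]);
} catch {
  rejected = true;
}
```

The database would reject the bad row anyway; the point of the guard is only that the error names the likely cause.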

The two-phase pipeline: index, then query

RAG runs in two phases that happen at completely different times, and conflating them is the single most common source of confusion. Pull them apart and the whole thing gets simple.

The index phase is offline and batched. You take a document, split it into passages, embed all the passages with embedMany, and insert one row per passage into documentChunks, each row carrying its content, its embedding, its orgId, and the embeddingModel. This runs when a document is uploaded, or as a backfill. It happens before any user asks anything.

The query phase is online and per-request. When the question arrives, you embed it, run a similarity query for the nearest passages filtered to the user’s org, stitch those passages’ content into a context string, drop that into the system prompt, and call streamText. This happens every time a user sends a message.

The two phases never run together. They meet at exactly one place: the documentChunks table, which the index phase writes and the query phase reads. The following diagram walks the whole pipeline. Step through it and watch where the table sits in the middle.

Index phase diagram: Document → Chunker → embedMany → documentChunks

Index phase, offline. A document comes in and a chunker splits it into coherent passages, a few sentences to a paragraph each.

embedMany turns every passage into a vector, all in one batched call.

One row per passage is inserted into documentChunks, carrying its content, its embedding, its orgId, and its embeddingModel. Written once, offline.

Query phase diagram: Question → embed → documentChunks → system prompt → streamText → Answer

Query phase, now per request. The user’s question arrives and is embedded with embed into a single query vector.

A similarity query reads documentChunks and returns the top-K nearest passages, scoped to this org. This table is the seam between the two phases: written by the index phase, read here.

The retrieved passages enrich the system prompt, streamText runs, and the answer is grounded in your corpus.

Chunking: a decision, not a default

One step in the index phase deserves more than a node on a diagram: the chunker. The SDK does not ship one, and that’s deliberate, because how you split text is a problem-domain decision, not a setting. Prose splits well on paragraphs. Code splits on fixed token windows. Transcripts split on utterances or speaker turns. You either reach for a library like @langchain/text-splitters or hand-roll a splitter for your shape of text.

Two guardrails pull in opposite directions. Chunk too large, with whole pages per chunk, and a retrieved passage is mostly irrelevant text surrounding the one sentence that mattered, so the signal is buried and the answer drifts. Chunk too small, with a single sentence per chunk, and the passage loses the surrounding context it needs to make sense on its own. The sweet spot is a coherent passage, a few sentences to a paragraph, with a little overlap between neighbours so an idea split across a boundary survives in both. You won’t tune this to perfection on the first pass. Aim for “a human could answer the question from this passage alone” and adjust from there.
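As a starting point for prose, a hand-rolled splitter can be quite small. The following is a minimal sketch, not a recommended production chunker: it packs paragraphs greedily up to a character budget and repeats the last paragraph of each chunk at the start of the next as overlap. The function name and the default budget are assumptions, and characters stand in for tokens for simplicity.

```typescript
// Greedy paragraph packer with one paragraph of overlap between neighbouring chunks.
function chunkByParagraph(text: string, maxChars: number = 1000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];

  for (const paragraph of paragraphs) {
    const candidate = [...current, paragraph].join('\n\n');
    if (current.length > 0 && candidate.length > maxChars) {
      // The budget would be exceeded: close the current chunk.
      chunks.push(current.join('\n\n'));
      // Overlap: the next chunk starts with the previous chunk's last paragraph.
      current = [current[current.length - 1]!, paragraph];
    } else {
      current.push(paragraph);
    }
  }

  if (current.length > 0) {
    chunks.push(current.join('\n\n'));
  }
  return chunks;
}

const sample = ['aaaaaaaaaa', 'bbbbbbbbbb', 'cccccccccc', 'dddddddddd'].join('\n\n');
const sampleChunks = chunkByParagraph(sample, 25);
```

With a 25-character budget the four ten-character paragraphs come out as three chunks, a+b, b+c, and c+d, so each boundary paragraph appears on both sides. A paragraph longer than the budget is kept whole rather than cut, which is one of several places a real splitter would need more care.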

The index-phase code

Here’s the index phase as code, the three steps in order. It’s a plain async function you’d call on document upload, deliberately stripped to its spine.

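import { embedMany } from 'ai';
import { embeddingModel } from '@/lib/llm/models';
// db, documentChunks, chunkDocument, and the Document type come from the app's own modules.

export const indexDocument = async (doc: Document) => {
  // 1. Split the document into passages.
  const chunks = chunkDocument(doc.content);
  // Nothing to embed or insert for an empty document.
  if (chunks.length === 0) return;

  // 2. Embed every passage in one call; embeddings[i] is the vector for chunks[i].
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: chunks,
  });

  // 3. Insert one row per passage.
  await db.insert(documentChunks).values(
    chunks.map((content, i) => ({
      orgId: doc.orgId,
      documentId: doc.id,
      content,
      embedding: embeddings[i],
      // Record the handle that produced the vector, so the column can't drift from the model in use.
      embeddingModel,
    })),
  );
};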

That’s the shape, not the whole story. Real ingestion adds deduplication, batching across many documents, and error handling around each step. Those concerns reuse the background-work patterns you already have, so we don’t rebuild them here. Deduplication in particular is a trap worth its own mention later.

The query-phase code

The similarity query is the most intricate snippet in the lesson, so step through it. This is the read side: embed the question, then find the nearest passages, scoped to the org.

import { embed } from 'ai';
import { asc, cosineDistance, eq, sql } from 'drizzle-orm';
import { embeddingModel } from '@/lib/llm/models';
// db and documentChunks come from the app's existing database client and schema modules.

export const findRelevantChunks = async (question: string, orgId: string) => {
  // Embed the question into a single query vector.
  const { embedding } = await embed({
    model: embeddingModel,
    value: question,
  });

  // Cosine distance between each stored vector and the query vector (lower is closer).
  const distance = cosineDistance(documentChunks.embedding, embedding);

  return db
    .select({
      content: documentChunks.content,
      // Convert distance back to similarity for callers (higher is closer).
      similarity: sql<number>`1 - (${distance})`,
    })
    .from(documentChunks)
    // Tenancy filter: only this org's chunks are candidates.
    .where(eq(documentChunks.orgId, orgId))
    // Ascending distance puts the most similar passages first.
    .orderBy(asc(distance))
    .limit(5);
};

Embed the question into a query vector with embed. This is the one string we’re searching with.

cosineDistance computes the distance between the stored column and the query vector, and 1 - distance turns it back into similarity for the caller. Recall the inversion: distance is lower-is-closer, so similarity is higher-is-closer.

The tenancy filter. eq(documentChunks.orgId, orgId) restricts the search to this org’s chunks and nothing else. This single line is the difference between a working feature and a cross-tenant leak, so it gets its own section below.

Order by distance ascending and take the top 5. Ordering by the distance expression itself, rather than by the derived similarity, is the form pgvector’s index is built to serve, so the HNSW index can do the work. K is small on purpose: a large K drags irrelevant passages back into the prompt and reintroduces the very cost problem retrieval exists to solve.

The RAG query in a route handler

Now wire retrieval into the chat handler you already know. The whole change is a few lines that run before streamText.

The shape is this: take the user’s latest question, embed it, run the org-scoped similarity query, join the retrieved passages into one relevantContext string, and fold that into the system prompt. The user’s question stays where it always was, in messages, while the retrieved context goes into the system prompt. That split is not cosmetic, and the reason is spelled out after the walkthrough.

The following handler is the same one from earlier in the course with the retrieval block dropped in. Step through it.

export const POST = authedRoute('member', chatSchema, async ({ messages }, { session }) => {
  const lastMessage = messages.at(-1);
  const question = lastMessage?.parts
    .filter((part) => part.type === 'text')
    .map((part) => part.text)
    .join(' ');

  const chunks = await findRelevantChunks(question ?? '', session.orgId);
  const relevantContext = chunks.map((chunk) => chunk.content).join('\n\n');

  const result = streamText({
    model: smartModel,
    system: `You answer questions about the company handbook.
Answer only from the context below; if it is not there, say you don't know.

Relevant context:
${relevantContext}`,
    messages: convertToModelMessages(messages),
    stopWhen: stepCountIs(5),
    maxOutputTokens: 1024,
  });

  return result.toUIMessageStreamResponse();
});

The same wrapper every route uses, the same signature as the chapter’s composed handler: the validated body arrives as { messages }, and the context’s session carries session.orgId. That orgId is what scopes retrieval.

Pull the user’s latest question text and fetch its nearest chunks, org-scoped. This whole block runs before the model call, which is what makes it pre-retrieval.

The retrieved passages go into the system prompt, the trusted controller, never into messages. The retrieval was authorized server-side under session.orgId, so this text is trusted; the raw user turn in messages is not.

stopWhen and maxOutputTokens are unchanged from the chapter’s handler, and still non-negotiable. The step cap and the output cap are your cost guardrails whether or not retrieval is in play.

toUIMessageStreamResponse returns the same response the client already speaks. Nothing on the client changes, because retrieval is invisible to it.

This is the pre-retrieval pattern: you retrieve on every turn, before the model runs. It’s the simplest shape and the right default for a surface where nearly every message needs the corpus.

Two things deserve emphasis. The first is why the context goes in the system prompt and not in messages. The system prompt is the controller, holding the trusted instructions you wrote. The messages array carries untrusted user input, which is why you never let it dictate the model’s behaviour. Retrieved context belongs on the trusted side because the retrieval itself was authorized server-side, under session.orgId, by code you control. Putting it in the system prompt keeps the controller in charge: retrieval enriches the instructions rather than handing the reins to whatever happens to be in the corpus.
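The split can be isolated into a small pure function, which also makes it easy to test. This is a sketch only: the helper name and the fallback wording for an empty result are assumptions rather than part of the handler above, and the instructions are copied from it.

```typescript
// Shape of the rows a retrieval query returns.
interface RetrievedChunk {
  content: string;
  similarity: number;
}

// Builds the trusted side of the conversation: your instructions plus retrieved context.
// The user's question never passes through here; it stays in messages.
function buildSystemPrompt(chunks: RetrievedChunk[]): string {
  const instructions = [
    'You answer questions about the company handbook.',
    "Answer only from the context below; if it is not there, say you don't know.",
  ].join('\n');

  // Hypothetical fallback so an empty retrieval is explicit rather than a blank section.
  const relevantContext =
    chunks.length > 0
      ? chunks.map((chunk) => chunk.content).join('\n\n')
      : '(no relevant passages were found)';

  return `${instructions}\n\nRelevant context:\n${relevantContext}`;
}

const prompt = buildSystemPrompt([
  { content: 'Refunds are issued within 14 days.', similarity: 0.91 },
  { content: 'Refunds go back to the original payment method.', similarity: 0.87 },
]);
const emptyPrompt = buildSystemPrompt([]);
```

Because the function takes only retrieved rows, there is no path by which raw user input ends up in the system prompt by accident.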

The second is a caveat experienced engineers internalize. Retrieval grounds the answer, but it does not validate it. RAG reduces the odds the model invents facts that contradict your corpus, but it does not make the corpus true, and it does not guarantee the answer is correct. If your handbook is wrong, a grounded answer will be confidently wrong. Treat retrieved text as the model’s source, not as an oracle, and don’t let the architecture lull you into trusting outputs you’d otherwise check.

Pre-retrieval vs retrieval as a tool

Pre-retrieval always fires. But on many surfaces, most questions don’t need the corpus at all: a general assistant fields “what’s the weather” and “summarize this thread” far more often than “what does the handbook say about X.” Embedding and querying on every one of those turns is wasted work.

So there’s a second architecture: make retrieval a tool. Define a searchKnowledgeBase tool whose execute does the embed-and-query you just wrote, and let the model decide when to call it through the same agentic loop you already use for every other tool. On turns that don’t need the corpus, the model simply doesn’t reach for it.

Here’s the cut between the two, stated plainly:

Pre-retrieval is the default when every turn likely needs the corpus, as in a docs Q&A bot or a “chat with this handbook” surface. It runs one query per turn, deterministically, with the least machinery.
Retrieval as a tool is the call for mixed surfaces, where some questions need the corpus and many don’t. It skips the embedding call and the query on the turns that don’t need them, and it lets the model fold retrieval into a loop alongside its other tools.

Both compose with everything you’ve already built. As a tool, the orgId filter lives inside execute, exactly where every other tool’s org-scope lives, and the result feeds back through the same stopWhen loop. The following two tabs show the two shapes against the same goal.

Pre-retrieval (every turn)

const chunks = await findRelevantChunks(question, session.orgId);
const relevantContext = chunks.map((chunk) => chunk.content).join('\n\n');

const result = streamText({
  model: smartModel,
  system: `Answer from the context below.\n\nRelevant context:\n${relevantContext}`,
  messages: convertToModelMessages(messages),
  stopWhen: stepCountIs(5),
});

Use when every turn needs the corpus. Retrieval runs unconditionally before the model. This is the simplest and most deterministic shape, the condensed form of the worked handler above.

Retrieval as a tool (model decides)

const searchKnowledgeBase = tool({
  description: 'Search the company handbook for relevant passages.',
  inputSchema: z.object({ query: z.string() }),
  execute: async ({ query }) => findRelevantChunks(query, session.orgId),
});

const result = streamText({
  model: smartModel,
  messages: convertToModelMessages(messages),
  tools: { searchKnowledgeBase },
  stopWhen: stepCountIs(5),
});

Use on mixed surfaces, where many turns don’t need the corpus. The model calls the tool only when it judges the question needs it. The orgId filter lives inside execute, same as every tool, and session.orgId is in scope from the handler.

One thing to watch before you pick: don’t run both on the same surface without deciding which fires first. If you pre-inject context and also hand the model a searchKnowledgeBase tool, it may double-retrieve, or get confused about whether it already has what it needs. Use one retrieval strategy per surface.

That decision, which approach fits which question, is the real payload of this lesson, more than any single line of code. Test yourself against the following scenarios.

Your support chat needs to answer questions from a two-page returns policy that’s barely been touched in a year. Which approach earns its weight?

Drop the whole policy text into the system prompt and ship — no vector store.

Index it in documentChunks and retrieve on every turn before the model runs.

Expose a getPolicy tool that fetches the row by id.

Define a searchKnowledgeBase tool and let the model decide when to consult it.

A user pastes the string INV-203 and wants that one invoice’s line items back. Which approach fits?

Embed the request and pull the nearest chunks on every turn.

Call a tool that fetches the invoice deterministically by its number.

Paste the entire invoice archive into the system prompt.

You’re building a “chat with our 300-page engineering handbook” bot where essentially every message is a question about the handbook. Which approach fits?

Concatenate all 300 pages into the system prompt on each request.

Wire a lookupSection tool that fetches a page by its heading.

Retrieve the nearest passages unconditionally, before the model runs, on every turn.

Hand the model a retrieval tool and let it decide whether each question needs the handbook.

A general support assistant fields all kinds of requests — weather, summaries, small talk — and only now and then needs to quote the handbook. Which approach fits?

Embed and query the corpus before every single message regardless of the question.

Inline the handbook in the system prompt so it’s always available.

Fetch a handbook row by id with a deterministic tool.

Expose retrieval as a tool and let the model call it only when a question needs the handbook.

Authorizing retrieval: the multi-tenant rule

This is short because it is absolute. Every documentChunks row carries an orgId. Every retrieval query must filter by session.orgId. No exceptions, no shortcuts.

You’ve enforced org-scoping on every query in the app, but it’s worth understanding why an unscoped retrieval query is worse than an ordinary leak. A normal query that forgets its tenant filter returns another org’s rows to your code, and that’s bad enough. But here the leaked rows don’t stop at your code: the model reads them and quotes them in its answer. One tenant’s private handbook, their internal pricing, or their customer’s tickets surface as natural-language prose inside another tenant’s chat, read by that user as if it were their own answer. The leak is laundered into fluent output the user has no reason to distrust. It is the worst shape a cross-tenant bug can take.

The rule travels with the query, wherever the query lives. For pre-retrieval, the filter is in the route handler. For retrieval-as-a-tool, it’s inside execute, the same place every tool puts its org-scope. This is the red line you stepped through in the similarity query, and it is not optional anywhere it appears.
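
To see why the filter has to sit inside the query itself, here is a minimal in-memory sketch. It is not the lesson's findRelevantChunks: the Chunk shape and the scopedTopK name are hypothetical, and a plain array stands in for the documentChunks table. The only thing it demonstrates is the ordering, filter by org first and rank second.

```typescript
// Hypothetical in-memory stand-in for a documentChunks row.
interface Chunk {
  orgId: string;
  content: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The tenant filter runs before ranking, so another org's chunk can never
// reach the top-K, however close its vector is to the query.
function scopedTopK(
  chunks: Chunk[],
  queryEmbedding: number[],
  orgId: string,
  k: number,
): Chunk[] {
  return chunks
    .filter((chunk) => chunk.orgId === orgId)
    .map((chunk) => ({ chunk, score: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// org_b's chunk has the same vector as the query, and is still excluded.
const sampleChunks: Chunk[] = [
  { orgId: 'org_a', content: 'Org A refund policy', embedding: [1, 0] },
  { orgId: 'org_b', content: 'Org B internal pricing', embedding: [1, 0] },
  { orgId: 'org_a', content: 'Org A holiday schedule', embedding: [0, 1] },
];

const scopedResults = scopedTopK(sampleChunks, [1, 0], 'org_a', 2);
```

Notice that the org_b chunk is a perfect match for the query vector. Without the filter it would tie for first place, which is exactly how a forgotten where-clause turns into another tenant's text in the answer.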

Keeping the corpus fresh

A corpus is not a one-time upload. Three operational realities follow you for the life of the feature, and you should know them by name even though their mechanics reuse tools you already have.

Embedding models change. Move from text-embedding-3-small to a newer model and you must re-embed the entire corpus: the new model’s vectors live in a different space, so they can’t be compared with the old ones, and they may not even match the column’s dimension. This is the swap trap from earlier, now concrete. The embeddingModel column on each row exists precisely so this is survivable: you query for the rows still tagged with the old model and re-embed them in batches, rolling the corpus forward without a big-bang outage.
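
The roll-forward can be sketched roughly as follows. This is an illustration under stated assumptions, not the lesson's job code: the ChunkRow shape, the reembedStaleRows name, and the injected embedBatch callback are all hypothetical, and the rows are mutated in memory where a real job would issue database updates. In practice embedBatch would presumably wrap embedMany.

```typescript
// Hypothetical row shape: only the fields the roll-forward needs.
interface ChunkRow {
  id: number;
  content: string;
  embedding: number[];
  embeddingModel: string;
}

// The caller supplies the embedding call. It must return one vector per
// input text, in the same order.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function reembedStaleRows(
  rows: ChunkRow[],
  targetModel: string,
  embedBatch: EmbedBatch,
  batchSize = 100,
): Promise<number> {
  // Only rows still tagged with a different model need work.
  const stale = rows.filter((row) => row.embeddingModel !== targetModel);
  let updated = 0;

  for (let start = 0; start < stale.length; start += batchSize) {
    const batch = stale.slice(start, start + batchSize);
    const vectors = await embedBatch(batch.map((row) => row.content));
    if (vectors.length !== batch.length) {
      throw new Error('embedBatch returned the wrong number of vectors');
    }
    // Write the vector and the model tag together, so a row is never
    // left with a new vector under an old tag or the reverse.
    batch.forEach((row, i) => {
      row.embedding = vectors[i];
      row.embeddingModel = targetModel;
    });
    updated += batch.length;
  }
  return updated;
}
```

Because the job selects on the tag, it is safe to re-run: a second pass finds only the rows the first pass didn't reach.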

Documents change. When a source document is edited, you re-chunk and re-embed that document’s chunks. When it’s deleted, its chunks go with it, since the foreign-key cascade you put on documentId handles that for free.
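
A minimal sketch of the edit case, replacing a document's chunks wholesale instead of diffing them. Everything named here is an assumption for illustration: the StoredChunk shape, the reindexDocument function, the injected embedTexts callback, and especially the paragraph splitter, which is a placeholder for whatever chunking strategy you actually chose. An array stands in for the table; against Postgres the delete and insert would presumably run in one transaction.

```typescript
interface StoredChunk {
  documentId: string;
  content: string;
  embedding: number[];
}

type EmbedTexts = (texts: string[]) => Promise<number[][]>;

// Naive paragraph splitter, used here only as a placeholder chunker.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Replace every chunk of one document. Chunks of other documents are
// carried over untouched.
async function reindexDocument(
  store: StoredChunk[],
  documentId: string,
  newText: string,
  embedTexts: EmbedTexts,
): Promise<StoredChunk[]> {
  const pieces = splitIntoParagraphs(newText);
  // Embed before touching the store, so a failed embedding call leaves
  // the old chunks in place. Skip the call when there is nothing to embed.
  const vectors = pieces.length > 0 ? await embedTexts(pieces) : [];
  const fresh = pieces.map((content, i) => ({
    documentId,
    content,
    embedding: vectors[i],
  }));
  const others = store.filter((chunk) => chunk.documentId !== documentId);
  return [...others, ...fresh];
}
```

Replacing the whole document's chunks is the blunt option, but it avoids stale fragments surviving an edit, which is the failure that quietly degrades answers.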

Duplicates poison retrieval. Don’t insert the same passage twice. A duplicated chunk shows up multiple times in the top-K results, spending your small K budget on the same text and biasing the answer toward whatever got duplicated. Deduplicate at insert time. This is the trap flagged back at the index-phase code, and it’s the kind of thing that’s invisible until the answers start looking subtly repetitive.
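
One way to deduplicate at insert time is to compare passages on a normalized key before they are embedded. The sketch below is hypothetical: normalizeForDedup and dedupeChunks are illustrative names, and folding case and whitespace is an assumption about what counts as “the same passage” in your corpus. In a multi-tenant table you would keep one seen set per org, so that one tenant's text never suppresses another's.

```typescript
// Collapse whitespace and case so trivially different copies compare equal.
function normalizeForDedup(content: string): string {
  return content.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Keep the first occurrence of each passage, preserving order.
// `seen` can be pre-seeded with the keys of passages already stored.
function dedupeChunks(contents: string[], seen: Set<string> = new Set()): string[] {
  const unique: string[] = [];
  for (const content of contents) {
    const key = normalizeForDedup(content);
    // Skip empty passages and anything already seen.
    if (key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    unique.push(content);
  }
  return unique;
}

// The second entry differs only in case and spacing, so it is dropped.
const dedupedPassages = dedupeChunks([
  'Refunds are issued within 14 days.',
  'refunds are  issued within 14 days. ',
  'Shipping takes 3 to 5 business days.',
]);
```

Running the check before the embedding call also saves you from paying to embed text you are about to discard.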

None of these need new machinery. A re-index is a background job, and you already have the background-work toolkit to run one durably.

External resources

The canonical references for the exact APIs this lesson used, plus a tool to make the abstract part concrete.

AI SDK — Embeddings

ai-sdk.dev

The reference for embed, embedMany, and the embedding model handle.

Drizzle — Vector similarity search

orm.drizzle.team

The pgvector column helper, the HNSW index, and the cosineDistance query shape.

TensorFlow Embedding Projector

projector.tensorflow.org

Rotate and zoom a real embedding space in 3D, and watch nearest neighbours cluster by meaning, live.

pgvector

github.com

The extension itself: the vector column, the cosine operator class, and the HNSW index tuning knobs.
