Chapter 108 · Lesson 4

The per-user daily token quota

The chat answers grounded questions now. It reads real invoice aggregates, refuses cross-org probes, and writes an audit trail. There is one thing it still does not do, and it is the thing that separates a demo from something you would put behind a public URL: it has no idea how much it costs. Every message a user sends spends tokens, and tokens are money. A single user — or a single runaway script pretending to be one — can sit in that chat box all afternoon and run up a model bill with nothing standing in the way.

Your goal in this lesson is to cap each user at 100,000 tokens per day, so no one person can run up an unbounded model bill, and to refuse gracefully once the budget is spent. When it works, the first request a user sends while at or over the cap comes back as a typed 429 refusal instead of an answer (the model never runs, and no new 'llm.finish' row lands in the audit tail), and a new GET /api/usage endpoint reports today’s used, cap, and remaining tokens for whoever is asking. There is no screenshot to chase: this lesson’s whole surface is server-side. The rendered usage panel is the next lesson’s job. For now the proof is the 429, the JSON, and the inspector’s live quota counter ticking up after a normal question.

Your mission

This is the cost-cap discipline from “Bounding spend before the surface goes public” made concrete: a per-user-per-day token budget, enforced server-side, with a typed refusal the client can render. That lesson argued why a public LLM surface needs a spend ceiling and what a quota buys you. This one builds the ceiling for real, so reach back there for the reasoning rather than re-deriving it here.

The hard part is where the reservation runs, and it is the reflex worth installing from this lesson. The check that decides “is this user over budget?” has to happen before streamText spends a single token — a check that runs after the stream has started is theater. That is why it does not live inside the route handler. It lives in a middleware, withLlmQuota, composed around authedRoute: the route is exported as withLlmQuota(authedRoute('member', …)), so the wrapper sits between the incoming request and the handler. Wrap first, then add capability. The structural payoff is that a future LLM route physically cannot forget cost enforcement, the same way it cannot forget auth — both are wrappers you compose on, not lines you remember to write inside the body. The wrapper itself ships complete in the starter as a provided seam; your job is to author the quota module it calls and to wire it onto the route. You are building the engine and turning the key, not casting the key.

The quota lives in the usageQuota store array — the in-memory stand-in for a usage_quota_daily table — keyed by (userId, day). That key is the entire daily-reset mechanism, and it is worth pausing on because there is no cron job, no scheduled wipe, nothing that “resets” anything. Tomorrow simply has a different day value, so the first request tomorrow finds no row for (userId, tomorrow) and pushes a fresh one starting at zero. The seed proves this directly: it puts one near-cap row for today and a separate near-cap row for yesterday, both for the same user. Yesterday’s row at 99,000 tokens does not block today, because it is a different key. Name the SQL lineage as you build it — ensure-then-compare is INSERT ... ON CONFLICT DO NOTHING followed by a SELECT against usage_quota_daily (primary key (userId, day)); here two array operations stand in for those two statements, chosen for readability over collapsing them into a single INSERT ... ON CONFLICT DO UPDATE ... RETURNING.
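To see the key doing the reset on its own, here is a minimal, self-contained sketch. It does not import the project's store: `rows`, `findRow`, `ensureRow`, and `dayKeyUtc` are hypothetical stand-ins, the dates are made up, and `dayKeyUtc` is only an assumption about how a helper like todayUtc() might derive its value. Only yesterday's row is seeded here, to keep the example short.

```typescript
// Hypothetical stand-in for the usageQuota store; not the project's real module.
type QuotaRow = { userId: string; day: string; tokensUsed: number };

// Assumed way to derive a UTC day key (YYYY-MM-DD); the real todayUtc() may differ.
const dayKeyUtc = (date: Date): string => date.toISOString().slice(0, 10);

const rows: QuotaRow[] = [
  // A near-cap row for "yesterday", like the one the seed describes.
  { userId: 'user-acme-member', day: '2025-01-01', tokensUsed: 99_000 },
];

const findRow = (userId: string, day: string): QuotaRow | undefined =>
  rows.find((r) => r.userId === userId && r.day === day);

// Find-or-push: the in-memory analogue of INSERT ... ON CONFLICT DO NOTHING.
const ensureRow = (userId: string, day: string): QuotaRow => {
  const existing = findRow(userId, day);
  if (existing) return existing;
  const row: QuotaRow = { userId, day, tokensUsed: 0 };
  rows.push(row);
  return row;
};

const yesterday = dayKeyUtc(new Date('2025-01-01T23:59:59Z')); // '2025-01-01'
const today = dayKeyUtc(new Date('2025-01-02T00:00:01Z')); // '2025-01-02'

// The first request of the new day finds no row for (userId, today) and starts at zero.
const todayRow = ensureRow('user-acme-member', today);
console.log(todayRow.tokensUsed); // 0
console.log(findRow('user-acme-member', yesterday)?.tokensUsed); // 99000
```

Nothing was wiped between the two days. Yesterday's row is still in the array; it simply no longer matches the key being looked up.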

There is a trade-off here you should name out loud rather than paper over: this is a soft daily ceiling, enforced as you go. The actual token charge happens in arrears, inside onStepFinish, as each step’s tokens are consumed — so the step that pushes a request over the cap is charged after it ran. The next request gets refused, but the one that crossed the line was already paid for. For a 100,000-token daily budget that is perfectly acceptable; the overshoot is bounded by one request’s worth of tokens. It is not a hard rate limit, and if you needed one — say, a strict per-second quota on an expensive model — you would pre-reserve a budgeted amount up front instead of charging as-you-go. Some teams do exactly that. Know that the alternative exists; the as-you-go ceiling is the right default for this surface.
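For contrast, the pre-reserve alternative could look roughly like the sketch below. This is not part of the lesson's code: `Budget`, `preReserve`, and `settle` are hypothetical names, and the numbers are illustrative. The guarantee only holds if the estimate is an upper bound on what the call can actually consume.

```typescript
// Hypothetical hard-limit variant: reserve an estimate up front, settle afterwards.
type Budget = { used: number; cap: number };

// Reserve `estimate` tokens before the model runs. Refuses when the estimate
// would not fit, so a call that cannot be covered never starts.
const preReserve = (budget: Budget, estimate: number): boolean => {
  if (budget.used + estimate > budget.cap) return false;
  budget.used += estimate;
  return true;
};

// After the call finishes, swap the reserved estimate for the actual count.
// Assumes actual <= estimate; otherwise the cap can still be exceeded.
const settle = (budget: Budget, estimate: number, actual: number): void => {
  budget.used += actual - estimate;
};

const budget: Budget = { used: 99_500, cap: 100_000 };

const tooBig = preReserve(budget, 1_024); // 100,524 would exceed the cap
console.log(tooBig, budget.used); // false 99500

const fits = preReserve(budget, 400); // 99,900 fits
settle(budget, 400, 250); // the call only used 250 tokens
console.log(fits, budget.used); // true 99750
```

The cost of this design is that a request can be refused even though its real usage would have fit, which is one reason the as-you-go ceiling is the simpler default here.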

One more simplification to name: the counter sums input and output tokens into a single number. Production billing often separates them, because output tokens cost several times what input tokens do, and a real cost dashboard wants the two priced apart. Both fields are optional on the v5 usage object — a step can report a partial usage — so default each to zero with ?? before you add them. A missing field must degrade to “charged nothing for that field,” never crash the route. The /api/usage endpoint you also build is the read side the usage panel will poll next lesson; it reuses authedRoute('member') so the usage it reports is always scoped to the authenticated caller, never a user id from the request.

The quota module, the wrapper wiring, and the read endpoint are the entire surface here. The rendered usage panel and the typed client that consumes it are out of scope and land in the next lesson.

1. The first request sent while the user is at or over the 100,000-token cap returns HTTP 429 with { ok: false, error: { code: 'quota_exceeded', userMessage } }, refused before the stream starts. (untested)

2. GET /api/usage returns today’s { used, cap, remaining } for the acting user. (untested)

3. After a normal question, the usage counter increases by the actual token count for that conversation. (untested)

4. A question with a long preamble increases the counter by more than a short question, because the counter sums input and output tokens. (untested)

5. Yesterday’s near-cap row does not block today’s request, because the quota is keyed by (userId, day). (untested)

6. The slice typechecks and builds (pnpm verify). (tested)

Coding time

Build the quota module, then the read endpoint, then wire the wrapper onto the chat route and add the increment to its onStepFinish — against the brief above and the verification below. Try it before opening the walkthrough.

Reference solution and walkthrough

We’ll build in the order the pieces depend on each other: the quota module first (everything else calls into it), then the /api/usage endpoint, then the two surgical edits that wire the wrapper and the counter onto the chat route you already wrote.

The quota module

Everything the feature does to a quota row lives in one file, src/lib/llm/quota.ts, with four value exports (the cap constant, a read, a reserve, and a charge) plus the two types the read and the reserve return. Read it top to bottom; the comment above each function carries the rationale for that move.

import 'server-only';

import { findQuotaRow, todayUtc, usageQuota } from '@/server/store';

export const DAILY_TOKEN_CAP = 100_000;

export type UsageReport = {
  used: number;
  cap: number;
  remaining: number;
};

export type QuotaReservation =
  | { ok: true }
  | { ok: false; error: { code: 'quota_exceeded'; userMessage: string } };

// Find today's row, or push a fresh `tokensUsed: 0` one — the in-memory analogue
// of `INSERT ... ON CONFLICT DO NOTHING`. Reservation and accounting both ensure
// the row exists before touching it.
const ensureTodayRow = (userId: string) => {
  // Read the day once so the lookup and the insert always use the same key,
  // even if the call straddles midnight UTC.
  const day = todayUtc();
  const existing = findQuotaRow(userId, day);
  if (existing) {
    return existing;
  }

  const row = {
    userId,
    day,
    tokensUsed: 0,
    updatedAt: new Date().toISOString(),
  };
  usageQuota.push(row);
  return row;
};

// Today's used/cap/remaining — the shape `/api/usage` returns and the panel
// polls. Missing row reads as zero used.
export const readUsage = async (userId: string): Promise<UsageReport> => {
  const used = findQuotaRow(userId, todayUtc())?.tokensUsed ?? 0;
  return {
    used,
    cap: DAILY_TOKEN_CAP,
    remaining: Math.max(0, DAILY_TOKEN_CAP - used),
  };
};

// Reserve before the stream spends — runs in `withLlmQuota` before delegating.
// At or over the cap, refuse with a typed 429-shaped error the wrapper returns;
// otherwise the call proceeds and `addUsage` charges in arrears. Ensure-then-
// compare keeps the two steps readable.
export const reserveQuotaOrRefuse = async (
  userId: string,
): Promise<QuotaReservation> => {
  const row = ensureTodayRow(userId);

  if (row.tokensUsed >= DAILY_TOKEN_CAP) {
    return {
      ok: false,
      error: {
        code: 'quota_exceeded',
        userMessage: "You've reached today's usage limit. Try again tomorrow.",
      },
    } as const;
  }

  return { ok: true } as const;
};

// Charge tokens as they are consumed — runs per step in the route's
// `onStepFinish`. A soft daily ceiling: charged in arrears, so a single request
// can push slightly past the cap before the next reservation refuses. Input and
// output tokens are summed into one number (production separates the two prices).
export const addUsage = async (
  userId: string,
  tokens: number,
): Promise<void> => {
  const row = ensureTodayRow(userId);
  row.tokensUsed += tokens;
  row.updatedAt = new Date().toISOString();
};

The shape worth noticing is ensureTodayRow, the private helper both reserveQuotaOrRefuse and addUsage lean on. It is the find-or-push that stands in for INSERT ... ON CONFLICT DO NOTHING: look up the (userId, todayUtc()) row, and if it is not there, push one starting at zero. Reserving and charging both call it first, so neither has to worry about whether the row exists yet — by the time they touch tokensUsed, it does. Note that it keys on todayUtc(), so the row it ensures is always today’s row. Yesterday’s seeded 99,000 row is a different key; this helper never sees it. That is requirement 5 falling straight out of the data model — the daily reset is just the key changing, no scheduled wipe anywhere.

reserveQuotaOrRefuse is requirement 1. It ensures today’s row, then compares tokensUsed to DAILY_TOKEN_CAP with >=, and at or over the cap it returns the canonical Result error shape — { ok: false, error: { code: 'quota_exceeded', userMessage } }, the same Result discipline the action wrappers return, built on Zod’s error contract and standardized in the authedAction wrapper. Note what it does not do: it does not return a Response, and it does not set a status code. It hands a typed verdict back to the wrapper, and the wrapper turns the refusal into the 429. The route never calls reserveQuotaOrRefuse itself — the only caller is withLlmQuota.

readUsage is requirement 2, the read side. It reads findQuotaRow(userId, todayUtc())?.tokensUsed ?? 0 — a missing row reads as zero used, never undefined — and returns { used, cap, remaining }, with remaining floored at zero through Math.max(0, …) so a user who overshot the soft cap sees remaining: 0 rather than a negative number.

addUsage is requirements 3 and 4, the charge. It ensures today’s row and does row.tokensUsed += tokens — the in-memory UPDATE … SET tokens_used = tokens_used + $tokens. It takes a single tokens number; the summing of input and output happens at the call site in onStepFinish, which we get to below.
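The way the three operations interact under a soft ceiling can be traced with a small sketch. These are simplified, synchronous stand-ins working on a single counter, not the module above: `reserve`, `charge`, and `remaining` are hypothetical names, and the starting value and charge are illustrative.

```typescript
// Simplified stand-ins for reserve / charge / read on one in-memory counter.
const CAP = 100_000;
let used = 99_500; // illustrative starting point, just under the cap

// Mirrors the `>=` refusal: allowed only while strictly under the cap.
const reserve = (): boolean => used < CAP;

// Mirrors the in-arrears charge: no cap check here, it just accumulates.
const charge = (tokens: number): void => {
  used += tokens;
};

// Mirrors the floor at zero so an overshoot never reads as negative.
const remaining = (): number => Math.max(0, CAP - used);

const first = reserve(); // true: 99,500 is under the cap
charge(800); // the request runs and is charged afterwards
const second = reserve(); // false: 100,300 is at or over the cap

console.log(first, second, used, remaining()); // true false 100300 0
```

The overshoot of 300 tokens is the soft ceiling in action: the request that crossed the line was paid for, and the refusal lands on the one after it.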

The provided seam: with-llm-quota.ts

You do not write this file — it ships complete in the starter — but you wire it on, so read it and understand its three moves. This is the structural payoff the whole lesson is about.

import 'server-only';

import { reserveQuotaOrRefuse } from '@/lib/llm/quota';
import { getSession } from '@/server/session';

// The daily-quota seam, composed AROUND `authedRoute` — `withLlmQuota(authedRoute(...))`.
// Quota lives here, not inside the route, so a new LLM route cannot forget cost
// enforcement: wrap first, then add capability. It reserves before the stream
// starts (reserve-before-spend) and short-circuits a typed 429 when the user is
// at or over the cap; otherwise it delegates to the wrapped handler untouched.
export const withLlmQuota =
  (handler: (req: Request) => Promise<Response>) =>
  async (req: Request): Promise<Response> => {
    const session = await getSession();
    const reserved = await reserveQuotaOrRefuse(session.userId);

    if (!reserved.ok) {
      return Response.json(
        { ok: false, error: reserved.error },
        { status: 429 },
      );
    }

    return handler(req);
  };

It is a higher-order function: it takes a handler (req) => Promise<Response> and returns a new handler of the same shape. The three moves are: resolve the acting user from getSession() (the same cookie-driven session every other route reads — the user id comes from the server, never the request body), call reserveQuotaOrRefuse(session.userId), and then branch. If the reservation refused, it short-circuits with Response.json(..., { status: 429 }) and the inner handler never runs — the model never starts. Otherwise it delegates to handler(req) untouched, and the chat streams as normal.

Because it returns a function of the exact same signature authedRoute returns, the two compose cleanly: withLlmQuota(authedRoute(...)). The request flows quota → auth → handler, and the reservation is structurally guaranteed to run before the handler ever touches streamText. That is the “can’t forget” property — cost enforcement is a layer the request passes through, not a line in the body someone has to remember.
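The ordering can be checked with a toy version of the composition. This sketch deliberately simplifies: handlers are synchronous functions over strings rather than Request and Response, and `withQuota`, `withAuth`, and `handler` are hypothetical stand-ins that only record when they run (the real authedRoute also takes a role and a schema).

```typescript
// Simplified synchronous handler type so the sketch runs without a server.
type Handler = (req: string) => string;

const log: string[] = [];

// Hypothetical quota wrapper: records its turn, then refuses or delegates.
const withQuota =
  (isOverCap: () => boolean) =>
  (inner: Handler): Handler =>
  (req: string): string => {
    log.push('quota');
    if (isOverCap()) return '429 quota_exceeded';
    return inner(req);
  };

// Hypothetical auth wrapper: records its turn and delegates.
const withAuth =
  (inner: Handler): Handler =>
  (req: string): string => {
    log.push('auth');
    return inner(req);
  };

const handler: Handler = (req) => {
  log.push('handler');
  return `handled ${req}`;
};

// Under the cap: the request passes through every layer in order.
const allowed = withQuota(() => false)(withAuth(handler))('POST /api/chat');
const allowedTrace = log.splice(0); // take the trace and clear the log

// Over the cap: the outer layer short-circuits and nothing inside runs.
const refused = withQuota(() => true)(withAuth(handler))('POST /api/chat');
const refusedTrace = log.splice(0);

console.log(allowed, allowedTrace.join(' -> ')); // handled POST /api/chat quota -> auth -> handler
console.log(refused, refusedTrace.join(' -> ')); // 429 quota_exceeded quota
```

Because the outermost wrapper runs first, whichever layer you compose on the outside is the one that can stop the request before anything inside it executes.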

The usage endpoint

src/app/api/usage/route.ts is three imports and a single export. It is the read side the panel will poll.

import { z } from 'zod';
import { authedRoute } from '@/lib/authed-route';
import { readUsage } from '@/lib/llm/quota';

// The usage endpoint the token panel polls. GET carries no body, so it parses
// against `z.strictObject({})` (the wrapper treats an absent body as `{}`); the
// auth wrap resolves the acting user from the session closure, never the request.
export const GET = authedRoute(
  'member',
  z.strictObject({}),
  async (_input, ctx) => Response.json(await readUsage(ctx.userId)),
);

The one thing that looks odd here is the empty schema. A GET carries no request body, so there is nothing to validate — but authedRoute still wants a schema. z.strictObject({}) is the schema for “an empty object, and no extra keys,” and authedRoute treats an absent body as {}, so the parse passes. The handler reads ctx.userId — resolved by the auth wrapper from the session, never from anything the caller sent — and returns readUsage(ctx.userId) as JSON. That is requirement 2: usage is always per authenticated user.

Wire the wrapper and the counter onto the chat route

Two surgical changes to src/app/api/chat/route.ts, the route you already wrote in the previous lessons. Nothing else in the file moves.

Before (where lesson 3 left it), followed by After (this lesson):

import { convertToModelMessages, stepCountIs, streamText } from 'ai';
import { z } from 'zod';
import { authedRoute } from '@/lib/authed-route';
import { writeLlmFinishEvent, writeLlmStepEvent } from '@/lib/llm/audit';
import { chatModel } from '@/lib/llm/models';
import { invoiceQAPrompt } from '@/lib/llm/prompts';
import { buildInvoiceTools, type InvoiceUIMessage } from '@/lib/llm/tools';

export const POST = authedRoute(
  'member',
  z.strictObject({ messages: z.array(z.unknown()) }),
  async (input, ctx) => {
    const org = await ctx.db.query.organization.findFirst({
      where: (o) => o.id === ctx.orgId,
    });
    const orgName = org?.name ?? 'your organization';

    const tools = buildInvoiceTools({ orgId: ctx.orgId });

    const result = streamText({
      model: chatModel,
      system: invoiceQAPrompt({ orgName }),
      messages: convertToModelMessages(input.messages as InvoiceUIMessage[]),
      tools,
      stopWhen: stepCountIs(5),
      maxOutputTokens: 1024,
      onStepFinish: async ({ usage, toolCalls, finishReason }) => {
        await writeLlmStepEvent({
          userId: ctx.userId,
          orgId: ctx.orgId,
          finishReason,
          usage,
          toolCalls,
        });
      },
      onFinish: ({ usage, finishReason }) =>
        writeLlmFinishEvent({
          userId: ctx.userId,
          orgId: ctx.orgId,
          finishReason,
          usage,
        }),
      onError: ({ error }) => {
        console.error('[chat] stream error', { code: 'stream_error' });
        void error;
      },
    });

    return result.toUIMessageStreamResponse();
  },
);

The grounded-tool route, but nothing caps cost yet. The export is a bare authedRoute, and onStepFinish only writes the per-step audit row — no reservation runs in front of the stream and no token counter is charged.

import { convertToModelMessages, stepCountIs, streamText } from 'ai';
import { z } from 'zod';
import { authedRoute } from '@/lib/authed-route';
import { writeLlmFinishEvent, writeLlmStepEvent } from '@/lib/llm/audit';
import { chatModel } from '@/lib/llm/models';
import { invoiceQAPrompt } from '@/lib/llm/prompts';
import { addUsage } from '@/lib/llm/quota';
import { buildInvoiceTools, type InvoiceUIMessage } from '@/lib/llm/tools';
import { withLlmQuota } from '@/lib/llm/with-llm-quota';

// The streaming chat endpoint. `withLlmQuota` wraps `authedRoute` (quota composed
// AROUND auth — cost enforcement can't be forgotten); the inner handler owns the
// loop with a server-side `stopWhen` cap and a `maxOutputTokens` ceiling, both
// non-negotiable. The schema accepts untyped `messages` on purpose —
// `convertToModelMessages` does the real validation; the route does not duplicate it.
export const POST = withLlmQuota(
  authedRoute(
    'member',
    z.strictObject({ messages: z.array(z.unknown()) }),
    async (input, ctx) => {
      const org = await ctx.db.query.organization.findFirst({
        where: (o) => o.id === ctx.orgId,
      });
      const orgName = org?.name ?? 'your organization';

      const tools = buildInvoiceTools({ orgId: ctx.orgId });

      const result = streamText({
        model: chatModel,
        system: invoiceQAPrompt({ orgName }),
        messages: convertToModelMessages(input.messages as InvoiceUIMessage[]),
        tools,
        stopWhen: stepCountIs(5),
        maxOutputTokens: 1024,
        onStepFinish: async ({ usage, toolCalls, finishReason }) => {
          await addUsage(
            ctx.userId,
            (usage.inputTokens ?? 0) + (usage.outputTokens ?? 0),
          );
          await writeLlmStepEvent({
            userId: ctx.userId,
            orgId: ctx.orgId,
            finishReason,
            usage,
            toolCalls,
          });
        },
        onFinish: ({ usage, finishReason }) =>
          writeLlmFinishEvent({
            userId: ctx.userId,
            orgId: ctx.orgId,
            finishReason,
            usage,
          }),
        onError: ({ error }) => {
          console.error('[chat] stream error', { code: 'stream_error' });
          void error;
        },
      });

      return result.toUIMessageStreamResponse();
    },
  ),
);

Two load-bearing changes. The export is now wrapped in withLlmQuota, so the reservation runs before the handler; and onStepFinish charges the daily counter with the summed input + output tokens (the two new imports addUsage and withLlmQuota carry them) before writing the audit row.

The first change is the wrap: the export const POST = authedRoute(...) from the previous lesson becomes export const POST = withLlmQuota(authedRoute(...)). That one wrapping is the entire reservation path — the route handler stays exactly as it was, and the quota check now runs in front of it, before any token is spent. The route never reasons about quota; the wrapper does.

The second change is one statement inside the existing onStepFinish, added ahead of the audit write you put there in the previous lesson:

await addUsage(
  ctx.userId,
  (usage.inputTokens ?? 0) + (usage.outputTokens ?? 0),
);

This is requirements 3 and 4. onStepFinish fires once per step of the agentic loop, and each step reports its token usage, so charging here accumulates the real cost of the whole conversation across however many steps it took. The summed expression (usage.inputTokens ?? 0) + (usage.outputTokens ?? 0) is doing two jobs at once: it folds input and output into the single number addUsage charges, and it defends against the v5 usage object’s optional fields. Both inputTokens and outputTokens can be undefined on a partial usage report — ?? 0 makes a missing field contribute nothing rather than turn the whole sum into NaN and corrupt the counter. A question with a long preamble runs more input tokens through the model than a terse one, so it charges more here — that is requirement 4 falling out of summing both sides.
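A small sketch makes the NaN risk concrete. `StepUsage` is a hypothetical, trimmed-down shape holding only the two fields this lesson reads, and `naiveSum` exists purely to show the failure mode; the numbers are illustrative.

```typescript
// Minimal shape of the two fields read here; the real usage object has more.
type StepUsage = { inputTokens?: number; outputTokens?: number };

// What happens if the optional fields are trusted to be present.
const naiveSum = (u: StepUsage): number =>
  (u.inputTokens as number) + (u.outputTokens as number);

// The defended version: a missing field contributes zero.
const safeSum = (u: StepUsage): number =>
  (u.inputTokens ?? 0) + (u.outputTokens ?? 0);

// A partial report: the step reported input tokens but no output tokens.
const partial: StepUsage = { inputTokens: 120 };

console.log(naiveSum(partial)); // NaN
console.log(safeSum(partial)); // 120

// Once NaN is added to a counter, every later read is NaN as well.
let corrupted = 90_000;
corrupted += naiveSum(partial);
let healthy = 90_000;
healthy += safeSum(partial);

console.log(Number.isNaN(corrupted), healthy); // true 90120
```

A NaN counter is worse than a wrong one: `NaN >= cap` is always false, so a corrupted row would never trigger the refusal again.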

That is the whole change to the route: a wrapper around the export and one line inside onStepFinish. Everything else — the system prompt, the tools, the step cap, the output ceiling, the two audit writes — is untouched from the previous lessons.

streamText reference (AI SDK v5)

ai-sdk.dev

The onStepFinish callback and the usage object — note inputTokens and outputTokens are typed number | undefined, which is why your charge sums them with `?? 0`.

Token usage and multi-step generation

ai-sdk.dev

How per-step usage accumulates across an agentic loop — the model behind charging in onStepFinish rather than once at the end.

Moment of truth

This project ships no per-lesson test suite — the lesson-verification/ directory is a harness slot, not a green gate, and there is no pnpm test:lesson 4 for this chapter. What guards the slice mechanically is pnpm verify, which chains three checks and stops at the first failure: biome ci ., then tsc --noEmit, then a next build with SKIP_ENV_VALIDATION=true. Run it:

pnpm verify

Biome and tsc are silent on success, so a clean run flows straight through to the next build summary — compiled, type-checked, and the route map printed, with /api/chat and /api/usage listed:

 ✓ Compiled successfully
 ✓ Linting and checking validity of types
 ✓ Collecting page data
 ✓ Generating static pages

Route (app)                                 Size
┌ ƒ /api/chat                                0 B
├ ƒ /api/usage                               0 B
└ ○ /invoices                              ...

That confirms requirement 6 — the quota module, the wrapper wiring, and the endpoint typecheck and build. Everything behavioral is a live check you confirm by hand, and there is one trap in the setup worth stating up front: the seeded near-cap quota row is for user-acme-member, but the default session is org-acme:admin — a different user. The quota checks only fire once you switch the inspector identity to org-acme:member. If you force the quota or look for the 429 while acting as the admin, you are looking at the wrong user’s quota row and nothing will happen. Switch identity first.

A note on what needs a real model and what doesn’t: the live checks call the model, which needs AI_GATEWAY_API_KEY set in .env. But the 429 refusal and the daily-key independence short-circuit before the model — the wrapper refuses the over-cap request without ever calling streamText — so those two checks work with no key at all. Only the counter-increment checks need a real model call, because there are no real tokens to count until the model runs.

1. Switch the inspector identity to org-acme:member (the seed’s near-cap row is for that user, not the default admin). Apply “Force quota to 99,500”, then ask one small question: the next POST /api/chat returns 429 with the quota_exceeded Result shape, and no new 'llm.finish' row appears in the audit tail (the model never ran). Works without an API key. (untested)

2. GET /api/usage returns { used, cap, remaining } for the acting user. (untested)

3. After “Reset and re-seed”, acting as org-acme:member, asking one question ticks the inspector’s usage counter up from the seeded 90,000 baseline by the conversation’s actual token count, and the audit payload’s usage.inputTokens + usage.outputTokens matches the delta. Needs an API key. (untested)

4. The same question with a long preamble increases the counter by more than the short question did. Needs an API key. (untested)

5. The seed’s “yesterday” 99,000 row does not block today’s request: acting as org-acme:member, today’s row starts at the seeded 90,000, independent of yesterday’s 99,000. Works without an API key. (untested)
