Chapter 66, Lesson 5

Surviving crashes: retries, waits, and idempotency keys

How Trigger.dev makes a long-running background task crash-proof through checkpoints, declarative retries, idempotency keys, and durable waits.

Picture a task that exports an organization’s invoices. It runs for about ten minutes, paging through the database five hundred rows at a time, and emails a download link when it finishes. Now picture the things that go wrong in production, because they will. At minute eight, the platform redeploys your code and recycles the worker out from under the running job. Or the box runs out of memory and gets killed. Or the third invoice page hits a downstream service that returns 429 Too Many Requests because you’ve been hammering it.

Three plain questions fall out of that, and together they are the whole lesson. What does Trigger.dev handle for you automatically, with no code on your part? What is left for you to handle? And, the question behind most production incidents, where do the bugs hide in the seam between those two?

By the end of this lesson you can write a multi-step task that survives every one of those failures: the redeploy, the kill, the 429. It resumes from where it died instead of starting over, and it never sends the “your export is ready” email twice or re-charges a customer because a retry replayed a side effect. In the previous lesson you learned the SDK surface that lets you define, type, trigger, and queue a task. This lesson is what turns one of those tasks from defined into crash-proof.

Hold onto one idea the whole way through, because every section is a different view of it: durability lives in the seams between steps, not inside them. Once that clicks, retries, idempotency keys, and durable waits stop being three APIs to memorize and become one mental model with three knobs. The shape you build here is the one the CSV export project ships in the next chapter, so treat this lesson as that project’s engine.

Durability is a property of the seams, not the steps

Start with what you’re actually being promised. A durable run is a run that survives the worker dying. Redeploy your code mid-run, OOM-kill the box, let the platform recycle the machine on a whim, and the run picks back up and finishes. You met that word, durable, by name when you decided this work belonged on Trigger.dev at all. Here is the machinery underneath it.

The runtime serializes your run’s state, meaning where it is and what it has produced so far, and writes that snapshot down at specific moments. Each snapshot is a checkpoint. It checkpoints at exactly three moments: at every await wait.* call, at every await *.triggerAndWait call, and at the end of every attempt. The worker is the compute process running your task, a different process from the Vercel function that kicked it off. When that worker dies, the runtime spins up a new worker, rehydrates it from the last checkpoint, and continues from there. Nothing you did before that checkpoint runs again, and everything after it does.

Read that last sentence again, because it has a sharp consequence that catches people. The checkpoints sit between operations, never in the middle of one, so the code inside a single stretch of work is not snapshotted line by line. Suppose your task body is one nine-minute synchronous loop, crunching numbers and transforming rows, with no await wait.* and no triggerAndWait anywhere inside it. If the worker dies at minute five, there is no checkpoint at minute five to resume from. The last checkpoint was the start, so the whole loop runs again from zero. That task runs on a durable platform but is not, in any useful sense, durable.

The fix is the design principle this entire lesson serves: split long work into small steps separated by a checkpoint. You put a wait or a triggerAndWait between the pieces, and now each completed piece is a save point you never have to re-cross. Page one writes, checkpoint. Page two writes, checkpoint. Crash on page seven, and pages one through six stay done. Durability isn’t something the runtime sprinkles over your code; it follows from where you place the seams, and placing them is your job.
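The save-point idea can be sketched as a toy simulation. This is not the Trigger.dev runtime: the Step type, the CheckpointStore, and runOnWorker are hypothetical names invented for illustration, and a "crash" is just an early return before the step's checkpoint is written.

```typescript
// Toy model of a durable run: NOT the Trigger.dev runtime, just the idea.
type Step = { label: string; work: () => string };

// Results of steps whose checkpoint has been crossed survive a worker crash.
type CheckpointStore = Record<string, string>;

// Runs steps in order. Completed steps are skipped; `crashOn` simulates the
// worker dying in the middle of that step, before its checkpoint is written.
function runOnWorker(
  steps: Step[],
  store: CheckpointStore,
  executed: string[],
  crashOn?: string,
): boolean {
  for (const step of steps) {
    if (step.label in store) continue; // already past this save point
    executed.push(step.label); // the step body actually runs
    if (step.label === crashOn) return false; // worker died mid-step
    store[step.label] = step.work(); // checkpoint after the step
  }
  return true;
}

const exportSteps: Step[] = [1, 2, 3].map((n) => ({
  label: `page-${n}`,
  work: () => `wrote page ${n}`,
}));

const checkpointStore: CheckpointStore = {};
const executedSteps: string[] = [];
runOnWorker(exportSteps, checkpointStore, executedSteps, "page-2"); // first worker dies in page 2
const runFinished = runOnWorker(exportSteps, checkpointStore, executedSteps); // new worker resumes
console.log(executedSteps); // ["page-1", "page-2", "page-2", "page-3"]
```

Page 1 runs once because its checkpoint was crossed before the crash; page 2, the step in flight, is the only one that repeats.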

Step through the sequence below. It walks one run from start to finish with a crash in the middle, so you can watch “resume from the last checkpoint” actually happen.

Panel 1 (worker running; Step A executing, Step B waiting). The run starts on a worker. Step A, writing page 1, is executing. The checkpoint after it is still dim, because that boundary hasn't been crossed yet.

Panel 2 (worker running; Step A done, Step B waiting). Step A finishes and the runtime writes a checkpoint. That boundary is now crossed, so page 1 is durably done.

Panel 3 (worker running; Step A done, Step B executing). The run moves past the checkpoint into Step B, writing page 2, which is now executing.

Panel 4 (worker dead; Step A done, Step B aborted). The worker dies mid-Step-B, whether from a redeploy, an OOM kill, or a platform recycle. Step B is aborted, but checkpoint A is still lit.

Panel 5 (new worker running; Step A skipped as cached, Step B executing). A new worker spins up and rehydrates from checkpoint A. Step A is not re-run; it returns its cached result, greyed out. Step B re-executes from the top.

Panel 6 (worker running; Step A done, Step B done). Step B completes, a checkpoint is written, and the run finishes. In one line: completed steps are never re-run, and the step in flight at crash time is the one that repeats.

Carry the last panel out of this section: split long work at wait and triggerAndWait boundaries, where each boundary is a save point. Everything that follows builds on it. Retries decide what happens when a step throws. Idempotency keys make the step that does re-run safe to re-run. Waits are how you create the boundaries in the first place. Three knobs, one model.

Retries are the runtime’s job, declared not coded

Start with the failure this prevents. A transient blip, such as a briefly overloaded downstream service, a network hiccup, or a database connection reset, kills a run that would have succeeded thirty seconds later if anyone had simply tried again. You do not want to write that “try again” logic by hand. The runtime owns it, and you configure it once.

You declare retries as a config block on the task. The shape is small enough to read at a glance.

export const exportInvoices = schemaTask({
  id: 'export-invoices',
  retry: {
    maxAttempts: 5,
    factor: 1.8,
    minTimeoutInMs: 1_000,
    maxTimeoutInMs: 60_000,
    randomize: true,
  },
  run: async (payload, { ctx }) => {
    // ...export the invoices...
  },
});

Each field controls one dimension of the back-off curve. maxAttempts is the total number of tries, including the first, so 5 means the original run plus four retries. factor is the exponential multiplier between waits: with 1.8, each retry waits roughly 1.8× as long as the one before. minTimeoutInMs is the floor for the first wait, and maxTimeoutInMs the ceiling no wait exceeds; past the cap, retries keep firing but stop spreading further apart. randomize: true adds jitter to each wait.
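To make the curve concrete, here is a rough sketch that computes the waits those fields imply. The formula (floor times factor raised to the retry index, capped at the ceiling) and the jitter strategy are assumptions for illustration, not taken from the Trigger.dev source; backoffDelays and RetryConfig are hypothetical names.

```typescript
// Illustrative only: the exact formula and jitter strategy the runtime uses
// are assumptions here, not taken from the Trigger.dev source.
type RetryConfig = {
  maxAttempts: number;
  factor: number;
  minTimeoutInMs: number;
  maxTimeoutInMs: number;
  randomize: boolean;
};

// Returns the wait (in ms) before each retry. `random` is injectable so the
// jittered path can be exercised deterministically.
function backoffDelays(cfg: RetryConfig, random: () => number = Math.random): number[] {
  const delays: number[] = [];
  // maxAttempts counts the first try, so there are maxAttempts - 1 waits.
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const raw = cfg.minTimeoutInMs * Math.pow(cfg.factor, retry);
    const capped = Math.min(raw, cfg.maxTimeoutInMs);
    // Assumed jitter: stretch each wait by a random factor in [1, 2).
    const jittered = cfg.randomize ? capped * (1 + random()) : capped;
    delays.push(Math.round(Math.min(jittered, cfg.maxTimeoutInMs)));
  }
  return delays;
}

const lessonConfig: RetryConfig = {
  maxAttempts: 5,
  factor: 1.8,
  minTimeoutInMs: 1_000,
  maxTimeoutInMs: 60_000,
  randomize: false,
};
console.log(backoffDelays(lessonConfig)); // [1000, 1800, 3240, 5832]
```

With jitter off, the lesson's config produces four waits, one per retry, each 1.8× the last. Raise maxAttempts far enough and the later waits flatten out at the 60-second ceiling.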

That last field matters more than it looks. You met exponential backoff with jitter by name earlier; here is the concrete reason jitter is non-negotiable. Imagine a downstream API falls over and a thousand of your runs all fail at the same instant. Without jitter, every one of them computes the identical back-off and retries at the exact same moment, a thundering herd that knocks the service straight back down the instant it recovers. Jitter scatters those thousand retries across a window so the recovering service sees a trickle, not a wall. You almost always want it on.

Here is the discipline that makes this work: retries are declarative, not imperative. A throw inside your task triggers a retry, on the configured back-off, for free. Do not wrap an external call in a try/catch with your own retry loop. The runtime already owns retrying, so a second hand-rolled layer on top multiplies with it: “five attempts” becomes twenty-five, and your carefully tuned back-off curve dissolves into noise. Let it throw. If the body genuinely needs to know which attempt it’s on, to log it or to branch on the final try, ctx.attempt.number tells you, but reaching for it should be rare.
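The multiplication is easy to verify with a small counter. The sketch below simulates both layers in plain TypeScript (countCalls is a hypothetical helper, and the "runtime" is just an outer loop), so it shows the arithmetic rather than real SDK behavior.

```typescript
// Counts how often the downstream is hit when a hand-rolled retry loop sits
// inside a task that the runtime also retries. Both layers are simulated.
function countCalls(runtimeAttempts: number, handRolledAttempts: number): number {
  let calls = 0;
  const alwaysFails = (): void => {
    calls++;
    throw new Error("downstream unavailable");
  };
  for (let attempt = 1; attempt <= runtimeAttempts; attempt++) {
    // Outer loop: stands in for the runtime's run-level retries.
    try {
      for (let inner = 1; inner <= handRolledAttempts; inner++) {
        // Inner loop: the hand-rolled retry you should not write.
        try {
          alwaysFails();
        } catch (err) {
          if (inner === handRolledAttempts) throw err; // give up and rethrow
        }
      }
    } catch {
      // The runtime would back off here and start the next attempt.
    }
  }
  return calls;
}

console.log(countCalls(5, 5)); // 25
console.log(countCalls(5, 1)); // 5
```

With the inner loop removed (one hand-rolled attempt), the downstream sees exactly the five calls the retry config promised.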

Retry on transients, abort on permanents

Now the decision that earns its keep in code review. Every throw retries by default, and that is the right default, but not every failure deserves a retry. Some failures will never succeed no matter how many times you try, and retrying them just burns the full run of attempts before failing anyway, delaying the inevitable and making the eventual error harder to spot in the noise.

So Trigger.dev gives you one escape hatch: throw an AbortTaskRunError and the run fails immediately, skipping every remaining retry. The heuristic to internalize is retry on transients, abort on permanents.

A transient failure is one that will probably clear on its own: a 5xx from an overloaded service, a dropped connection, a 429 rate-limit. Try again in a few seconds and it likely works. A permanent failure won’t: a 400 from a request whose payload is malformed and will stay malformed, a validation error, a bug in your own code. Retrying a permanent failure just spends five guaranteed-to-fail attempts. The two throws sit side by side below.

const res = await fetch(downstreamUrl);
if (res.status === 429 || res.status >= 500) {
  throw new Error(`Downstream unavailable: ${res.status}`);
}

Let it throw, and the runtime retries it. A 429 or a 5xx is almost always temporary; a bare throw hands it straight to the configured back-off, and a later attempt will likely succeed.

const parsed = payloadSchema.safeParse(input);
if (!parsed.success) {
  throw new AbortTaskRunError('Invalid payload — retrying will not fix it');
}

Abort, because retrying is pointless. A payload that failed validation will fail identically on every attempt. AbortTaskRunError fails the run now and skips the four wasted back-off windows a plain throw would have cost.

AbortTaskRunError is your one tool here. Throw AbortTaskRunError to fail a run on the spot with no further retries. Everything else throws normally and rides the back-off.
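One way to keep the split consistent across a codebase is to centralize it in a small classifier. The sketch below is an assumed policy, not something the SDK ships: decideOnStatus and the Decision type are hypothetical, and the rule (429 and 5xx are transient, any other 4xx is permanent) is a simplification you may need to adjust per service.

```typescript
type Decision = "ok" | "retry" | "abort";

// Assumed policy: 429 and 5xx are transient; every other 4xx is permanent.
function decideOnStatus(httpStatus: number): Decision {
  if (httpStatus < 400) return "ok";
  if (httpStatus === 429 || httpStatus >= 500) return "retry";
  return "abort";
}

for (const code of [200, 429, 500, 503, 400, 404]) {
  console.log(code, decideOnStatus(code));
}
```

In a task body, "retry" would map to a plain throw that rides the back-off, and "abort" to throwing AbortTaskRunError.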

Sort the failures below into the two buckets. The skill being practiced is the transient-versus-permanent split, which is the exact call you’ll make every time you write an external call inside a task.

A task just hit each of these failures. Decide whether the runtime should retry it, or whether you should throw AbortTaskRunError and stop now.

Retry (transient): likely to clear on its own, so let it throw.
Abort (permanent): will never succeed, so throw AbortTaskRunError.

Resend returns 429 Too Many Requests
Postgres connection reset mid-query
Stripe responds 500
The payload fails Zod validation
404 from a resource that will never exist
A downstream service responds 503 Service Unavailable

Run-level retries and the duplicate-side-effect trap

This is the hinge of the lesson, because it creates the problem that idempotency keys solve. To see the trap you first have to know exactly what a retry re-runs, and that depends on which of two retry layers you mean.

A run-level retry is the one you just configured. On an unhandled throw, the runtime re-runs the task from its most recent checkpoint. Here is the part that bites: if the task body has no internal checkpoints between its start and the line that threw, “most recent checkpoint” means the very beginning, so every line runs again.

There is a second, quieter layer you should be able to name so you don’t confuse it with the first. A call-level retry is when an SDK or HTTP client retries a single failed request on its own, like a wrapper that quietly re-attempts a 429’d fetch. That restarts only the one call, not the run. You configure and reason about run-level retries; just know that call-level retries exist underneath, so you don’t accidentally stack a third hand-rolled layer on top of two that already work.

Now the trap. A run-level retry re-executes side-effecting lines, and a side effect is anything that touches the outside world and can’t be quietly taken back: a row written, an email sent, a card charged. Consider a task that loops over the members of an organization and emails each one. It gets to member two hundred and throws, a transient blip on that one send. The run-level retry restarts the body from the top, so it sends to member one again, and member two, and so on up through member one hundred and ninety-nine, every one of whom already got the email on the first pass. This one mechanism is behind very nearly every duplicate-email incident in production background jobs.

Rather than take the description on faith, watch it happen. Predict what the program below prints.

This is a sketch of a task body, not runnable code, so reason it through. The task logs a line per member, then throws on the third iteration. Run-level retries are configured to allow two attempts in total, and there are three members. What does the worker's log show across both attempts?

async function run() {
  for (const member of ['Ada', 'Bo', 'Cy']) {
    if (member === 'Cy') throw new Error('transient blip');
    console.log(`sent to ${member}`);
  }
}

A run-level retry restarts the body from the top, not from where it threw. There’s no checkpoint inside the loop, so the second attempt replays the completed sends to Ada and Bo before reaching Cy again and exhausting maxAttempts. In a real task those console.logs are live emails — Ada and Bo each get two. The fix in the next section is a per-iteration idempotency key that makes the second send a no-op.

The duplication is right there in the output: Ada and Bo, twice. The runtime will run your side effects twice without hesitation, so the next section is how you make twice safe.
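To check the prediction, the sketch can be wrapped in a tiny stand-in for the runtime. runWithRetries below is a hypothetical helper that only models "restart the body from the top on a throw"; it has no back-off and no checkpoints.

```typescript
// Simulates a run-level retry: each attempt restarts the body from the top.
function runWithRetries(body: () => void, maxAttempts: number): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      body();
      return true; // run succeeded
    } catch {
      // A real runtime would wait out the back-off before the next attempt.
    }
  }
  return false; // attempts exhausted, run fails
}

const workerLog: string[] = [];
const runSucceeded = runWithRetries(() => {
  for (const member of ["Ada", "Bo", "Cy"]) {
    if (member === "Cy") throw new Error("transient blip");
    workerLog.push(`sent to ${member}`);
  }
}, 2);

console.log(workerLog); // ["sent to Ada", "sent to Bo", "sent to Ada", "sent to Bo"]
console.log(runSucceeded); // false
```

Four log lines for two members, and the run still fails: the worst of both outcomes.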

Idempotency keys make a retried step run once

You’ve met this idea before, in different clothes. When you ingested Stripe webhooks, you guarded every event against double-processing with a stable key and a unique constraint, so repeated work happened exactly once. That discipline is the conceptual parent of what comes now: a stable key collapses repeats into one. The difference is that Trigger.dev hands it to you as a first-class runtime primitive instead of a database claim row you assemble yourself.

Every trigger, triggerAndWait, and batchTriggerAndWait accepts an idempotency key. The contract is simple: within a time window, the same key returns the same run. No new run starts and no body re-executes; you get a handle to the original run, finished or still in flight, along with its result. Re-trigger with that key a thousand times and the work happens once.

await chargeCustomer.trigger(payload, {
  idempotencyKey,
  idempotencyKeyTTL: '24h',
});

A few specifics are easy to get wrong, so pin them now. The window is set by idempotencyKeyTTL, and it takes a duration string such as '60s', '5m', '24h', or '3d', not a number of milliseconds. Leave it off and the default is thirty days. Inside that window, a late duplicate (a retried POST that arrives an hour later, a user who double-clicks) maps to the same run; past it, the key is free to start a fresh one.
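The window semantics can be modeled in a few lines. This is a toy in-memory store, not how the platform implements it: IdempotencyStore is a hypothetical class, and it takes the TTL and the current time as plain millisecond numbers purely to keep the sketch deterministic (the real option takes a duration string).

```typescript
// Toy idempotency store: the same key inside the TTL window returns the same run.
type StoredRun = { runId: string; expiresAt: number };

class IdempotencyStore {
  private runs: Record<string, StoredRun> = {};
  private nextId = 1;

  // `ttlMs` and `now` are millisecond numbers to keep the sketch simple;
  // the real option takes a duration string such as '24h'.
  trigger(key: string, ttlMs: number, now: number): string {
    const existing = this.runs[key];
    if (existing !== undefined && now < existing.expiresAt) {
      return existing.runId; // duplicate inside the window: same run
    }
    const runId = `run_${this.nextId++}`;
    this.runs[key] = { runId, expiresAt: now + ttlMs };
    return runId;
  }
}

const HOUR_MS = 60 * 60 * 1000;
const ttlStore = new IdempotencyStore();
const firstRun = ttlStore.trigger("org_1:export", 24 * HOUR_MS, 0);
const duplicateRun = ttlStore.trigger("org_1:export", 24 * HOUR_MS, 1 * HOUR_MS);
const afterExpiryRun = ttlStore.trigger("org_1:export", 24 * HOUR_MS, 25 * HOUR_MS);
console.log(firstRun, duplicateRun, afterExpiryRun); // run_1 run_1 run_2
```

The duplicate that arrives an hour later collapses onto the first run; the one that arrives after the window starts a fresh run.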

You don’t usually build the key by splicing strings by hand. You call idempotencyKeys.create(key, { scope }), and you can pass it an array of parts, like [organizationId, 'export', day], which it hashes into one stable key. That’s the ergonomic way to compose a key from the pieces that make it unique. You can still picture it as organizationId:export:day glued together; the array form is just that, done safely.

The real lever is scope, and it’s where the design thinking lives. Scope decides what the key is namespaced against:

scope: 'run', the default, hashes the key together with the parent run id. So the same logical key, re-issued by a retry of the same parent, maps to the same child run. This is exactly what you want for keys inside a task that retries: the retry regenerates the same keys, and the runtime recognizes the child work as already done. It replaces the older habit of manually prefixing ctx.run.id onto your keys.
scope: 'global' hashes the key alone, namespaced against nothing. This means “this runs once, ever”, a stable business key triggered from your app. One export per organization per day, no matter how many times the button is clicked or the action retried.
scope: 'attempt' re-allows the work on each retry attempt. It’s named here only so you recognize it; it’s rarely what you want for a side effect, since the whole point of a key is usually to survive across attempts.
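The three scopes can be pictured as different namespaces prepended to the same logical key. The sketch below joins the parts with a colon where the real helper hashes them, so the output strings are illustrative only; effectiveKey, Scope, and RunContext are hypothetical names.

```typescript
type Scope = "run" | "attempt" | "global";
type RunContext = { runId: string; attemptNumber: number };

// Illustrative: joins the namespace parts with ':' where the real helper hashes.
function effectiveKey(parts: string[], scope: Scope, ctx: RunContext): string {
  const base = parts.join(":");
  if (scope === "global") return base; // namespaced against nothing
  if (scope === "run") return `${ctx.runId}:${base}`; // stable across attempts
  return `${ctx.runId}:${ctx.attemptNumber}:${base}`; // changes on every attempt
}

const firstAttempt: RunContext = { runId: "run_abc", attemptNumber: 1 };
const secondAttempt: RunContext = { runId: "run_abc", attemptNumber: 2 };

console.log(effectiveKey(["m_1", "notify"], "run", firstAttempt)); // run_abc:m_1:notify
console.log(effectiveKey(["m_1", "notify"], "run", secondAttempt)); // run_abc:m_1:notify
console.log(effectiveKey(["m_1", "notify"], "attempt", secondAttempt)); // run_abc:2:m_1:notify
```

A run-scoped key is identical on attempt one and attempt two, which is what lets a retry recognize its own earlier work; an attempt-scoped key changes, so the work is allowed again.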

The two scopes you’ll actually reach for are global (from the app) and run (inside a task). Step through both below.

// In a Server Action
const day = todayInTimeZone(org.timeZone);
const key = await idempotencyKeys.create([org.id, 'export', day], {
  scope: 'global',
});
await exportInvoices.trigger({ organizationId: org.id }, {
  idempotencyKey: key,
  idempotencyKeyTTL: '3d',
});

// Inside the task
for (const member of members) {
  const memberKey = await idempotencyKeys.create([member.id, 'notify'], {
    scope: 'run',
  });
  await sendOne.triggerAndWait({ memberId: member.id }, {
    idempotencyKey: memberKey,
  });
}

The app-side key. It’s built from [org.id, 'export', day] with scope: 'global', which is namespaced against nothing, so it means “one export for this org on this day, period.” Double-click the export button, retry the POST, fire the action ten times: the first wins, and the rest return that same run. The business rule lives in the key.

The in-task key. It’s built per member from [member.id, 'notify'] with scope: 'run' (the default). Because it’s hashed with the parent run id, a retry of this task regenerates the identical key for each member, and the runtime returns the prior child run instead of sending again.

The TTL is how long that key keeps returning the original run. '3d' means a duplicate export request arriving up to three days later still collapses onto the first run. Leave it off and you’d get the thirty-day default.

That second block is the fix for the duplicate-send trap you just watched. Here it is on its own, because this exact loop is the shape that closes the whole thread.

for (const member of members) {
  const key = await idempotencyKeys.create([member.id, 'notify'], {
    scope: 'run',
  });
  await sendOne.triggerAndWait({ memberId: member.id }, { idempotencyKey: key });
}

Walk it against the trap. scope: 'run' ties each key to the parent run, so when the outer task throws on member two hundred and retries, it regenerates the same two hundred keys it built the first time. For members one through one hundred and ninety-nine, the runtime sees a key whose run already completed and hands back the cached result, so no second email goes out. Only the member it hadn’t reached yet actually sends. This one line is the difference between the duplicated output you predicted a moment ago and a clean one-email-per-member run.
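The same walk-through can be run as a simulation. Everything here is a stand-in: sendOnce plays the role of sendOne.triggerAndWait with a key, the completedSends record plays the role of the runtime's memory of finished child runs, and the transient blip is hard-coded to hit the first attempt only.

```typescript
// Child results keyed by idempotency key; survives across attempts of one run.
const completedSends: Record<string, boolean> = {};
const emailsSent: string[] = [];

// Stand-in for sendOne.triggerAndWait with an idempotency key.
function sendOnce(memberId: string, key: string): void {
  if (completedSends[key]) return; // key already completed: cached, no resend
  emailsSent.push(memberId);
  completedSends[key] = true;
}

let notifyAttempt = 0;
function notifyAll(members: string[], runId: string): void {
  notifyAttempt++;
  for (const member of members) {
    // Simulated transient blip: the first attempt dies before reaching Cy.
    if (member === "Cy" && notifyAttempt === 1) throw new Error("transient blip");
    // Scope 'run' is modeled by putting the run id into the key.
    sendOnce(member, `${runId}:${member}:notify`);
  }
}

for (let i = 0; i < 2; i++) {
  try {
    notifyAll(["Ada", "Bo", "Cy"], "run_abc");
    break;
  } catch {
    // The runtime retries the run from the top.
  }
}

console.log(emailsSent); // ["Ada", "Bo", "Cy"]
```

The second attempt still walks the whole loop, but Ada and Bo hit completed keys and only Cy actually sends.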

So treat this as a rule, not a nicety: an idempotency key is required on every trigger, triggerAndWait, and wait.forToken. It isn’t an optimization you add when you remember; it’s non-optional, the same way a Server Action’s input schema is non-optional. A trigger without a key is a duplicate side effect waiting for its first retry.

Which key, with which scope, actually dedupes the per-member sends across a parent retry? Choose carefully, because composition and scope both matter.

Inside a task that loops over members and may itself be retried at the run level, which key correctly sends each member exactly one notification — even across a retry of the parent?

const recipientKey = await idempotencyKeys.create(
  [member.id, 'notify'],
  { scope: 'global' },
);

const recipientKey = await idempotencyKeys.create(
  [member.id, 'notify'],
  { scope: 'run' },
);

const recipientKey = await idempotencyKeys.create(
  ['notify'],
  { scope: 'run' },
);

scope: 'run' namespaces the key to the parent run, so a retry of the task regenerates the same per-member keys and the already-completed sends come back cached. scope: 'global' would dedupe that member across every run forever — tomorrow’s notification to the same person would silently never fire. And dropping member.id collapses every member onto one shared key, so only the very first member is ever notified.

Durable pauses: wait.for and wait.until

Idempotency keys make a re-run step safe. But re-runs and checkpoints only exist because something put a boundary in the task. Durable waits are how you create those boundaries on purpose, and they come with two gotchas worth meeting head-on.

Start with the relative one. wait.for pauses for a duration, as in await wait.for({ seconds: 2 }) or await wait.for({ minutes: 5 }). Three things happen, and all three matter. It checkpoints, so it’s a save point. It frees the worker, meaning the compute process is released and you are billed nothing while the run sleeps. And it resumes after the duration on a possibly-new worker, which is precisely why it survives a crash: there’s no live process holding the pause, just a checkpoint and a wake-up time. Reach for it to pace a loop between export pages, or to back off until a rate-limit window reopens.

The trap is that wait.for looks like setTimeout, and setTimeout is the wrong tool here in two compounding ways. Compare them directly.

await new Promise((resolve) => setTimeout(resolve, 2_000));

Wrong on a durable platform. The worker sits there doing nothing for the full two seconds, so you pay for the idle wait, and the timer lives in process memory, so a crash evaporates it. The run never resumes; the pause is simply gone with the worker.

await wait.for({ seconds: 2 });

Right. It checkpoints and releases the worker, so no compute is billed while it sleeps, and because the pause is a checkpoint rather than a live timer, a crash mid-wait resumes cleanly on a new worker.

The absolute version is wait.until: await wait.until({ date }). It has the same checkpoint, free-the-worker, resume semantics, but instead of a duration it waits to a specific wall-clock moment. This is what you want for “send the welcome email twenty-four hours after signup” or “act exactly at the trial’s period end.”

Its gotcha is quieter than the setTimeout one and therefore sneakier. If the date you pass is already in the past, wait.until resolves immediately. It does not error, and it does not skip the rest of the task; it just falls straight through as if there were no wait at all. So if your intent is “do nothing once this date has passed,” wait.until will not enforce that for you, and you have to check the date yourself before you decide to act. Keep that in mind so a past date never surprises you later.
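Checking the date yourself can be as small as one comparison made before the wait. The helper below is a hypothetical sketch: shouldStillAct and its graceMs tolerance are not part of the SDK, and whether a slightly late date should still count is a business decision the code cannot make for you.

```typescript
// Decide up front whether a scheduled action is still worth taking.
// `graceMs` is a hypothetical tolerance for dates that are only slightly late.
function shouldStillAct(target: Date, now: Date, graceMs: number = 0): boolean {
  return target.getTime() + graceMs >= now.getTime();
}

const trialEnd = new Date("2030-01-01T00:00:00Z");
console.log(shouldStillAct(trialEnd, new Date("2029-12-31T00:00:00Z"))); // true
console.log(shouldStillAct(trialEnd, new Date("2030-01-02T00:00:00Z"))); // false
```

In a task body, the guard would run first: return early when it is false, and only then call wait.until with the date.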

A paginated export that survives a crash

Now assemble all four pieces, namely checkpoints, retries, per-step keys, and durable waits, into one realistic task, the direct precursor to the CSV export you build in the next chapter. The job is to export an organization’s invoices in pages of five hundred, update progress as it goes, then email a download link. Read it as a whole first, then we’ll kill the worker and watch it recover.

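import { idempotencyKeys, metadata, schemaTask, wait } from '@trigger.dev/sdk';
import { z } from 'zod';
// writePage and sendReadyEmail are child tasks defined elsewhere in the project.

export const exportInvoices = schemaTask({
  id: 'export-invoices',
  schema: z.object({ organizationId: z.string(), totalPages: z.number() }),
  retry: { maxAttempts: 5, factor: 1.8, randomize: true },
  run: async ({ organizationId, totalPages }) => {
    for (let page = 1; page <= totalPages; page++) {
      // Key parts are strings, so the page number is stringified.
      const key = await idempotencyKeys.create([organizationId, 'page', String(page)]);
      await writePage.triggerAndWait({ organizationId, page }, { idempotencyKey: key });
      metadata.set('page', page); // publish progress
      await wait.for({ seconds: 2 }); // boundary between pages, also paces the load
    }
    // The final email is a side effect too, so it gets its own key.
    const doneKey = await idempotencyKeys.create([organizationId, 'export-email']);
    await sendReadyEmail.trigger({ organizationId }, { idempotencyKey: doneKey });
  },
});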

The task has no session, so tenancy rides in on the payload as organizationId and travels onward into every child (writePage scopes its own queries from it). The org context is cargo, re-derived from data, never ambient.

The page loop. Each page gets its own scope: 'run' key (the default) built from the page number, then writePage.triggerAndWait runs it as a durable child step. A retry of the export regenerates these keys, so already-written pages return cached instead of re-exporting.

After each page, metadata.set publishes progress to the dashboard and the in-app inspector: a live “page 7 of 20” with no extra plumbing.

The pause between pages does double duty. It’s a checkpoint boundary, the save point that makes a mid-export crash resumable, and it paces the load on the database and downstream so the export doesn’t hammer them flat.

The final email. It carries its own idempotency key too, so if the whole task retries after the last page, the “your export is ready” email is sent once, not twice. Every side effect in the task is key-guarded, with no exceptions.

Now kill it. The worker dies at page seven of twenty. A new worker spins up and rehydrates from the last checkpoint, the wait.for after page six. Pages one through six don’t re-export: their triggerAndWait calls re-issue the same scope: 'run' idempotency keys, the runtime sees those child runs already completed, and it returns their cached results. Page seven, the one in flight when the worker died, re-executes from the top, and that’s fine, because writing a page is the work, not a duplicate side effect. The loop carries on to page twenty, the final email fires exactly once under its own key, and the run finishes.

Every claim in that paragraph traces back to a named mechanism from this lesson. The checkpoint at the wait.for is why there’s somewhere to resume from. The run-level retry is what re-ran the task at all. The per-step idempotency keys are why pages one through six came back cached instead of duplicated. The durable wait is what created the boundary in the first place. That’s the whole lesson in one task, and it is, almost line for line, the shape the CSV export ships in the next chapter.

Cancellation, briefly

One last thing to name, not to drill. Sometimes a run needs to stop before it finishes: a user cancels an export, or an admin kills a runaway job. A run can be canceled from the Trigger.dev dashboard, or programmatically with runs.cancel(runId).

Cancellation, though, is cooperative, and the word matters. The runtime stops scheduling new steps the moment you cancel, but a step that’s already running only stops if you wired it to. Inside the body, the run exposes an AbortSignal that fires on cancel; forward it into your fetch and SDK calls and an in-flight HTTP request actually aborts. Skip that and your canceled run keeps grinding through whatever call it was midway through until that call returns on its own. Cooperative cancellation in one line: the runtime stops the next step for free, and stopping the current one is on you. Forward the signal and check it at step boundaries.
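Checking the signal at step boundaries can look like the loop below. It is a minimal sketch under stated assumptions: CancelSignal is a hypothetical structural subset of AbortSignal (just the aborted flag) so the example runs without any runtime globals, and exportPages stands in for a task body.

```typescript
// Structural subset of AbortSignal, so the sketch needs no runtime globals.
type CancelSignal = { readonly aborted: boolean };

// Processes pages until done, or until cancellation is seen at a step boundary.
function exportPages(
  total: number,
  signal: CancelSignal,
  onPage: (page: number) => void,
): number {
  let written = 0;
  for (let page = 1; page <= total; page++) {
    if (signal.aborted) break; // check at the boundary, before starting a step
    onPage(page);
    written++;
  }
  return written;
}

const cancelFlag = { aborted: false };
const pagesWritten = exportPages(20, cancelFlag, (page) => {
  if (page === 3) cancelFlag.aborted = true; // a cancel arrives during page 3
});
console.log(pagesWritten); // 3
```

Page 3 finishes because it was already in flight; the check stops page 4 from starting. Aborting the in-flight call itself is the part that needs the real signal forwarded into fetch or the SDK call.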

External resources

This API moved recently. Trigger.dev v4 is the current line, and the web is still full of v3 examples that will quietly mis-teach you. When you reach for a detail this lesson didn’t cover, go to the canonical current docs, not a blog post.

Trigger.dev: Idempotency (trigger.dev). Idempotency keys, TTL, and scopes: the canonical reference for the primitive this lesson centers on.

Trigger.dev: Errors & Retrying (trigger.dev). The retry config, AbortTaskRunError, and the full back-off surface, current as of v4.

AWS Builders' Library: Timeouts, retries, and backoff with jitter (aws.amazon.com). Marc Brooker's gold-standard write-up of why backoff needs jitter: the timeless theory under this lesson's retry knobs.

You now have the mental model the rest of background work hangs on. A durable task is a chain of small, idempotent steps separated by checkpoints. The runtime owns resume and retry; you own two things: making each step safe to run twice, and choosing where the boundaries go. In the next lesson you’ll add the one kind of wait this one deliberately skipped, pausing a task on a signal from the outside world such as a human clicking approve or a third party calling you back, along with the mandatory timeout that keeps such a wait from leaking.