Workflows that survive contact with reality: durable execution, explained
Most workflow engines fail the same way: they assume nothing fails
Naive workflow engines run a process top to bottom in a single execution context. If the process is fast and the dependencies are reliable, this works. If the process spans hours or days, calls APIs that occasionally rate-limit, or waits on humans, the same approach produces ghosts — workflow runs that are technically alive but actually stuck, with no clean way to know the difference.
The pattern we see when teams hit this wall: someone writes a custom retry loop, then a custom state machine, then a custom queue, then a custom dashboard to see which runs are stuck. Eighteen months later they have built a worse version of a durable execution engine, and they cannot ship anything new because they are debugging the workflow engine they accidentally built.
Durable execution checkpoints every step so workflows survive restarts
A durable execution engine treats every step of a workflow as a checkpoint. The step's input, the step's output, and the workflow's logical position are persisted before the next step runs. If the worker process dies mid-workflow, another worker picks up at the last checkpoint with the same inputs and the same logical state. The workflow does not 'restart' — it 'resumes.'
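A minimal sketch of what that looks like, assuming a hypothetical store object that persists checkpoints durably; the names are illustrative, not the engine's actual SDK.

```python
# Sketch of checkpointed execution. `store` is a hypothetical durable store
# (e.g. a database table), not the product's real API.

def run_workflow(run_id, steps, store):
    # On start or resume, find the last completed step for this run.
    position = store.last_completed_step(run_id)  # -1 if nothing has run yet

    for index in range(position + 1, len(steps)):
        step = steps[index]
        inputs = store.load_inputs(run_id, index)  # derived from prior outputs
        output = step(inputs)                      # execute the step

        # Persist input, output, and logical position before advancing.
        # If the worker dies after this line, another worker resumes at
        # index + 1; if it dies before, the step is retried with the same inputs.
        store.checkpoint(run_id, index, inputs, output)
```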
- Success rate: 99.7% first-attempt, prod median
- Connectors: 300+ native + generic
- Throughput: millions/day, horizontally scaled
- Active workflows: 1,400 avg per tenant
Idempotency is the contract every step has to honor
Resumable execution requires that every step be safe to retry. Send-payment cannot send the payment twice if the worker crashed after sending it but before checkpointing. The step has to be idempotent: same input always produces the same observable effect. This is implementation discipline, not a framework feature, and it is the most common reason hand-rolled workflow engines produce double-charged customers.
Our Workflow Automation engine enforces this at the SDK level. Every step accepts an idempotency key, downstream connectors deduplicate against it, and the engine refuses to run a step that has not declared how it handles retry. The discipline is opinionated; it is also why production runs hit a 99.7% first-attempt success rate.
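A sketch of a retry-safe payment step under those assumptions: the idempotency key is derived from the run and step, so a crash-and-retry replays the same key instead of issuing a new charge. The payments_api connector here is a hypothetical stand-in, not the actual SDK surface.

```python
# Sketch of an idempotent payment step. `payments_api` is an assumed connector
# that deduplicates on the idempotency key (returning the original result
# when it sees the same key again).

import hashlib

def send_payment(run_id, step_name, customer_id, amount_cents, payments_api):
    # The key is stable across retries of this step in this run.
    idempotency_key = hashlib.sha256(f"{run_id}:{step_name}".encode()).hexdigest()

    # Retrying after a crash replays the same key, so the observable effect
    # (one charge) is the same no matter how many times the step runs.
    return payments_api.charge(
        customer_id=customer_id,
        amount_cents=amount_cents,
        idempotency_key=idempotency_key,
    )
```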
Saga compensation is how you unwind multi-step workflows that fail late
A workflow that creates a customer, charges a card, provisions infrastructure, and sends a welcome email has four side-effects, any of which can fail. The naive recovery — undo it all — does not work because the steps already happened in the world. Saga compensation runs explicit reverse actions for each completed step: refund the card, deprovision the infrastructure, mark the customer as canceled, send a 'we hit a snag' email.
The compensation logic is not optional. It is a first-class part of every workflow that has external side effects. Engines that do not support compensation force every workflow author to write it from scratch, badly.
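A minimal sketch of the pattern: each step declares its reverse action up front, and a late failure unwinds the completed steps in reverse order. The function below is illustrative, not the engine's API.

```python
# Sketch of saga compensation: steps are (action, compensation) pairs, and a
# failure after partial completion undoes the finished steps newest-first.

def run_saga(steps):
    """steps: list of (action, compensation) callables, executed in order."""
    completed = []  # (compensation, result) for every step that finished
    try:
        for action, compensation in steps:
            result = action()
            completed.append((compensation, result))
    except Exception:
        # Unwind in reverse: the most recent side effect is undone first,
        # e.g. refund the card before cancelling the customer record.
        for compensation, result in reversed(completed):
            compensation(result)
        raise
```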
Human-in-the-loop steps wait without consuming resources
A workflow that pauses for human approval — a manager review, a compliance check, a customer signature — should not hold a worker thread for three days. Durable execution treats human-in-the-loop as a step that yields the workflow's state, persists the wait, and resumes when an external event arrives. The cost of a paused workflow is approximately zero.
This is also where audit posture comes from. Every approval has the approver, the timestamp, the input the approver saw, and the decision recorded as a checkpointed step. The compliance team gets the trail; the engineer never wrote special code for it.
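A sketch of the shape of a human-in-the-loop wait, with a hypothetical store and queue standing in for the engine's persistence and scheduling: nothing blocks while the run is paused, and the approval arrives later as an external event.

```python
# Sketch of a human-in-the-loop step. `store` and `queue` are illustrative
# stand-ins; the point is that a paused run is just a persisted row.

def request_approval(store, run_id, step_name, payload):
    # Persist exactly what the approver will see, then park the run.
    store.save_wait(run_id, step_name, shown_to_approver=payload)
    return "PAUSED"  # the worker is immediately free for other runs

def on_approval_webhook(store, queue, run_id, step_name, approver, decision):
    # The external event may arrive days later. Record it as a checkpointed
    # step (approver, timestamp, input shown, decision) and wake the run.
    store.complete_wait(run_id, step_name, approver=approver, decision=decision)
    queue.enqueue_resume(run_id)
```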
Retry with backoff and jitter is an implementation detail, not a strategy
Every workflow step gets configurable retry: maximum attempts, exponential backoff, jitter to avoid a thundering herd, dead-letter routing. The engine handles it. Workflow authors specify the policy; they do not write the loop. This sounds trivial, and it is, until the post-mortem on the Tuesday outage traces a cascading failure back to your team's hand-rolled retry logic.
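For intuition, here is the loop the engine runs on a step's behalf once the author declares a policy: capped exponential backoff with full jitter, then dead-letter routing when attempts are exhausted. The function and parameter names are illustrative.

```python
# Sketch of a retry policy: capped exponential backoff with full jitter.
# `dead_letter` is an illustrative hook for routing exhausted runs.

import random
import time

def run_with_retry(step, inputs, max_attempts=5, base_delay=1.0,
                   max_delay=60.0, dead_letter=None):
    for attempt in range(1, max_attempts + 1):
        try:
            return step(inputs)
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter is not None:
                    dead_letter(inputs, exc)  # park for manual inspection
                raise
            # Exponential backoff, capped, with full jitter so every failed
            # run does not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```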
Versioning lets you change workflows without breaking in-flight runs
Workflows are long-running. A workflow that started yesterday on version 1.2 might still be paused on a human approval today, when version 1.3 ships. The engine needs to support both versions running concurrently — runs that started on 1.2 finish on 1.2, runs that started on 1.3 use the new logic — without forcing engineers to migrate in-flight executions or freeze deploys.
Without this, every workflow change requires a deploy freeze and a manual migration of stuck runs. Teams stop iterating on workflows because the cost of changing one is too high. With versioning, workflows evolve continuously and the engine handles the bookkeeping.
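A sketch of the bookkeeping, with an illustrative registry and version strings: a run records the workflow version when it starts and keeps resolving that definition on every resume, even after a newer version ships.

```python
# Sketch of version pinning. The registry, workflow names, and step lists
# are illustrative, not a real deployment.

REGISTRY = {
    "onboarding:1.2": ["create_customer", "charge_card", "send_welcome_email"],
    "onboarding:1.3": ["create_customer", "verify_kyc", "charge_card",
                       "send_welcome_email"],
}
LATEST = {"onboarding": "1.3"}

def start_run(store, workflow):
    version = LATEST[workflow]
    return store.create_run(workflow, version)  # pin the version at start

def definition_for(store, run_id):
    workflow, version = store.get_run(run_id)
    # A run that started on 1.2 resumes against the 1.2 definition,
    # even though 1.3 has shipped since.
    return REGISTRY[f"{workflow}:{version}"]
```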
Observability per workflow run is the difference between a debuggable failure and a mystery
Every workflow run produces a complete trace: every step's input, every step's output, every retry attempt, every checkpoint, every external API call, every human action. Failures are investigable from a single pane of glass. Engineers debug by replaying the trace, not by sshing into a worker and grepping logs.
OpenTelemetry-native instrumentation means the workflow trace integrates with your existing Datadog, New Relic, or Honeycomb stack. The workflow engine is not a separate observability silo; it joins the rest of your distributed-systems telemetry.
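As a sketch of what that integration looks like with the OpenTelemetry Python API, each step can run inside a span carrying the run ID, step name, and attempt number; the attribute names below are illustrative, not a fixed schema.

```python
# Sketch of per-step tracing with OpenTelemetry, so workflow runs show up
# as traces alongside the rest of your distributed-systems telemetry.

from opentelemetry import trace

tracer = trace.get_tracer("workflow-engine")

def run_step(run_id, step_name, attempt, step, inputs):
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("workflow.run_id", run_id)
        span.set_attribute("workflow.step", step_name)
        span.set_attribute("workflow.attempt", attempt)
        return step(inputs)
```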
AI-as-a-step is a first-class node, not a special case
Most workflow engines treat an LLM call as a generic HTTP request. We treat it as a typed node: declared inputs, declared output schema, evaluation hooks, fallback paths. When the model returns malformed JSON, the engine does not silently fail; it triggers the fallback. When the model regresses, the eval set surfaces it before it ships.
This matters because AI nodes fail differently from API calls. An API either returns a 200 or it doesn't. A model returns a 200 with content that may or may not be useful. The engine has to know the difference.
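A sketch of what a typed AI node adds beyond a raw HTTP call: the output is parsed against a declared schema, and anything malformed routes to a declared fallback instead of failing silently. The call_model and fallback callables are assumed stand-ins, not the engine's actual API.

```python
# Sketch of an AI node with a declared output schema and a fallback path.
# `call_model` and `fallback` are assumptions standing in for the model call
# and the fallback step.

import json

REQUIRED_FIELDS = {"category": str, "confidence": float}

def ai_classify(call_model, fallback, prompt):
    raw = call_model(prompt)  # a 200 is not the same as a useful answer
    try:
        parsed = json.loads(raw)
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(parsed.get(field), expected_type):
                raise ValueError(f"bad or missing field: {field}")
    except (json.JSONDecodeError, ValueError):
        # Malformed or off-schema output does not fail silently;
        # the declared fallback path runs instead.
        return fallback(prompt)
    return parsed
```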
We had a workflow that ran for three weeks across 14 systems with two human approval gates. When the third-party payment processor had a four-hour outage in the middle, the engine just paused on retry and resumed when the API was back. The workflow finished a day late. Nobody noticed. That is when I knew we were on the right engine.
— Director of Engineering, ops platform client
When durable execution earns its keep
Durable execution is overkill for short, reliable workflows. It pays off when you have any of the following: workflows that span hours or days, processes that touch three or more external systems, human-in-the-loop steps, regulated industries that require audit trails, or any workflow where 'it ran but I do not know what happened' is unacceptable. For most production workflows in real businesses, all of those apply.
Frequently asked
What is durable workflow execution?
Durable workflow execution is an architecture where every step of a workflow is checkpointed — its input, output, and the workflow's logical position are persisted — so the workflow can resume from the last successful step after any failure. Worker crashes, network outages, downstream API timeouts, and human delays all become recoverable rather than catastrophic. The workflow does not restart; it resumes.
How is a durable workflow engine different from a job queue?
A job queue runs individual tasks; a durable workflow engine orchestrates multi-step processes that may span hours or days, with branching logic, retries, compensations, and human-in-the-loop steps. The queue does not know that step three is supposed to follow step two; the workflow engine does, and it persists that logical state across failures.
What is saga compensation in a workflow context?
Saga compensation is the pattern of running explicit reverse actions when a multi-step workflow fails after side-effects have already happened. If step three fails after steps one and two succeeded, the engine runs the declared compensation for step two and step one in reverse order. It is how you unwind a partially-completed workflow without leaving the system in an inconsistent state.
Why is idempotency important for workflows?
Because durable execution requires that every step be safe to retry. If a worker crashes after sending a payment but before checkpointing, another worker will retry the step. Without idempotency — same input, same observable effect — the customer gets charged twice. Our engine enforces idempotency at the SDK level, so a step cannot ship without declaring how it handles retry.
Can a workflow engine handle approvals that take days?
Yes, when the engine supports human-in-the-loop steps as first-class. The workflow yields, the engine persists the wait, and execution resumes when an external event — an approval click, a webhook from a signature service — arrives. The cost of a paused workflow is approximately zero, so workflows that wait days or weeks for human action are routine, not exceptional.
How does workflow versioning work?
Workflows that started on a previous version finish on that version, and new workflow runs use the new version. The engine supports running multiple versions concurrently and handles the bookkeeping. Without this, every workflow change requires a deploy freeze and a manual migration of in-flight runs, which is why teams without versioning stop iterating on their workflows.
When is durable execution worth it versus a simpler approach?
Durable execution earns its keep when workflows span hours or days, touch three or more external systems, include human-in-the-loop steps, run in regulated industries, or cannot tolerate 'it ran but I don't know what happened.' For short, single-system workflows on reliable infrastructure, a job queue or a cron is enough. For most production workflows in real businesses, durable execution is the right default.