Long-Running Workflow Orchestration#

Most agent examples show single-turn or single-session tasks: answer a question, write a function, debug an error. Real projects are different. Building a feature, migrating a database, setting up a monitoring stack – these take hours, span multiple sessions, involve parallel work streams, and must survive context window resets, session timeouts, and partial failures.

This article covers the architecture for workflows that last hours or days: how to model progress as a state machine, how to checkpoint for reliable resumption, how to delegate to parallel sub-agents without losing coherence, and how to recover when things fail partway through.

State Machine Design#

A long-running workflow is a state machine. Each state represents a phase of work. Transitions happen when a phase completes, fails, or needs human input. The state is persisted to disk so any session can resume from where the last one left off.

Minimal State Machine#

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│ PLANNING │────▶│ IMPLEMENTING │────▶│   TESTING    │
└──────────┘     └──────────────┘     └──────────────┘
     │                  │                    │
     │                  │                    │
     ▼                  ▼                    ▼
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│ BLOCKED  │     │   BLOCKED    │     │   DEPLOYING  │
│(needs    │     │ (needs human │     │              │
│ input)   │     │  review)     │     └──────┬───────┘
└──────────┘     └──────────────┘            │
                                             ▼
                                      ┌──────────────┐
                                      │   COMPLETE   │
                                      └──────────────┘
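
In code, the machine stays small: a type for the phases and a map of legal transitions. Below is a minimal TypeScript sketch of the diagram above; the edges out of BLOCKED are an assumption, since the diagram leaves them implicit. Persistence lives in the state file described next, not in the process.

```ts
// Phases from the diagram above.
type Phase =
  | "PLANNING"
  | "IMPLEMENTING"
  | "TESTING"
  | "DEPLOYING"
  | "BLOCKED"
  | "COMPLETE";

// Legal transitions. The diagram leaves the edges out of BLOCKED implicit,
// so the assumption here is that unblocking returns to the phase that blocked.
const TRANSITIONS: Record<Phase, Phase[]> = {
  PLANNING: ["IMPLEMENTING", "BLOCKED"],
  IMPLEMENTING: ["TESTING", "BLOCKED"],
  TESTING: ["DEPLOYING"],
  DEPLOYING: ["COMPLETE"],
  BLOCKED: ["PLANNING", "IMPLEMENTING"],
  COMPLETE: [],
};

// Reject any transition the diagram does not allow.
function assertTransition(from: Phase, to: Phase): void {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
}

assertTransition("IMPLEMENTING", "TESTING"); // fine
// assertTransition("PLANNING", "DEPLOYING"); // would throw: phases cannot be skipped
```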

State File on Disk#

<!-- workflow-state.md -->
# Workflow State

**Current phase**: IMPLEMENTING
**Started**: 2026-02-22 09:00
**Last checkpoint**: 2026-02-22 11:30

## Phase Status
| Phase | Status | Entered | Completed |
|---|---|---|---|
| Planning | COMPLETE | 09:00 | 09:45 |
| Implementing | IN_PROGRESS | 09:45 | -- |
| Testing | PENDING | -- | -- |
| Deploying | PENDING | -- | -- |

## Implementation Progress
- [x] Database schema (committed: abc123)
- [x] API routes: search, categories (committed: def456)
- [ ] API routes: feedback, suggestions  <-- CURRENT
- [ ] Rate limiting middleware
- [ ] Request logging

## Blocking Issues
(none)

## Decisions Log
- 09:15: Chose D1 over Postgres for database (simpler, free tier, sufficient for our scale)
- 09:30: Chose single-file Worker pattern (under 500 lines, no need to split)
- 10:00: Chose FTS5 over LIKE queries for search (ranking, performance)

Any agent reading this file knows exactly where the project stands, what has been decided, and what to work on next. The file costs 500-1,000 tokens of context and replaces tens of thousands of tokens of conversation replay.
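
The file is written for agents to read as prose, but if an orchestration script needs the same facts programmatically, a few lines of parsing are enough. A sketch, assuming the field labels and checkbox format shown above:

```ts
import { readFileSync } from "node:fs";

// Extract the current phase and the unchecked items from workflow-state.md.
// Assumes the "**Current phase**:" label and "- [ ]" checkbox format above.
const state = readFileSync("workflow-state.md", "utf8");

const phase = state.match(/\*\*Current phase\*\*:\s*(\w+)/)?.[1] ?? "UNKNOWN";
const remaining = state
  .split("\n")
  .filter((line) => line.trimStart().startsWith("- [ ]"))
  .map((line) => line.replace(/- \[ \]\s*/, "").trim());

console.log(`Phase: ${phase}, remaining items: ${remaining.length}`);
```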

Checkpoint-and-Resume Architecture#

The key insight: a checkpoint is not a save file – it is a re-entry point. A good checkpoint contains enough information for a fresh agent (with zero prior context) to pick up the work and make progress immediately.

What Makes a Good Checkpoint#

# Checkpoint: API Routes 60% Complete
**Date**: 2026-02-22 11:30
**Resumable by**: Any agent with access to this repository

## Context Required to Resume
1. Read `CLAUDE.md` for project conventions
2. Read this checkpoint for current state
3. Read `TODO.md` for remaining work
4. Read `src/index.ts` for existing implementation

## State Summary
- 3 of 5 API endpoints implemented (search, categories, get-by-id)
- Rate limiting and request logging not yet started
- All implemented endpoints tested manually, working

## Key Decisions (do not re-evaluate)
- Using FTS5 for search with rank ordering
- Cache search results in KV for 5 minutes
- Rate limit: 60 req/min per IP via KV with TTL expiration
- Hash IPs with SHA-256 for privacy in request logs

## Current File State
- `src/index.ts`: 450 lines, 3 route handlers implemented
- `schema/0001-init.sql`: Complete, applied to production
- `wrangler.jsonc`: Complete, D1 and KV bindings configured

## Next Actions (in order)
1. Implement POST /api/v1/feedback endpoint
2. Implement POST /api/v1/suggestions endpoint
3. Add rate limiting middleware (checkRateLimit function exists, wire into fetch handler)
4. Add request logging with ctx.waitUntil()
5. Test all endpoints

Checkpoint Frequency#

| Project Duration | Checkpoint Frequency | Trigger |
|---|---|---|
| 30 minutes | End of session only | Task completion |
| 1-2 hours | Every 30-45 minutes | Phase boundaries, before risky operations |
| 3-6 hours | Every 20-30 minutes | Phase boundaries, before sub-agent delegation, before breaks |
| Full day+ | Every completed subtask | Each TODO item completion, each commit |

Over-checkpointing is better than under-checkpointing. A checkpoint costs 2 minutes to write and saves 30+ minutes of context reconstruction. When context compression starts summarizing earlier work, uncheckpointed decisions are lost permanently.
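
Checkpoint writing is cheap enough to automate. Here is a sketch of a helper the leader might call at each trigger point; the `checkpoints/` directory and the git commit are illustrative choices, not requirements:

```ts
import { execSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";

// Write a timestamped checkpoint file and commit it, so every checkpoint
// doubles as a git restore point.
function writeCheckpoint(title: string, body: string): string {
  mkdirSync("checkpoints", { recursive: true });
  const stamp = new Date().toISOString().replace(/[:.]/g, "-");
  const path = `checkpoints/${stamp}.md`;
  writeFileSync(path, `# Checkpoint: ${title}\n\n**Date**: ${new Date().toISOString()}\n\n${body}\n`);
  execSync(`git add ${path}`);
  execSync(`git commit -m "checkpoint: ${title}"`);
  return path;
}

// Example: after finishing a TODO item.
writeCheckpoint(
  "Feedback endpoint complete",
  "## Next Actions\n1. Implement POST /api/v1/suggestions\n2. Wire rate limiting into the fetch handler",
);
```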

Resume Protocol#

When a new session starts on a checkpointed project:

1. Read CLAUDE.md                    (project conventions — 500 tokens)
2. Read workflow-state.md            (current phase, progress — 500 tokens)
3. Read latest checkpoint            (detailed state — 500 tokens)
4. Read TODO.md                      (remaining work — 300 tokens)
5. Optionally read 1-2 source files  (if the current task needs them)
                                     ─────────────────────────────
                                     Total: ~2,000-4,000 tokens

Compare to: replaying 3 hours of conversation history = 60,000-100,000 tokens

The agent is productive within seconds instead of spending tokens reconstructing context.
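
As code, the resume protocol is nothing more than file reads in a fixed order. A sketch, assuming the file names above and the common four-characters-per-token approximation; the checkpoint path is illustrative:

```ts
import { existsSync, readFileSync } from "node:fs";

// The minimal context a fresh session needs, in priority order.
const RESUME_FILES = [
  "CLAUDE.md",             // project conventions
  "workflow-state.md",     // current phase and progress
  "checkpoints/latest.md", // detailed re-entry point (illustrative path)
  "TODO.md",               // remaining work
];

const context = RESUME_FILES.filter(existsSync)
  .map((path) => `--- ${path} ---\n${readFileSync(path, "utf8")}`)
  .join("\n\n");

// Rough budget check before handing the context to the agent.
const approxTokens = Math.round(context.length / 4);
console.log(`Resume context: ~${approxTokens} tokens from ${RESUME_FILES.length} files`);
```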

Sub-Agent Delegation for Parallel Execution#

Long-running workflows almost always have subtasks that can run in parallel. Implementing three independent API endpoints. Writing tests while documentation is being drafted. Researching two alternative approaches simultaneously.

The Spec-Driven Delegation Pattern#

The leader agent writes a scoped specification document for each parallelizable subtask, spawns a sub-agent with that spec, and collects results.

Leader’s workflow:

1. Identify parallelizable subtasks from TODO
2. For each subtask:
   a. Write a spec file (specs/subtask-name.md)
   b. List exactly which files to read and modify
   c. Define the deliverable (what "done" looks like)
   d. List constraints (do not modify X, follow pattern Y)
3. Spawn sub-agents with their spec files
4. Collect results
5. Integrate, resolve any conflicts
6. Update checkpoint and TODO
7. Commit

Spec file structure:

# Spec: Implement Feedback Endpoint

## Goal
Add POST /api/v1/feedback endpoint to the Worker API.

## Files to Read
- src/index.ts (existing routes, helper functions)
- schema/0001-init.sql (article_feedback table schema)

## Files to Modify
- src/index.ts (add route handler)

## Requirements
- Validate: content_id exists in content_index table
- Validate: feedback_type is one of: helpful, inaccurate, outdated, needs-examples, other
- Validate: comment is optional, max 1000 characters
- Hash client IP for privacy (use existing hashIP function)
- Return 201 with { id, content_id, feedback_type, created_at }

## Constraints
- Follow existing patterns (see parseBody, json helpers)
- Do not modify database schema
- Do not add new dependencies
- Do not change existing route handlers

## Deliverable
- Working endpoint added to src/index.ts
- Brief report: lines changed, any decisions made

Why Specs Beat Verbal Instructions#

| Verbal delegation | Spec document delegation |
|---|---|
| “Add a feedback endpoint” | Explicit requirements, file list, constraints |
| Sub-agent reads entire codebase to understand context | Sub-agent reads 2 files listed in spec |
| Sub-agent makes assumptions about validation rules | Validation rules are specified |
| Sub-agent might modify unrelated code | Constraints prevent scope creep |
| Leader must review everything | Leader checks deliverable against spec |

A spec document costs 200-400 tokens to write. It saves thousands of tokens the sub-agent would spend exploring and guessing, and it prevents integration conflicts.

Collecting and Integrating Results#

Sub-agents return structured results:

## Result: Feedback Endpoint

**Status**: Complete
**Lines added**: 95-138 in src/index.ts
**Decisions**: Used crypto.randomUUID() for feedback ID, consistent with existing patterns
**Tests needed**: POST with valid data, POST with invalid feedback_type, POST with nonexistent content_id
**Issues**: None

The leader reads 100-token summaries from each sub-agent instead of their full execution traces. If a result looks wrong, the leader can read the modified files directly – but only when needed, not by default.
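
A sketch of the spawn-and-collect loop: the `runSubAgent` parameter stands in for whatever your agent framework uses to start an isolated session from a spec file, and the result shape is an assumption that mirrors the report format above.

```ts
// Structured report each sub-agent returns (mirrors the result format above).
interface SubAgentResult {
  spec: string;
  status: "complete" | "failed";
  summary: string;     // the ~100-token report, not the full execution trace
  decisions: string[];
  issues: string[];
}

// Run sub-agents in parallel, one per spec file, and read only their summaries.
async function delegate(
  specs: string[],
  runSubAgent: (specPath: string) => Promise<SubAgentResult>, // framework-specific, assumed
): Promise<SubAgentResult[]> {
  const results = await Promise.all(specs.map(runSubAgent));

  for (const r of results) {
    console.log(`[${r.status}] ${r.spec}: ${r.summary}`);
    if (r.status === "failed" || r.issues.length > 0) {
      // Only now does the leader open the modified files or rerun with a tighter spec.
      console.warn(`  issues: ${r.issues.join("; ") || "failed without report"}`);
    }
  }
  return results;
}

// Example:
// await delegate(["specs/feedback-endpoint.md", "specs/suggestions-endpoint.md"], runSubAgent);
```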

Conflict Resolution Between Sub-Agents#

When multiple sub-agents modify the same file (common with single-file patterns), conflicts will occur. Handle them in order:

  1. Prevent conflicts by design. Assign non-overlapping sections of the file. “Agent A: add routes in the route handler section. Agent B: add helper functions above the route handler.”

  2. Sequential integration. Run sub-agents in parallel but integrate their changes sequentially. Apply Agent A’s changes, commit, then apply Agent B’s changes to the updated file (see the sketch after this list).

  3. Leader merges. Both sub-agents produce their changes. The leader reads both diffs and merges them manually. This is the fallback when changes overlap.
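
A sketch of strategy 2 in practice: each sub-agent hands back a patch, and the leader applies and commits them one at a time so a conflicting patch never contaminates an already-integrated one. The patch file names are illustrative.

```ts
import { execSync } from "node:child_process";

// Apply each sub-agent's patch in sequence, committing between patches so
// every integration step is individually revertible.
function integrateSequentially(patches: { file: string; label: string }[]): void {
  for (const { file, label } of patches) {
    try {
      execSync(`git apply --3way ${file}`, { stdio: "inherit" });
      execSync(`git commit -am "integrate: ${label}"`, { stdio: "inherit" });
    } catch {
      // Overlapping changes: discard the partial apply and fall back to
      // strategy 3, a manual merge by the leader.
      execSync("git checkout -- .", { stdio: "inherit" });
      console.error(`Patch "${label}" did not apply cleanly; left for manual merge.`);
    }
  }
}

// Example:
// integrateSequentially([
//   { file: "patches/agent-a-routes.patch", label: "feedback + suggestions routes" },
//   { file: "patches/agent-b-helpers.patch", label: "rate limiting helpers" },
// ]);
```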

Failure Recovery#

Long-running workflows will encounter failures. A sub-agent produces incorrect output. An API call fails after retries. A migration reveals a schema incompatibility. The key question is not whether failures happen but whether the workflow can recover without starting over.

Failure Modes and Recovery#

| Failure | Recovery Strategy |
|---|---|
| Sub-agent produces incorrect output | Rerun with a more specific spec. If the spec was unclear, fix the spec first |
| Sub-agent fails to complete | Check the error, fix the blocking issue, rerun. The spec and checkpoint make reruns cheap |
| External service is down | Checkpoint current state, retry later. Do not retry in a tight loop |
| Schema migration partially applied | Roll back if possible (down migration). If not, fix forward with a corrective migration |
| Context window exhausted mid-task | Checkpoint immediately, start a new session, resume from the checkpoint |
| Session timeout/disconnect | Resume from the last checkpoint in a new session |
| Two sub-agents produced conflicting changes | Leader reviews both, picks the better approach or merges manually |

The “Never Lose More Than One Step” Rule#

Design checkpoints so that any failure loses at most one step of work. If you checkpoint after every TODO item, the worst case is redoing one item. If you checkpoint only at phase boundaries, a failure mid-phase loses the entire phase.

Aggressive checkpointing (lose at most 1 item):
  [x] Item 1 → checkpoint
  [x] Item 2 → checkpoint
  [x] Item 3 → checkpoint
  [ ] Item 4 → FAILURE here → redo only item 4

Lazy checkpointing (lose entire phase):
  [x] Item 1
  [x] Item 2
  [x] Item 3
  [ ] Item 4 → FAILURE here → might redo items 1-4

Git commits serve as automatic checkpoints. Every commit creates a durable restore point: if a sub-agent corrupts a file, `git checkout -- path/to/file` restores it instantly.
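
The rule falls out naturally when the work loop itself checkpoints. A sketch, where `completeItem` and `writeCheckpoint` are assumed helpers rather than a real API:

```ts
// Process TODO items one at a time, checkpointing after each, so any
// failure costs at most the item in flight.
async function runPhase(
  items: string[],
  completeItem: (item: string) => Promise<void>, // does the work, commits the result
  writeCheckpoint: (note: string) => void,       // persists a re-entry point
): Promise<void> {
  for (const [i, item] of items.entries()) {
    try {
      await completeItem(item);
      writeCheckpoint(`${i + 1}/${items.length} complete: ${item}`);
    } catch (err) {
      // Everything up to the previous item is already persisted; only `item` is lost.
      writeCheckpoint(`FAILED at item ${i + 1}: ${item} (${String(err)})`);
      throw err; // surface the failure; a new session resumes from the last checkpoint
    }
  }
}
```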

Real-World Example: Multi-Hour Article Writing Project#

Here is how these patterns combine for a real project – writing 5 knowledge articles for a content platform:

Phase 1: Planning (15 min, leader only)

1. Read source material (design docs, existing articles)
2. Write article plan with titles, sections, source material
3. Create TODO.md with all 5 articles
4. Checkpoint: "Plan complete"
5. Commit plan

Phase 2: Writing (2 hours, parallel sub-agents)

1. Write spec for each article (what it covers, source material, differentiation)
2. Spawn 2-3 sub-agents with article specs
3. Collect drafts
4. Review each for quality and consistency
5. Update TODO as each article completes
6. Checkpoint after each batch: "3 of 5 articles complete"
7. Commit articles

Phase 3: Verification and Deploy (30 min, leader)

1. Build site to verify all articles compile
2. Sync content to database
3. Deploy
4. Verify articles appear in search
5. Final checkpoint: "Complete"
6. Commit

Total context used by leader: ~15K tokens across 3 hours (plans, specs, summaries, checkpoints). Total context used if done in a single ReAct loop without delegation: ~150K+ tokens (reading all source material, writing all articles, tracking all state in conversation history). The delegated approach uses 10x less leader context and produces the same result.

Orchestration Checklist#

Before starting a multi-hour workflow:

  • State machine defined. What are the phases? What triggers transitions?
  • CLAUDE.md written. Project conventions that every session needs.
  • TODO.md created. Phased task list with checkboxes.
  • Checkpoint strategy decided. How often? What goes in each checkpoint?
  • Sub-agent specs templated. What information does each spec need?
  • Failure recovery plan. What happens if a sub-agent fails? If a session times out?
  • Human checkpoints placed. Which phase transitions need human review?
  • Commit strategy. Commit after each checkpoint? After each phase?
  • Context budget estimated. How many tokens does the leader need per phase? Will it fit?

The upfront investment in this structure is 15-30 minutes. For a 3-hour project, this saves at least an hour of context reconstruction, re-derivation, and confusion. For a multi-day project, it is the difference between success and an incoherent mess.