Multi-Agent Coordination#

A single agent can read files, call APIs, and reason about results. But some tasks are too broad, too slow, or too dangerous for one agent to handle alone. Debugging a production outage might require one agent analyzing logs, another checking infrastructure state, and a third reviewing recent deployments – simultaneously. Multi-agent coordination is how you split work across agents without them stepping on each other.

The hard part is not spawning multiple agents. The hard part is deciding which coordination pattern fits the task, how agents share information, and what happens when they disagree.

When to Use Multiple Agents#

Not every task benefits from multiple agents. The overhead of coordination – message passing, state synchronization, conflict resolution – can make a multi-agent setup slower than a single capable agent.

Use multiple agents when:

  • The task has independent subtasks that benefit from parallel execution (checking health across 20 services).
  • Different subtasks require different tool sets or permissions (a read-only analyzer and a write-capable deployer).
  • The task exceeds a single agent’s context window (analyzing a 500-file codebase).
  • You need separation of concerns for safety (one agent proposes changes, another reviews them).

Use a single agent when:

  • The task is sequential and each step depends on the previous result.
  • The context fits comfortably in one agent’s window.
  • Coordination overhead would exceed the time saved by parallelism.
  • The task requires deep, continuous reasoning across multiple steps.

Coordination Patterns#

Leader-Follower (Orchestrator)#

One agent acts as the leader. It decomposes the task, assigns subtasks to follower agents, collects results, and synthesizes a final answer. Followers execute their assigned work and report back.

Leader Agent
  |-- assigns --> Follower A (check pod status)
  |-- assigns --> Follower B (analyze error logs)
  |-- assigns --> Follower C (review recent commits)
  |-- collects results from A, B, C
  |-- synthesizes final report

When to use: Most general-purpose multi-agent tasks. Works well when one agent needs to maintain the big picture while others handle details.

Implementation sketch:

import asyncio

class LeaderAgent:
    async def execute(self, task: str):
        # Decompose the task into subtasks (decompose, get_follower, and
        # synthesize are assumed helpers, elided from this sketch)
        subtasks = await self.decompose(task)

        # Assign to followers and run them in parallel
        results = await asyncio.gather(*[
            self.assign_to_follower(subtask)
            for subtask in subtasks
        ])

        # Synthesize the followers' results into a final answer
        return await self.synthesize(task, results)

    async def assign_to_follower(self, subtask: dict) -> dict:
        # Pick a follower whose tool set matches the subtask's requirements
        follower = self.get_follower(subtask["required_tools"])
        result = await follower.execute(subtask["description"])
        return {"subtask": subtask, "result": result, "status": "complete"}

Risks: The leader is a single point of failure. If it decomposes the task poorly, all followers do useless work. Mitigate by having the leader validate decomposition before dispatching.

Fan-Out / Fan-In#

A special case of leader-follower optimized for embarrassingly parallel work. One agent fans out the same operation across many targets, then a merge step combines the results.

Fan-Out: "Check health of each service"
  --> Agent 1: check service-a  --> healthy
  --> Agent 2: check service-b  --> unhealthy (503)
  --> Agent 3: check service-c  --> healthy
  --> Agent 4: check service-d  --> timeout

Fan-In: Aggregate results
  --> "3 healthy, 1 unhealthy (service-b: 503), 1 timeout (service-d)"

When to use: Homogeneous tasks across many targets – health checks, log analysis across services, configuration audits, vulnerability scans across repositories.

Key decisions:

  • Partial failure handling. If 3 of 20 agents fail, do you retry, skip, or abort? For health checks, skip and report. For deployments, abort.
  • Result aggregation. Define the merge strategy before fanning out. Sum, union, majority vote, or structured report.
  • Concurrency limits. Do not fan out to 100 agents simultaneously if they all hit the same API. Use a semaphore or rate-limited pool, as in the sketch below.

Implementation sketch (create_agent is an assumed factory that returns an agent scoped to the listed tools):

import asyncio

async def fan_out_fan_in(targets: list[str], operation: str, max_concurrent: int = 10):
    # The semaphore caps how many agents run at once
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_execute(target):
        async with semaphore:
            agent = create_agent(tools=["kubectl", "curl"])
            try:
                result = await agent.execute(f"{operation} for {target}")
                return {"target": target, "status": "success", "result": result}
            except Exception as e:
                return {"target": target, "status": "error", "error": str(e)}

    results = await asyncio.gather(*[bounded_execute(t) for t in targets])

    # Fan-in: split successes from failures for the aggregate report
    successes = [r for r in results if r["status"] == "success"]
    failures = [r for r in results if r["status"] == "error"]
    return {"successes": successes, "failures": failures, "total": len(targets)}

Pipeline (Sequential Handoff)#

Agents form a chain. Each agent performs its step and passes the result to the next. This is appropriate when the task has a natural sequence and each step transforms the data.

Agent A (analyze) --> Agent B (plan) --> Agent C (execute) --> Agent D (verify)

When to use: Multi-stage workflows where each stage requires different expertise or permissions. Example: one agent writes a Terraform plan, a second reviews it for security issues, a third applies it, and a fourth verifies the infrastructure state.

Key design rule: Define the handoff contract between stages explicitly. Agent A’s output schema must match Agent B’s expected input. If Agent A returns free-form text and Agent B expects structured JSON, the pipeline breaks silently.

pipeline:
  - stage: analyze
    agent: analyzer
    output_schema: { type: object, properties: { issues: { type: array }, severity: { type: string } } }
  - stage: plan
    agent: planner
    input_from: analyze
    output_schema: { type: object, properties: { actions: { type: array }, rollback_plan: { type: string } } }
  - stage: execute
    agent: executor
    input_from: plan
    requires_approval: true
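
One way to enforce the handoff contract is to validate each stage's output against its declared schema before handing it to the next stage. The sketch below assumes the jsonschema package plus hypothetical run_stage and wait_for_human_approval helpers, and it simply threads each stage's output to the next (ignoring input_from for brevity):

import jsonschema

async def run_pipeline(stages: list[dict], initial_input: dict) -> dict:
    data = initial_input
    for stage in stages:
        # run_stage is an assumed helper that invokes the stage's agent
        output = await run_stage(stage["agent"], data)

        # Enforce the handoff contract: fail loudly here instead of letting
        # the next stage misread malformed output
        if "output_schema" in stage:
            jsonschema.validate(instance=output, schema=stage["output_schema"])

        # Gate side-effecting stages behind an assumed human-approval hook
        if stage.get("requires_approval"):
            await wait_for_human_approval(stage["stage"], output)

        data = output
    return data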

Peer Consensus#

Multiple agents independently analyze the same input, then compare conclusions. Useful when accuracy matters more than speed and you want to catch errors that a single agent might make.

When to use: High-stakes decisions – security reviews, production change approvals, root cause analysis where a wrong answer is expensive.

import asyncio

async def consensus_analysis(task: str, num_agents: int = 3, threshold: float = 0.66):
    # Vary temperature slightly so the agents do not reason identically.
    # create_agent is an assumed factory; each result is assumed to be a dict
    # whose "conclusion" field is a short string that can be compared directly.
    agents = [create_agent(temperature=0.3 + i * 0.1) for i in range(num_agents)]

    results = await asyncio.gather(*[agent.execute(task) for agent in agents])

    # Check agreement: the most common conclusion and its share of the votes
    conclusions = [r["conclusion"] for r in results]
    most_common = max(set(conclusions), key=conclusions.count)
    agreement = conclusions.count(most_common) / len(conclusions)

    if agreement >= threshold:
        return {"conclusion": most_common, "confidence": agreement, "details": results}
    else:
        return {"conclusion": None, "confidence": agreement,
                "conflict": "Agents disagreed — escalate to human review", "details": results}

Task Decomposition#

The leader agent’s first job is breaking the task into subtasks. Bad decomposition wastes every follower’s effort.

Decompose by independence. Each subtask should be completable without waiting for another subtask’s result. If subtask B needs the output of subtask A, they are not truly parallel – put them in a pipeline.

Decompose by tool requirements. Group work by the tools needed. One agent with kubectl access checks cluster state. Another with GitHub access reviews PRs. This minimizes the permission surface per agent.

Decompose by scope. For large codebases, split by directory, service, or module. Each agent analyzes its portion. The leader merges results.

Validate before dispatching. Have the leader check: Are the subtasks collectively exhaustive (covering the full task)? Are they mutually exclusive (no duplicate work)? Can each subtask be completed with the assigned tools?
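
A lightweight version of this validation can be structural. The sketch below checks tool availability and duplicate subtasks; coverage ("collectively exhaustive") is harder to verify mechanically, so in practice the leader asks the model to critique its own decomposition against the original task.

def validate_decomposition(subtasks: list[dict], available_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the decomposition looks dispatchable."""
    problems = []

    # Each subtask must be completable with tools the leader can actually grant
    for subtask in subtasks:
        missing = set(subtask.get("required_tools", [])) - available_tools
        if missing:
            problems.append(f"'{subtask['description']}' needs unavailable tools: {missing}")

    # Mutually exclusive: flag obviously duplicated subtasks
    descriptions = [subtask["description"] for subtask in subtasks]
    if len(descriptions) != len(set(descriptions)):
        problems.append("Duplicate subtasks detected")

    return problems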

Shared State Management#

When agents work on related subtasks, they often need to share state. Who has already checked service-a? What did the log analysis find? Where are we in the deployment?

Shared state store. A key-value store (Redis, a shared file, a database) where agents read and write state. Each agent checks the store before starting work and writes results when done.

import json

class SharedState:
    def __init__(self, store):
        # store is assumed to expose async set, get, keys, and set_if_not_exists,
        # e.g. a thin wrapper around Redis
        self.store = store

    async def claim_task(self, task_id: str, agent_id: str) -> bool:
        """Atomic claim — returns True if this agent got the task."""
        return await self.store.set_if_not_exists(f"task:{task_id}:owner", agent_id)

    async def report_result(self, task_id: str, result: dict):
        await self.store.set(f"task:{task_id}:result", json.dumps(result))
        await self.store.set(f"task:{task_id}:status", "complete")

    async def get_completed(self) -> list[dict]:
        # Collect the results of every task whose status is "complete"
        keys = await self.store.keys("task:*:status")
        return [json.loads(await self.store.get(k.replace(":status", ":result")))
                for k in keys if await self.store.get(k) == "complete"]

Message passing. Agents communicate through a message queue. No shared mutable state. Each agent sends messages about what it found and receives messages about what others found. Cleaner than shared state but higher latency.
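
As an in-process illustration, the same idea can be sketched with asyncio queues: each agent consumes subtasks from one queue and publishes findings to another, with no shared mutable state. A production setup would more likely use a broker such as Redis Streams or SQS; run_agent is an assumed helper here.

import asyncio

async def worker(agent_id: str, tasks: asyncio.Queue, findings: asyncio.Queue):
    while True:
        task = await tasks.get()
        if task is None:  # sentinel value: no more work for this worker
            tasks.task_done()
            break
        # run_agent is an assumed helper that executes one subtask
        result = await run_agent(agent_id, task)
        await findings.put({"agent": agent_id, "task": task, "result": result})
        tasks.task_done()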

Append-only log. Each agent appends its findings to a shared log. No agent modifies another’s entries. The leader reads the full log to synthesize results. Simple, conflict-free, but the log can grow large.
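
A file-backed sketch of the append-only log, assuming every agent can see the same path: each agent appends one JSON line per finding and never edits existing entries, and the leader reads the whole file when synthesizing.

import json
import time

LOG_PATH = "findings.jsonl"  # assumed shared path visible to every agent

def append_finding(agent_id: str, finding: dict) -> None:
    entry = {"agent": agent_id, "time": time.time(), "finding": finding}
    # One JSON object per line; agents only ever append
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

def read_all_findings() -> list[dict]:
    # The leader reads the full log to synthesize results
    with open(LOG_PATH) as f:
        return [json.loads(line) for line in f if line.strip()]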

Conflict Resolution#

When two agents try to modify the same resource – editing the same file, updating the same configuration, deploying to the same environment – you need conflict resolution.

Lock-based. An agent acquires a lock on a resource before modifying it. Other agents wait or skip. Simple but can cause deadlocks if agents hold locks while waiting for other locks.
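
A minimal sketch of per-resource locking with a timeout, so an agent gives up rather than deadlocking while it waits. The in-process dict of asyncio locks is a stand-in; a multi-process deployment would keep the locks in the shared state store instead.

import asyncio
from contextlib import asynccontextmanager

_locks: dict[str, asyncio.Lock] = {}  # in-process stand-in for a distributed lock table

@asynccontextmanager
async def resource_lock(resource: str, timeout: float = 30.0):
    lock = _locks.setdefault(resource, asyncio.Lock())
    # Give up instead of waiting forever, which avoids deadlock when this
    # agent already holds other locks
    await asyncio.wait_for(lock.acquire(), timeout=timeout)
    try:
        yield
    finally:
        lock.release()

# Usage: modify the file only while holding its lock
# async with resource_lock("config/deploy.yaml"):
#     await agent.execute("update the replica count")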

Last-writer-wins. Accept all modifications. The last one overwrites previous ones. Only appropriate when overwrites are safe (updating a status field, not merging code changes).

Merge with human review. Both agents produce their modifications. A merge step (automated or human) combines them. This is the safest approach for code changes and configuration updates.

Priority-based. Assign priority levels to agents. When conflicts arise, the higher-priority agent’s result takes precedence. Works for alerting systems where a security agent’s findings override a performance agent’s recommendations.

Choosing the Right Pattern#

Scenario                             | Pattern                   | Reason
Health check across 50 services      | Fan-out/fan-in            | Homogeneous, embarrassingly parallel
Debug a production incident          | Leader-follower           | Need one agent to maintain context and direct investigation
Deploy with safety review            | Pipeline                  | Natural sequence: plan, review, apply, verify
Security audit of a critical change  | Peer consensus            | Multiple independent reviews catch more issues
Codebase-wide refactoring            | Leader-follower + fan-out | Leader plans the refactoring, fans out file changes to followers
Incident postmortem analysis         | Leader-follower           | One agent gathers timelines, logs, and changes; synthesizes narrative

Start with the simplest pattern that fits. Leader-follower handles most cases. Only add complexity (consensus, pipelines) when the task structure demands it. The coordination overhead should always be justified by measurable improvement in speed, accuracy, or safety.