Agent Error Handling#
Agents call tools that call APIs that talk to services that query databases. Every link in that chain can fail. The difference between a useful agent and a frustrating one is what happens when something breaks.
Classify the Failure First#
Before deciding how to handle an error, classify it. The strategy depends entirely on whether the failure is transient or permanent.
Transient failures will likely succeed on retry: network timeouts, rate limits (HTTP 429), server overload (HTTP 503), connection resets, temporary DNS failures. These are the majority of failures in practice.
Permanent failures will never succeed no matter how many times you retry: invalid input (HTTP 400), authentication failure (HTTP 401), resource not found (HTTP 404), permission denied (HTTP 403), schema validation errors.
Ambiguous failures require judgment: HTTP 500 could be a transient server bug or a persistent code issue. Timeouts could mean the service is slow (transient) or the request is too large (permanent). When in doubt, retry a small number of times and then treat as permanent.
def classify_error(error: Exception, status_code: int | None = None) -> str:
    # Classify by HTTP status first, then by exception type.
    if status_code in (400, 401, 403, 404, 422):
        return "permanent"
    if status_code in (429, 502, 503, 504):
        return "transient"
    if isinstance(error, (ConnectionResetError, TimeoutError)):
        return "transient"
    if isinstance(error, (ValueError, TypeError)):
        return "permanent"
    return "ambiguous"
Retry with Exponential Backoff and Jitter#
For transient failures, retry with increasing delays. Exponential backoff prevents hammering a struggling service. Jitter prevents thundering herds when multiple agents retry simultaneously.
import asyncio
import random

async def retry_with_backoff(
    func,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
):
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except Exception as e:
            classification = classify_error(e, getattr(e, "status_code", None))
            if classification == "permanent":
                raise  # No retry for permanent failures
            if attempt == max_retries:
                raise  # Exhausted retries
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
The TypeScript equivalent:
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000,
  maxDelay = 30000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (isPermanentError(error) || attempt === maxRetries) throw error;
      const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
      const jitter = Math.random() * delay * 0.5;
      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
  throw new Error("Unreachable");
}
Key constraints: cap the maximum delay (nobody should wait 5 minutes between retries), cap the total number of retries (3-5 is typical), and always have an escape hatch for permanent errors.
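In practice the helper wraps individual tool calls. A minimal usage sketch of the Python version, assuming a hypothetical fetch_issue coroutine from an API client (not defined above):
# Hypothetical usage: fetch_issue is an assumed async API-client call.
async def get_issue(issue_id: str) -> dict:
    return await retry_with_backoff(
        lambda: fetch_issue(issue_id),
        max_retries=4,    # hard cap on attempts
        base_delay=1.0,   # 1s, 2s, 4s, 8s before jitter
        max_delay=30.0,   # never wait more than ~30s between attempts
    )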
Circuit Breaker for External Calls#
When a service is down, retrying every request wastes time and adds load to an already struggling system. A circuit breaker stops calling after repeated failures and periodically tests whether the service has recovered.
Three states: closed (normal operation), open (all calls fail immediately), and half-open (one test call allowed to check recovery).
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected without being attempted."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"

    async def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"  # Allow one test call through
            else:
                raise CircuitOpenError(
                    f"Circuit open. Retry after {self.recovery_timeout}s."
                )
        try:
            result = await func()
            if self.state == "half-open":
                self.state = "closed"  # Test call succeeded; resume normal operation
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
Use one circuit breaker per external service, not per tool. If your GitHub API circuit is open, all GitHub-related tools should fail fast rather than each discovering the outage independently.
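One way to enforce that is a small registry keyed by service name, shared by every tool that touches the service. A sketch, not a prescribed API; the service name and github_api.list_prs call are illustrative:
# Module-level registry so all tools hitting the same service share one breaker.
_breakers: dict[str, CircuitBreaker] = {}

def breaker_for(service: str) -> CircuitBreaker:
    if service not in _breakers:
        _breakers[service] = CircuitBreaker(failure_threshold=5, recovery_timeout=60.0)
    return _breakers[service]

async def list_pull_requests():
    # Hypothetical GitHub-backed tool; github_api.list_prs is assumed, not defined above.
    return await breaker_for("github").call(github_api.list_prs)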
Graceful Degradation#
When a tool fails and retries are exhausted, the agent should not just stop. It should degrade gracefully – provide partial results, use cached data, or try an alternative approach.
Patterns that work:
- Fallback to cached data. If the API for fetching current deployment status fails, use the last known state and tell the user it may be stale.
- Alternative tool paths. If the structured search tool fails, fall back to a simpler grep-based search. The results may be less precise, but they are better than nothing.
- Partial completion. If the agent is processing 10 files and fails on the third, return results for the two that succeeded and report the failure on the third. Do not discard all work.
- Explicit user handoff. When the agent cannot proceed autonomously, give the user a clear summary of what failed, what it tried, and what the user can do manually.
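As a sketch of the partial-completion pattern, collect per-item outcomes instead of aborting the batch (process_file here is a hypothetical coroutine that raises on failure):
# Sketch of partial completion: individual failures do not discard completed work.
async def process_files(paths: list[str]) -> dict:
    succeeded, failed = [], []
    for path in paths:
        try:
            succeeded.append({"path": path, "result": await process_file(path)})
        except Exception as e:
            failed.append({"path": path, "error": str(e)})
    return {"succeeded": succeeded, "failed": failed}
The cached-data fallback, in turn, can look like this: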
async def get_deployment_status(env: str) -> dict:
    try:
        return await api_client.get_status(env)
    except ServiceUnavailableError:
        cached = await cache.get(f"deploy_status:{env}")
        if cached:
            return {**cached, "_stale": True, "_cached_at": cached["timestamp"]}
        return {
            "status": "unknown",
            "error": "Cannot reach deployment API. Check status manually.",
        }
Avoiding Infinite Retry Loops#
Agents operating autonomously can get stuck in retry loops, especially when the retry logic is implicit in the agent’s reasoning rather than coded explicitly. Three safeguards:
- Global attempt budget. Track total tool invocations per task. If an agent has called the same tool 10 times without progress, force it to stop and report.
- Timeout per task. Set a wall-clock deadline. Regardless of retry state, if 5 minutes have elapsed, stop and summarize what happened.
- Progress detection. After each retry, check whether the error changed. If you are getting the exact same error, additional retries are unlikely to help. Vary the approach instead.
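A minimal sketch of the attempt budget and progress detection, assuming a simple per-task tracker (the class name and limits are illustrative); the wall-clock deadline is just a time.monotonic() check alongside it:
# Sketch: per-task guard that caps calls per tool and flags repeated identical errors.
class ToolCallBudget:
    def __init__(self, max_calls_per_tool: int = 10):
        self.max_calls_per_tool = max_calls_per_tool
        self.calls: dict[str, int] = {}
        self.last_error: dict[str, str] = {}

    def record(self, tool: str, error: str | None = None) -> None:
        self.calls[tool] = self.calls.get(tool, 0) + 1
        if self.calls[tool] > self.max_calls_per_tool:
            raise RuntimeError(f"Attempt budget exhausted for {tool}; stop and report.")
        if error and error == self.last_error.get(tool):
            raise RuntimeError(f"Repeated identical error from {tool}; vary the approach.")
        if error:
            self.last_error[tool] = error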
Error Reporting to Users#
When an error reaches the user, include three things: what the agent was trying to do, what went wrong, and what the user can do about it.
Bad: "Error: request failed". The user learns nothing.
Bad: "HTTPError 503 at https://api.internal.corp/v2/deployments?env=prod&token=sk-abc123...". Leaks internal details and secrets.
Good: "Failed to fetch production deployment status. The deployment API returned a service unavailable error (HTTP 503). This is usually temporary. You can retry in a few minutes, or check the status manually at the deploy dashboard." The user knows the situation, the cause, and their options.
Sanitize errors before surfacing them. Strip URLs with tokens, internal hostnames, stack traces, and raw API responses. The user needs to understand what happened, not debug your infrastructure.
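A sanitizer can be a few regex passes applied before any error text reaches the user. A sketch with illustrative rather than exhaustive patterns; keep the raw error in your logs for debugging:
import re

def sanitize_error(message: str) -> str:
    # Strip URLs (which may embed tokens), credential-looking pairs, and API keys.
    message = re.sub(r"https?://\S+", "[internal URL removed]", message)
    message = re.sub(r"(token|key|secret|password)=\S+", r"\1=[redacted]", message, flags=re.I)
    message = re.sub(r"sk-[A-Za-z0-9]+", "[redacted]", message)
    # Drop anything that looks like a Python stack trace.
    return message.split("Traceback (most recent call last):")[0].strip()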