Agent Debugging Patterns#

When an agent produces a wrong answer, the question is always the same: why did it do that? In traditional software you read a stack trace; with agents, the failure is buried in a chain of LLM decisions, tool calls, and context accumulation. Debugging agents requires specialized observability that captures not just what happened, but what the agent was thinking at each step.

Tracing Agent Decision Chains#

Every agent action follows a decision chain: the model reads its context, decides which tool to call (or whether to respond directly), processes the result, and decides again. To debug failures, you need to see this chain as a structured trace.

The Agent Trace#

Model each agent turn as a span in a distributed trace. The parent span is the user request. Child spans are individual LLM calls and tool invocations.

import json

from opentelemetry import trace

tracer = trace.get_tracer("agent")

MAX_TURNS = 10  # safety cap on agent turns

# build_context, call_llm, execute_tool, make_tool_result, make_assistant_message,
# and count_tokens are assumed to be defined elsewhere in the agent runtime.

async def agent_loop(user_message: str, session_id: str):
    with tracer.start_as_current_span("agent_request", attributes={
        "session.id": session_id,
        "user.message_length": len(user_message),
    }) as request_span:
        messages = build_context(user_message)

        for turn in range(MAX_TURNS):
            with tracer.start_as_current_span(f"llm_call_{turn}") as llm_span:
                llm_span.set_attribute("context.token_count", count_tokens(messages))
                response = await call_llm(messages)
                llm_span.set_attribute("response.has_tool_calls", bool(response.tool_calls))
                llm_span.set_attribute("response.finish_reason", response.finish_reason)

            if response.tool_calls:
                # Append the assistant turn before its tool results so the next
                # LLM call sees the full exchange (exact shape depends on your client).
                messages.append(make_assistant_message(response))
                for call in response.tool_calls:
                    with tracer.start_as_current_span(f"tool_{call.name}") as tool_span:
                        tool_span.set_attribute("tool.name", call.name)
                        tool_span.set_attribute("tool.params", json.dumps(call.arguments))
                        result = await execute_tool(call)
                        tool_span.set_attribute("tool.result_length", len(str(result)))
                        tool_span.set_attribute("tool.is_error", result.get("isError", False))
                        messages.append(make_tool_result(call.id, result))
            else:
                request_span.set_attribute("total_turns", turn + 1)
                return response.content

        # Fell through without a final answer: record it so the trace shows
        # the loop hit its turn cap rather than finishing cleanly.
        request_span.set_attribute("total_turns", MAX_TURNS)
        request_span.set_attribute("hit_turn_limit", True)

This trace tells you: how many turns the agent took, which tools it called at each turn, how large the context was at each LLM call, and whether any tool returned an error. When something goes wrong, you open the trace and walk the decision chain step by step.

Key Attributes to Capture#

For each LLM call: token count (input and output), finish reason (stop, tool_use, length), model used, latency.

For each tool call: tool name, input parameters, result size, error status, latency.

For the overall request: total turns, total tool calls, total tokens consumed, final response length, session ID.
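
To keep these consistent across call sites, it can help to centralize them in a small helper. The sketch below assumes the response object from the earlier loop and uses illustrative attribute names rather than any fixed convention.

def record_llm_attributes(span, response, model: str, input_tokens: int,
                          output_tokens: int, latency_ms: float):
    # Attribute names are illustrative; pick one convention and keep it
    # stable so queries and dashboards do not have to special-case spans.
    span.set_attribute("llm.model", model)
    span.set_attribute("llm.input_tokens", input_tokens)
    span.set_attribute("llm.output_tokens", output_tokens)
    span.set_attribute("llm.finish_reason", response.finish_reason)
    span.set_attribute("llm.latency_ms", round(latency_ms))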

Logging Tool Calls and Responses#

Structured logging is the foundation of agent debugging. Every tool invocation must be logged with enough detail to reproduce the issue without re-running the agent.

import json
import time

import structlog

logger = structlog.get_logger()

# redact_sensitive, sanitize_error, truncate, current_session_id, and the
# tool_registry mapping are assumed to be defined elsewhere.

async def execute_tool(call: ToolCall) -> dict:
    start = time.monotonic()

    logger.info("tool_call_start",
        tool=call.name,
        params=redact_sensitive(call.arguments),
        session_id=current_session_id(),
    )

    try:
        result = await tool_registry[call.name](**call.arguments)
        duration = time.monotonic() - start

        logger.info("tool_call_success",
            tool=call.name,
            duration_ms=round(duration * 1000),
            result_size=len(json.dumps(result)),
            result_preview=truncate(str(result), 200),
        )
        return result

    except Exception as e:
        duration = time.monotonic() - start
        logger.error("tool_call_error",
            tool=call.name,
            duration_ms=round(duration * 1000),
            error_type=type(e).__name__,
            error_message=sanitize_error(str(e)),
        )
        # Return an error result instead of raising, so the agent sees the
        # failure and can decide what to do next.
        return {"isError": True, "content": [{"type": "text", "text": str(e)}]}

Critical rules for agent logging:

Redact sensitive parameters. Tool calls may contain file paths with usernames, API endpoints with tokens, or database queries with credentials. Redact before logging.

Truncate large results. A tool that reads a 10,000-line file should not dump all 10,000 lines into the log. Log a preview (first 200 characters) and the full size.

Log the decision, not just the action. When possible, capture why the agent chose a particular tool. This is hard to extract from the model, but you can infer it from the sequence: if the agent called search followed by read_file, it was looking for something specific.
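
The redact_sensitive and truncate helpers referenced in the logging code are left undefined above. A minimal sketch, assuming dict parameters and simple key-name matching; the patterns are illustrative and should be extended for your environment.

import re

SENSITIVE_KEYS = {"token", "password", "secret", "api_key", "authorization"}

def redact_sensitive(params: dict) -> dict:
    # Replace values whose key names look sensitive; everything else passes through.
    redacted = {}
    for key, value in params.items():
        if any(marker in key.lower() for marker in SENSITIVE_KEYS):
            redacted[key] = "[REDACTED]"
        elif isinstance(value, str):
            # Scrub obvious bearer tokens embedded in strings (pattern is illustrative).
            redacted[key] = re.sub(r"Bearer\s+\S+", "Bearer [REDACTED]", value)
        else:
            redacted[key] = value
    return redacted

def truncate(text: str, limit: int) -> str:
    # Keep a preview plus an indication of how much was cut.
    if len(text) <= limit:
        return text
    return f"{text[:limit]}... [{len(text) - limit} more chars]"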

Identifying Hallucination Patterns#

Agent hallucinations in infrastructure contexts are especially dangerous because they look plausible. The agent might reference a file that does not exist, use a kubectl flag that is not real, or cite a configuration parameter that was never set. There are patterns you can watch for.

File Path Hallucination#

The agent references files it has not actually read. Detect this by comparing tool results against subsequent agent claims.

# extract_paths and extract_file_paths are assumed helpers that pull file paths
# out of tool results and model responses respectively (see the sketch below).

class HallucinationDetector:
    def __init__(self):
        self.files_read: set[str] = set()       # paths the agent actually read
        self.files_confirmed: set[str] = set()  # paths confirmed to exist via search/glob

    def on_tool_result(self, tool_name: str, params: dict, result: dict):
        if tool_name == "read_file" and not result.get("isError"):
            self.files_read.add(params["path"])
        if tool_name in ("search", "glob"):
            for path in extract_paths(result):
                self.files_confirmed.add(path)

    def check_response(self, response: str) -> list[str]:
        warnings = []
        mentioned_paths = extract_file_paths(response)
        for path in mentioned_paths:
            if path not in self.files_read and path not in self.files_confirmed:
                warnings.append(f"Agent mentions {path} but never read or confirmed it")
        return warnings
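
The detector leans on extract_file_paths, which is not shown. A rough regex-based sketch for Unix-style absolute paths (a heuristic, not a parser), with a commented example of how the detector plugs into the loop:

import re

PATH_PATTERN = re.compile(r"(?:/[\w.\-]+)+")

def extract_file_paths(text: str) -> set[str]:
    # Heuristic: absolute Unix-style paths mentioned in the response.
    # Relative and Windows-style paths would need additional patterns.
    return set(PATH_PATTERN.findall(text))

# Usage: feed tool results in as they arrive, then check the final answer.
# detector = HallucinationDetector()
# detector.on_tool_result("read_file", {"path": "/etc/nginx/nginx.conf"}, result)
# for warning in detector.check_response(final_response):
#     logger.warning("possible_hallucination", detail=warning)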

Command Hallucination#

The agent suggests or executes commands with flags or subcommands that do not exist. This happens when the agent generalizes from similar commands. Validate commands against known schemas before execution.

# Partial allowlist; extend it for the subcommands your agent actually uses.
KNOWN_KUBECTL_SUBCOMMANDS = {
    "get", "describe", "logs", "apply", "delete", "create",
    "edit", "patch", "rollout", "scale", "exec", "port-forward",
}

def validate_kubectl_command(args: list[str]) -> list[str]:
    warnings = []
    if args and args[0] not in KNOWN_KUBECTL_SUBCOMMANDS:
        warnings.append(f"Unknown kubectl subcommand: {args[0]}")
    return warnings
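
Subcommand checks catch only part of the problem; hallucinated flags are just as common. One way to extend the validator, assuming you maintain a per-subcommand flag allowlist (the lists below are partial examples, not a complete kubectl schema):

KNOWN_FLAGS_BY_SUBCOMMAND = {
    "get": {"-o", "--output", "-n", "--namespace", "-l", "--selector",
            "--all-namespaces", "-w", "--watch"},
    "logs": {"-f", "--follow", "-c", "--container", "--tail", "--since",
             "-n", "--namespace"},
}

def validate_kubectl_flags(args: list[str]) -> list[str]:
    warnings = []
    if not args:
        return warnings
    allowed = KNOWN_FLAGS_BY_SUBCOMMAND.get(args[0])
    if allowed is None:
        return warnings  # no flag schema for this subcommand; skip the check
    for arg in args[1:]:
        if arg.startswith("-"):
            flag = arg.split("=", 1)[0]  # handle --output=json style
            if flag not in allowed:
                warnings.append(f"Unknown flag for kubectl {args[0]}: {flag}")
    return warnings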

Confidence Decay#

Watch for hallucination signals that correlate with context window usage. As the context fills up, the model has less room for reasoning and is more likely to confabulate. Track the ratio of context used to context available at each turn.

def context_pressure(current_tokens: int, max_tokens: int) -> float:
    ratio = current_tokens / max_tokens
    if ratio > 0.85:
        logger.warning("high_context_pressure",
            ratio=round(ratio, 2),
            tokens_used=current_tokens,
            tokens_max=max_tokens,
        )
    return ratio
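
One way to wire this into the loop from earlier is a small wrapper that records the ratio on the LLM span each turn, so every decision in the trace carries the window utilization at that moment (context_pressure and count_tokens are the helpers shown or assumed above).

def record_context_pressure(span, messages: list, max_tokens: int) -> float:
    # Call at the top of each turn, right before the LLM call.
    ratio = context_pressure(count_tokens(messages), max_tokens)
    span.set_attribute("context.pressure", round(ratio, 2))
    return ratio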

Timeout and Retry Debugging#

Timeout failures are the hardest to debug because the evidence disappears – the operation was killed before it could report what went wrong.

Layered Timeout Tracking#

Agent systems have timeouts at multiple levels, and they interact in confusing ways. Track all of them.

from dataclasses import dataclass

@dataclass
class TimeoutContext:
    tool_timeout: float        # Individual tool execution limit
    turn_timeout: float        # Single agent turn limit
    session_timeout: float     # Total session wall-clock limit
    elapsed_session: float     # Time spent so far

    def remaining(self) -> dict:
        # Tool and turn budgets reset with each operation, so their remaining
        # time equals the limit; only the session budget is cumulative.
        return {
            "tool": self.tool_timeout,
            "turn": self.turn_timeout,
            "session": self.session_timeout - self.elapsed_session,
        }

When a timeout fires, log which layer triggered it and how much time remained at other layers. A tool timeout at 30 seconds is expected behavior. A session timeout at 5 minutes because the agent retried a failing tool 15 times is a design problem.
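
A sketch of that logging, reusing the TimeoutContext above; the layer label is whatever name your runtime gives the limit that fired.

def log_timeout(ctx: TimeoutContext, layer: str, operation: str):
    # Record which limit fired and how much headroom the other layers still had.
    # A tool timeout with plenty of session budget left is usually benign; a
    # session timeout with healthy per-tool numbers points at retry loops or
    # too many turns.
    logger.error("timeout_fired",
        layer=layer,
        operation=operation,
        remaining=ctx.remaining(),
    )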

Retry Loop Detection#

Agents can get stuck retrying the same failing operation. Detect this by tracking tool call patterns within a session.

class RetryDetector:
    def __init__(self, max_identical_calls: int = 3):
        self.call_history: list[tuple[str, str]] = []
        self.max_identical = max_identical_calls

    def check(self, tool_name: str, params: dict) -> bool:
        key = (tool_name, json.dumps(params, sort_keys=True))
        self.call_history.append(key)

        # Count identical recent calls
        recent = self.call_history[-self.max_identical:]
        if len(recent) == self.max_identical and len(set(recent)) == 1:
            logger.warning("retry_loop_detected",
                tool=tool_name,
                identical_calls=self.max_identical,
            )
            return True
        return False

This detector fires when the agent calls the same tool with the same parameters three times in a row. At that point, either the tool is broken or the agent is stuck. Either way, continuing the loop will not help.
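
What to do when the detector fires depends on the runtime; one low-risk option is to stop executing the call and hand the failure back to the model as a tool error so it can change strategy. A sketch of that integration, reusing the error shape from the logging example:

retry_detector = RetryDetector()

async def execute_tool_with_retry_guard(call: ToolCall) -> dict:
    # Short-circuit instead of running the same failing call yet again.
    if retry_detector.check(call.name, call.arguments):
        return {
            "isError": True,
            "content": [{"type": "text", "text": (
                f"{call.name} has been called with identical arguments "
                f"{retry_detector.max_identical} times. Try a different "
                "approach instead of retrying."
            )}],
        }
    return await execute_tool(call)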

Context Window Management Debugging#

Context window overflow is an invisible failure mode. The model silently loses information as earlier messages get truncated or summarized. Debug this by tracking what the agent can and cannot see.

Context Budget Tracking#

class ContextBudgetTracker:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.entries: list[dict] = []

    def add_entry(self, role: str, content: str, source: str):
        tokens = count_tokens(content)
        self.entries.append({
            "role": role,
            "source": source,
            "tokens": tokens,
            "timestamp": time.time(),
        })

    def report(self) -> dict:
        total = sum(e["tokens"] for e in self.entries)
        by_source = {}
        for e in self.entries:
            by_source.setdefault(e["source"], 0)
            by_source[e["source"]] += e["tokens"]

        return {
            "total_tokens": total,
            "max_tokens": self.max_tokens,
            "utilization": round(total / self.max_tokens, 2),
            "by_source": dict(sorted(by_source.items(), key=lambda x: -x[1])),
        }

The by_source breakdown tells you what is consuming context. If tool results account for 70% of context, your tools are returning too much data. If system instructions take 30%, they need trimming.
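
The report is most useful when it lands in the same place as everything else. A sketch that logs it once per turn and attaches the headline numbers to the current span (attribute names are illustrative):

def log_context_budget(tracker: ContextBudgetTracker, span=None):
    # Emit the full breakdown to the log; put the headline numbers on the span
    # so high-utilization sessions are easy to query for.
    report = tracker.report()
    logger.info("context_budget", **report)
    if span is not None:
        span.set_attribute("context.utilization", report["utilization"])
        span.set_attribute("context.total_tokens", report["total_tokens"])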

Context Eviction Logging#

When the agent runtime trims old messages to make room, log what was removed. Future debugging depends on knowing whether the agent still had access to a critical piece of information when it made a decision.

def evict_oldest_messages(messages: list, target_tokens: int) -> list:
    # Naive oldest-first eviction. In a real runtime, evict tool-call/result
    # pairs together so the model never sees an orphaned tool result.
    evicted = []
    while count_tokens(messages) > target_tokens and len(messages) > 2:
        removed = messages.pop(1)  # Keep system message (index 0)
        evicted.append({
            "role": removed["role"],
            "tokens": count_tokens([removed]),
            "preview": truncate(str(removed["content"]), 100),
        })

    if evicted:
        logger.info("context_eviction",
            messages_removed=len(evicted),
            tokens_freed=sum(e["tokens"] for e in evicted),
            previews=[e["preview"] for e in evicted],
        )

    return messages

Building a Debugging Dashboard#

Combine traces, logs, and metrics into a single debugging workflow. A practical setup uses three views.

Session timeline. A chronological view of one agent session showing each LLM call, tool invocation, and result. Click any step to see full inputs and outputs. This is your primary debugging tool for individual failures.

Aggregate metrics. Track across all sessions: average turns per task, tool error rates, timeout frequency, context utilization distribution, and hallucination detection rates. These reveal systemic issues – a tool that fails 30% of the time, a prompt that consistently leads to retry loops.

Anomaly detection. Flag sessions that deviate from normal patterns: unusually high turn counts, the same tool called more than five times, context utilization above 90%, or a tool error rate above the average across sessions. These outliers are where the bugs live.
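
A first pass at anomaly flagging can be a plain function over per-session summaries aggregated from the traces. The thresholds below mirror the ones above, except the turn-count cutoff, which is an assumed value to tune against your own traffic.

def flag_anomalies(session: dict, avg_tool_error_rate: float) -> list[str]:
    # session is a per-session summary built from traces, e.g.
    # {"turns": 12, "max_tool_repeats": 6,
    #  "context_utilization": 0.93, "tool_error_rate": 0.4}
    flags = []
    if session["turns"] > 10:  # assumed cutoff; tune per workload
        flags.append("high_turn_count")
    if session["max_tool_repeats"] > 5:
        flags.append("repeated_tool_calls")
    if session["context_utilization"] > 0.9:
        flags.append("context_near_limit")
    if session["tool_error_rate"] > avg_tool_error_rate:
        flags.append("elevated_tool_errors")
    return flags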

The investment in agent observability pays off immediately. Without it, debugging an agent means re-running the conversation and hoping you can reproduce the issue. With it, you open the trace and read exactly what happened.