Detecting Infrastructure Knowledge Gaps

The most dangerous thing an agent can do is confidently produce a deliverable based on wrong assumptions. An agent that assumes x86_64 when the target is ARM64, that assumes PostgreSQL 14 behavior when the target runs 15, or that assumes AWS IAM patterns when the target is Azure – that agent produces a runbook that will fail in ways the human did not expect and may not understand.

The vision document for infrastructure agent platforms identifies this explicitly: home users “don’t know what they don’t know.” The platform catches gotchas for them. But agents themselves have the same problem. An agent trained on predominantly x86_64 infrastructure content will assume x86_64 unless it actively checks. An agent that learned PostgreSQL patterns before version 15 will assume the old permission model unless something triggers a version check.

This article covers how agents can systematically detect their own knowledge gaps before those gaps become production failures.

Common Knowledge Gaps

These are the blind spots that agents hit most frequently when working with infrastructure. Each one is a place where reasonable-sounding defaults are wrong for a significant fraction of real environments.

Architecture: ARM64 Is Not an Edge Case

Most agent training data assumes x86_64. Most example Dockerfiles, Helm charts, and deployment guides were written for x86_64. When an agent recommends a container image, it almost never checks whether an ARM64 variant exists.

But ARM64 is not niche. Apple Silicon Macs (M1 through M4) are ARM64. AWS Graviton instances are ARM64. Many home servers and Raspberry Pi clusters are ARM64. An agent that recommends mattermost/mattermost-team-edition:latest to someone running minikube on an M4 Mac has handed them an image that will crash with the QEMU lfstack error – a fatal Go runtime failure with no workaround short of building a native ARM64 image.

The fix is not to always recommend ARM64 images. The fix is to always ask what architecture the target runs.

Database Version: PostgreSQL 15 Changed the Rules

PostgreSQL 15 changed the default permissions on the public schema. Before 15, any role could create tables in the public schema by default, so GRANT ALL PRIVILEGES ON DATABASE appeared sufficient for an application user. From 15 on, CREATE on the public schema is no longer granted to PUBLIC: the schema must be owned by the application user, or the user must be granted CREATE permission on it explicitly.

An agent that generates a database init script without checking the PostgreSQL version will produce a script that works on 14 and fails on 15+. The error message – permission denied for schema public – gives no hint about the version-dependent behavior. A human following the agent’s runbook will spend hours debugging a problem the agent should have anticipated.
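
The version-dependent grant logic can be sketched as a small helper that emits the right SQL for the detected major version. The database name appdb and user appuser are hypothetical placeholders:

```shell
# Emit init-script SQL appropriate for the detected PostgreSQL major version.
# "appdb" and "appuser" are hypothetical placeholder names.
pg_grant_sql() {
  major="$1"  # PostgreSQL major version, e.g. 14 or 15
  echo "GRANT ALL PRIVILEGES ON DATABASE appdb TO appuser;"
  # From 15 on, CREATE on the public schema is no longer granted to PUBLIC,
  # so the application user needs explicit rights on the schema itself.
  if [ "$major" -ge 15 ]; then
    echo "GRANT CREATE ON SCHEMA public TO appuser;"
  fi
}

pg_grant_sql 15
```

Piping the output into psql -U postgres -d appdb applies the grants; on 14 and earlier the extra schema grant is simply omitted.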

Cloud IAM: Every Provider Is Different

AWS IAM, Azure RBAC, and GCP IAM are superficially similar and fundamentally different. An agent that learned IAM patterns on AWS might suggest creating an IAM role with a trust policy and attaching it to a service account. On Azure, the equivalent involves Azure Active Directory, managed identities, and federated credentials – a completely different set of resources and API calls.

Worse, the terminology overlaps. “Role” means different things on each cloud. “Service account” means different things. An agent that uses AWS mental models to produce Azure configurations will generate something that looks plausible but does not work.

Networking: Defaults Vary Everywhere

Kubernetes networking defaults differ by distribution and cloud provider:

  • minikube uses a single-node network with no real load balancer. LoadBalancer services hang in Pending unless minikube tunnel is running.
  • EKS provisions AWS ALBs for LoadBalancer services but requires the AWS Load Balancer Controller to be installed.
  • AKS provisions Azure Load Balancers by default but has specific annotations for internal versus external.
  • GKE provisions GCP load balancers with different health check behaviors than other clouds.
  • k3s includes Traefik by default, which conflicts if you also install nginx ingress.

An agent that says “create a LoadBalancer service” without knowing the target distribution will produce something that works on one platform and fails silently on another.
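
One concrete readiness probe follows from this: check whether a LoadBalancer service ever received an external address. This sketch takes the address as an argument; the service name myapp is a placeholder, and note that on AWS the status field is .hostname rather than .ip:

```shell
# Report whether a LoadBalancer service actually received an address.
# Takes the value of .status.loadBalancer.ingress[0].ip, e.g. from:
#   kubectl get svc myapp -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# (on AWS, query .hostname instead of .ip)
lb_status() {
  if [ -z "$1" ]; then
    echo "PENDING: no external address (missing controller, or minikube tunnel not running)"
  else
    echo "READY: external address $1"
  fi
}

lb_status ""            # pending, e.g. minikube without a tunnel
lb_status "203.0.113.9"
```

A runbook step that loops on this check, with a timeout, fails loudly instead of silently hanging on the distributions where no load balancer will ever appear.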

Helm Chart Naming: The Bitnami Trap

Helm charts do not all name resources the same way. Bitnami's common helpers normally produce release-chartname (a release named myrelease yields a service named myrelease-postgresql), but collapse to the bare release name when the release name already contains the chart name; many other charts follow different conventions entirely. An agent that assumes one convention will be right for some charts and wrong for others, in either direction.

The service name matters because it goes into database connection strings, health check targets, and inter-service communication. A wrong service name means connection timeouts that are hard to trace.
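
When a live cluster is available, the real name is one query away (kubectl get svc -n the target namespace is the ground truth). Offline, the agent can model the fullname rule that Bitnami's common helpers implement; a sketch, with illustrative release and chart names:

```shell
# Model the common Helm "fullname" rule used by Bitnami's helpers:
# if the release name already contains the chart name, use the release
# name alone; otherwise join them as release-chartname.
helm_fullname() {
  release="$1"; chart="$2"
  case "$release" in
    *"$chart"*) echo "$release" ;;
    *)          echo "${release}-${chart}" ;;
  esac
}

helm_fullname myrelease postgresql   # myrelease-postgresql
helm_fullname postgresql postgresql  # postgresql
```

Any modeled name should still be confirmed against the cluster before it goes into a connection string.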

Detection Strategies

Knowing the common gaps is necessary but not sufficient. The agent needs systematic strategies for detecting which gaps apply to the current task.

Strategy 1: Check Architecture Before Recommending Images

Before recommending any container image, check the target architecture. This is a concrete, automatable check.

# In a sandbox or on the target system
uname -m
# Returns: x86_64, aarch64 (Linux), or arm64 (macOS)

# Or check Kubernetes node architecture
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Returns: amd64 or arm64

If the target is ARM64, the agent must verify that every recommended image has an ARM64 variant. For multi-arch images, this is automatic. For images without ARM64 support, the agent must find an alternative or document that a custom build is required.

# Check if an image supports arm64
docker manifest inspect nginx:latest | jq '[.manifests[].platform.architecture]'
# ["amd64", "arm64", ...] -- good, multi-arch

docker manifest inspect mattermost/mattermost-team-edition:latest | jq '[.manifests[].platform.architecture]'
# ["amd64"] -- problem on ARM64

This check should happen before the agent writes a single line of YAML. Discovering an architecture mismatch after generating an entire deployment is wasted effort.

Strategy 2: Verify Database Versions Before Assuming Behaviors

Different database versions have different default behaviors. The agent should check versions before generating configuration.

# PostgreSQL version
psql -U postgres -c "SELECT version();"
# Or from Kubernetes
kubectl exec -it db-postgresql-0 -n myns -- psql -U postgres -c "SELECT version();"

# MySQL version
mysql -u root -e "SELECT VERSION();"

# Redis version
redis-cli INFO server | grep redis_version

For PostgreSQL, the version determines whether the agent needs to include ALTER SCHEMA public OWNER TO in init scripts. For MySQL, it determines whether the mysql_native_password plugin is usable (deprecated since 8.0, disabled by default in 8.4, removed in 9.0). For Redis, it determines whether ACL commands are available (6.0+).

Version checks should trigger specific adjustments in the generated runbook:

PostgreSQL version >= 15:
  -> Add ALTER SCHEMA public OWNER TO in initdb script
  -> Use .sh init script format (not .sql) to avoid \c metacommand issues

PostgreSQL version >= 16:
  -> Additionally check for pg_read_all_data role changes

MySQL version >= 8.4:
  -> Use caching_sha2_password instead of mysql_native_password
  -> Some legacy clients may need authentication plugin configuration
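
To drive these branches, the agent first needs the major version as a number. A minimal sketch of extracting it from the `SELECT version();` output shown earlier:

```shell
# Extract the PostgreSQL major version number from "SELECT version();" output.
pg_major() {
  echo "$1" | sed -n 's/.*PostgreSQL \([0-9][0-9]*\)\..*/\1/p'
}

ver="PostgreSQL 15.4 on aarch64-unknown-linux-gnu, compiled by gcc"
if [ "$(pg_major "$ver")" -ge 15 ]; then
  echo "apply 15+ schema permission adjustments"
fi
```

The same number feeds every version gate in the runbook, so it is worth extracting once and threading through rather than re-parsing per check.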

Strategy 3: Probe for Existing Infrastructure

Before proposing new resources, check what already exists. An agent that deploys a second ingress controller because it did not check for an existing one creates conflicts that are difficult to debug.

# Check for existing ingress controllers
kubectl get ingressclass
kubectl get pods -A | grep -i ingress

# Check for existing storage classes
kubectl get storageclass

# Check for existing cert-manager
kubectl get pods -A | grep cert-manager

# Check for existing service mesh
kubectl get pods -A | grep -E "istio|linkerd|consul"

# Check for existing monitoring
kubectl get pods -A | grep -E "prometheus|grafana"

Each discovery changes the runbook. If cert-manager is already running, the runbook should reference existing ClusterIssuers rather than installing cert-manager from scratch. If a service mesh is present, the runbook needs to account for sidecar injection and mTLS.
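
That branching can be sketched as a pure function of the pod listing, which keeps it testable without a live cluster. The emitted follow-up commands are illustrative:

```shell
# Given the text of `kubectl get pods -A`, emit the follow-up probes
# the discoveries call for. Suggestions are illustrative sketches.
followup_probes() {
  listing="$1"
  echo "$listing" | grep -q  'cert-manager'          && echo "kubectl get clusterissuers"
  echo "$listing" | grep -qE 'istio|linkerd|consul'  && echo "check sidecar injection and mTLS policy"
  echo "$listing" | grep -qE 'prometheus|grafana'    && echo "reuse existing monitoring stack"
  true  # no matches is a valid outcome, not an error
}

followup_probes "$(kubectl get pods -A 2>/dev/null)"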

Strategy 4: Verify Cloud Provider Before Assuming Patterns

When working with Kubernetes on a cloud provider, the agent should detect which cloud it is running on and adjust accordingly.

# Detect cloud provider from node labels
kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'
# aws:///us-east-1a/i-0abc123  -> AWS
# azure:///subscriptions/...    -> Azure
# gce:///projects/...           -> GCP

# Or check for cloud-specific components
kubectl get pods -n kube-system | grep -E "aws-node|azure-cni|gke-metadata"

Once the cloud is identified, the agent applies cloud-specific patterns for IAM, load balancers, storage, and networking. A runbook generated for EKS should use IRSA (IAM Roles for Service Accounts), not generic Kubernetes service account tokens. A runbook for AKS should use Azure Workload Identity, not IRSA.
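
The branch from providerID to identity pattern can be sketched as a small mapping; the prefixes follow the providerID formats shown above, and the unknown case deliberately refuses to guess:

```shell
# Map a node's spec.providerID to a cloud, and from there to the pod
# identity mechanism the runbook should use.
cloud_from_provider_id() {
  case "$1" in
    aws://*)   echo "aws: use IRSA for pod identity" ;;
    azure://*) echo "azure: use Azure Workload Identity" ;;
    gce://*)   echo "gcp: use GKE Workload Identity" ;;
    *)         echo "unknown: ask the human before assuming an IAM model" ;;
  esac
}

cloud_from_provider_id "aws:///us-east-1a/i-0abc123"
```

Falling through to "unknown" and asking is the whole point: a wrong IAM model produces plausible-looking configuration that cannot work.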

The Assumption Audit Pattern

Before executing any step in a runbook, the agent should perform an assumption audit – an explicit listing of every assumption being made, followed by verification of each one.

## Assumption Audit: Deploy Redis Cluster

| # | Assumption | How to Verify | Result |
|---|---|---|---|
| 1 | Target architecture is amd64 | `kubectl get nodes -o jsonpath='{..architecture}'` | **amd64** -- confirmed |
| 2 | Kubernetes version is 1.28+ | `kubectl version` | **1.29.2** -- confirmed |
| 3 | Redis image supports target arch | `docker manifest inspect redis:7.2` | **multi-arch** -- confirmed |
| 4 | StorageClass "gp2" exists | `kubectl get sc gp2` | **NOT FOUND** -- gp3 available |
| 5 | No existing Redis deployment | `kubectl get pods -A -l app=redis` | **None found** -- confirmed |
| 6 | Network policies allow pod-to-pod | `kubectl get networkpolicies -n target` | **deny-all exists** -- must add allow rule |
| 7 | Namespace "cache" exists | `kubectl get ns cache` | **NOT FOUND** -- must create |

This audit caught two wrong assumptions: the storage class name and the existence of a deny-all network policy. Without the audit, the agent would have generated a runbook that fails at the PVC creation step (wrong StorageClass) and silently produces a Redis instance that no other pod can reach (blocked by network policy).

The assumption audit should be a standard prefix to runbook generation, not an optional step. The cost is a few extra API calls or kubectl commands. The benefit is avoiding runbooks that fail for preventable reasons.
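
A minimal sketch of an audit runner: each row of the table becomes one call, and any failure blocks runbook generation. The example checks are placeholders for the real kubectl probes:

```shell
# Run one audit check: print OK/FAIL for a description plus its
# verification command, and remember any failure in AUDIT_FAILED.
audit() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $desc"
  else
    echo "FAIL $desc"
    AUDIT_FAILED=1
  fi
}

# Example checks (placeholders for the table's kubectl probes):
audit "StorageClass gp2 exists" kubectl get sc gp2
audit "Namespace cache exists"  kubectl get ns cache

[ -z "$AUDIT_FAILED" ] || echo "Audit failed: fix assumptions before generating the runbook"
```

Printing every result, pass or fail, matters: the OK lines are the evidence trail that the audit actually ran.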

Building a Pre-Flight Checklist

A pre-flight checklist standardizes the assumption audit into a reusable template. Before generating any infrastructure deliverable, the agent runs through this checklist.

preflight_checklist:
  environment:
    - check: "Target OS and architecture"
      command: "uname -sm"
      why: "Determines image compatibility, build requirements"

    - check: "Kubernetes version"
      command: "kubectl version 2>/dev/null || echo 'Not available'"
      why: "API deprecations, feature availability"

    - check: "Kubernetes distribution"
      command: "kubectl get nodes -o jsonpath='{.items[0].spec.providerID}'"
      why: "Cloud-specific behaviors, IAM model, networking"

  cluster_state:
    - check: "Available StorageClasses"
      command: "kubectl get sc -o jsonpath='{.items[*].metadata.name}'"
      why: "PVC requests must reference existing StorageClasses"

    - check: "Existing ingress controllers"
      command: "kubectl get ingressclass -o jsonpath='{.items[*].metadata.name}'"
      why: "Avoid deploying duplicate ingress controllers"

    - check: "Namespace network policies"
      command: "kubectl get networkpolicies -n TARGET_NS"
      why: "Deny-all policies block all traffic unless allow rules added"

    - check: "Existing deployments in target namespace"
      command: "kubectl get all -n TARGET_NS"
      why: "Avoid resource naming conflicts"

  database:
    - check: "Database engine and version"
      command: "psql -c 'SELECT version();' OR mysql -e 'SELECT VERSION();'"
      why: "Version-specific permission models, deprecated features"

    - check: "Existing schemas and tables"
      command: "psql -c '\\dt' OR mysql -e 'SHOW TABLES;'"
      why: "Avoid schema conflicts, understand migration state"

  security:
    - check: "Pod Security Standards in effect"
      command: "kubectl get ns TARGET_NS -o jsonpath='{.metadata.labels}' | grep pod-security"
      why: "Restricted PSS blocks privileged containers, host networking, etc."

    - check: "RBAC constraints"
      command: "kubectl auth can-i --list -n TARGET_NS"
      why: "Determines what the agent/user can actually deploy"

    - check: "Secret management approach"
      command: "kubectl get pods -A | grep -E 'vault|sealed-secrets|external-secrets'"
      why: "Determines how secrets should be provided (raw, sealed, external)"

An agent running this checklist before generating a runbook gathers enough information to avoid the most common failure modes. The checklist takes seconds to execute but prevents hours of debugging.
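
A checklist in this shape can be executed by a small runner. This sketch reads flat `description|command` pairs rather than parsing YAML, which keeps it dependency-free; the sample checks are illustrative:

```shell
# Run "description|command" checks line by line and print a report.
run_preflight() {
  while IFS='|' read -r desc cmd; do
    result=$(eval "$cmd" 2>/dev/null) || result="(check failed)"
    printf '%-40s %s\n' "$desc" "$result"
  done
}

run_preflight <<'EOF'
Target OS and architecture|uname -sm
Current kube-context|kubectl config current-context
EOF
```

A failed check prints "(check failed)" instead of aborting, so the report always covers every item and the agent can reason about the gaps as a whole.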

Practical Example: An Agent Catches Its Own Wrong Assumption

Here is a concrete scenario showing the gap detection workflow in action.

A human asks: “Set up Mattermost on my home Kubernetes cluster.”

Without gap detection, the agent would:

  1. Recommend the official Mattermost Helm chart
  2. Use the default mattermost/mattermost-team-edition:latest image
  3. Deploy PostgreSQL as a dependency
  4. Generate a runbook

The human would then discover that Mattermost crashes on startup with runtime: lfstack.push invalid packing – the QEMU/Go/ARM64 incompatibility – and spend a weekend debugging it.

With gap detection, the agent runs the pre-flight checklist first:

Pre-flight check: Target architecture
> kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'
> Result: arm64

Pre-flight check: Image compatibility
> docker manifest inspect mattermost/mattermost-team-edition:latest
> Result: amd64 ONLY -- no arm64 variant

KNOWLEDGE GAP DETECTED: The recommended image does not support the target
architecture. Mattermost is written in Go, and Go binaries cannot run under
QEMU on ARM64 (lfstack.push crash). A native ARM64 image must be built.

Adjusting runbook:
- Replace official Docker image with custom ARM64 build
- Add build step using Mattermost's ARM64 binary tarball
- Document the QEMU limitation for future reference

The agent now generates a runbook that includes building a custom ARM64 image from the official binary tarball, testing it in the sandbox, and deploying that image instead. The human gets a runbook that works on their actual hardware.

This is the core value of gap detection. The agent caught a problem it would have missed by default, because it systematically checked the assumptions underlying its recommendations. The human's weekend of debugging becomes a two-minute sandbox run.
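
For reference, the custom-build step might look like the sketch below. The Mattermost version and tarball URL are illustrative assumptions: verify the actual ARM64 release artifact name before building.

```shell
# Illustrative sketch: build a native ARM64 Mattermost image from the
# official binary tarball. The version and download URL are assumptions;
# check Mattermost's published release artifacts for the real arm64 tarball.
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
ARG MM_VERSION=9.5.1
ADD https://releases.mattermost.com/${MM_VERSION}/mattermost-${MM_VERSION}-linux-arm64.tar.gz /tmp/mm.tar.gz
RUN tar -xzf /tmp/mm.tar.gz -C /opt && rm /tmp/mm.tar.gz
WORKDIR /opt/mattermost
CMD ["bin/mattermost"]
EOF
echo "Dockerfile written; build with: docker build --platform linux/arm64 -t mattermost-arm64:local ."
```

The resulting image replaces mattermost/mattermost-team-edition in the deployment, per the adjusted runbook.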

When You Cannot Detect the Gap

Some knowledge gaps cannot be detected through automated checks. Compliance requirements, team conventions, undocumented infrastructure decisions – these exist in human context that the agent does not have access to.

For these cases, the runbook should explicitly list its remaining assumptions and ask the human to verify them:

## Assumptions Not Verified

The following assumptions could not be verified automatically. Please confirm
before executing this runbook on your infrastructure:

- [ ] No compliance requirements prohibit running containers as root
      (the PostgreSQL container runs as root during init)
- [ ] The team does not have a preferred secret management approach
      (this runbook uses plain Kubernetes Secrets)
- [ ] There is no existing database that should be reused instead of
      deploying a new one
- [ ] The target namespace does not have resource quotas that would
      block this deployment (total requests: 768Mi memory, 350m CPU)

Listing unverified assumptions is not a weakness – it is an honest accounting of what the agent knows and does not know. A runbook with a clear “verify these assumptions” section is more trustworthy than one that silently assumes everything is fine.