Incident Response Overview#

Security incidents in infrastructure environments follow a predictable lifecycle. The difference between a contained incident and a catastrophic breach is usually preparation and speed of response. This playbook covers the six phases of incident response with specific commands and procedures for Kubernetes and containerized infrastructure.

The phases are sequential but overlap in practice: you may be containing one aspect of an incident while still detecting the full scope.

Phase 1: Detection#

Detection comes from three sources: automated alerts, human observation, and external reports.

Automated Detection Sources#

Kubernetes audit logs reveal unauthorized API calls, privilege escalation attempts, and suspicious resource creation:

# Search audit logs for privilege escalation attempts
cat /var/log/kubernetes/audit/audit.log | jq 'select(.verb == "create" and .objectRef.resource == "clusterrolebindings")'

# Find who accessed secrets in the last hour
cat /var/log/kubernetes/audit/audit.log | jq 'select(.objectRef.resource == "secrets" and .requestReceivedTimestamp > "2026-02-22T10:00:00Z")'

# Detect exec into pods (common post-compromise action)
cat /var/log/kubernetes/audit/audit.log | jq 'select(.objectRef.subresource == "exec")'

Falco runtime alerts detect anomalous container behavior:

# Check Falco alerts for the last hour
kubectl logs -n falco daemonset/falco --since=1h | grep -i "Warning\|Error\|Critical"

# Common Falco rules that fire during incidents:
# - Terminal shell in container
# - Read sensitive file untrusted
# - Write below /etc
# - Contact K8s API server from container
# - Unexpected outbound connection
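
If Falco is configured with json_output: true, the alerts can be filtered by priority and rule name instead of grepped. A minimal sketch, reusing the namespace and daemonset name from the example above:

# Filter Falco JSON alerts by priority and print the rule name plus details
kubectl logs -n falco daemonset/falco --since=1h | grep '^{' | \
  jq -r 'select(.priority == "Critical" or .priority == "Error" or .priority == "Warning") | "\(.time) [\(.priority)] \(.rule): \(.output)"'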

Network flow logs show unusual traffic patterns:

# Check for connections to known malicious IPs (example with conntrack)
conntrack -L | grep -f known-malicious-ips.txt

# Check for unusual outbound connections from pods
kubectl exec -n monitoring netflow-collector -- \
  flow-query --src-namespace production --dst-external --last 1h
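
When flow logs are not available, a specific pod's live connections can be inspected from its node. This is a sketch assuming containerd and crictl; CONTAINER_ID is a placeholder for the container ID captured during containment, and the inspect output layout varies by runtime:

# Find the container's PID on the node, then list established connections
# inside its network namespace
POD_PID=$(crictl inspect CONTAINER_ID | jq -r '.info.pid')
nsenter -t "$POD_PID" -n ss -tupn state established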

Detection Indicators#

Classify what you are seeing before escalating:

| Indicator | Severity | Example |
|---|---|---|
| Unexpected privileged pod | Critical | Pod with hostPID, hostNetwork, or privileged: true in a workload namespace |
| Service account token used from outside cluster | Critical | API server audit log showing token use from an external IP |
| Unusual DNS queries | High | DNS lookups for cryptocurrency mining pools or C2 domains |
| Unexpected exec into pod | High | kubectl exec from an unfamiliar user or service account |
| Spike in 403/401 responses | Medium | Brute-force or credential stuffing attempt |
| New image from unknown registry | High | Pod using an image not from your approved registry |
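
The first row of the table can be checked directly against the API server. A minimal sketch; tighten the namespace list to your workload namespaces:

# List pods that are privileged or share host namespaces
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.hostPID == true or .spec.hostNetwork == true or any(.spec.containers[]; .securityContext.privileged == true)) | "\(.metadata.namespace)/\(.metadata.name)"'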

Phase 2: Triage#

Triage determines severity, scope, and urgency. Classify the severity first, then assess how far the compromise extends.

Severity Classification#

SEV-1 (Critical): Active data exfiltration, credential compromise confirmed,
                   unauthorized access to production data, cryptominer on production nodes
SEV-2 (High):     Compromised pod with no confirmed lateral movement,
                   vulnerability actively being exploited, unauthorized configuration change
SEV-3 (Medium):   Suspicious activity without confirmed compromise,
                   vulnerability detected but no exploitation evidence
SEV-4 (Low):      Policy violation, misconfiguration detected by scanner,
                   failed unauthorized access attempt

Scope Assessment#

Determine how far the compromise extends:

# Identify all pods using the same service account as the compromised pod
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.serviceAccountName == "COMPROMISED_SA") | "\(.metadata.namespace)/\(.metadata.name)"'

# Check if the compromised service account has cluster-wide permissions
kubectl auth can-i --list --as=system:serviceaccount:NAMESPACE:COMPROMISED_SA

# Check for lateral movement: pods created after the incident start time
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp | \
  tail -20

# Check for new RBAC bindings created during the incident window
kubectl get clusterrolebindings,rolebindings --all-namespaces \
  --sort-by=.metadata.creationTimestamp -o json | \
  jq '.items[] | select(.metadata.creationTimestamp > "2026-02-22T10:00:00Z") | {name: .metadata.name, namespace: .metadata.namespace, role: .roleRef.name}'
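
Cluster events are a quick supplementary signal for bounding the incident window:

# Recent cluster events in chronological order, to correlate with audit log findings
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -50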

Phase 3: Containment#

Containment prevents the incident from spreading. Act quickly, but capture evidence before taking any action that would destroy it.

Scenario: Compromised Pod#

# Step 1: Isolate the pod with a network policy that blocks all traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-compromised-pod
  namespace: NAMESPACE
spec:
  podSelector:
    matchLabels:
      app: COMPROMISED_APP
  policyTypes:
    - Ingress
    - Egress
  # Empty ingress/egress = deny all
EOF

# Step 2: Capture the pod state before modifying anything
kubectl get pod COMPROMISED_POD -n NAMESPACE -o yaml > evidence/pod-spec.yaml
kubectl logs COMPROMISED_POD -n NAMESPACE --all-containers > evidence/pod-logs.txt
kubectl describe pod COMPROMISED_POD -n NAMESPACE > evidence/pod-describe.txt

# Step 3: Capture the container filesystem for forensics
# Get the container ID
CONTAINER_ID=$(kubectl get pod COMPROMISED_POD -n NAMESPACE -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|containerd://||')

# On the node where the pod is running:
# Archive the container's root filesystem (path is typical for containerd with the
# runc v2 shim; adjust for your container runtime)
tar -C /run/containerd/io.containerd.runtime.v2.task/k8s.io/$CONTAINER_ID/rootfs \
  -cf evidence/container-filesystem.tar .

# Step 4: Capture running processes
kubectl exec COMPROMISED_POD -n NAMESPACE -- ps auxww > evidence/processes.txt 2>/dev/null || true

# Step 5: Scale down the deployment to zero (stops new pods but keeps the resource)
kubectl scale deployment COMPROMISED_DEPLOYMENT -n NAMESPACE --replicas=0
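
If there is any indication the container escaped to the host, treat the node itself as compromised; at minimum, keep new workloads off it. NODE_NAME is a placeholder for the node the pod ran on (it is recorded in evidence/pod-spec.yaml under spec.nodeName):

# Optional: prevent new pods from being scheduled onto the affected node
kubectl cordon NODE_NAME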

Scenario: Leaked Credentials#

# Step 1: Identify what credentials were leaked
# Check if it's a service account token
kubectl get secrets -n NAMESPACE -o json | \
  jq '.items[] | select(.type == "kubernetes.io/service-account-token") | .metadata.name'

# Step 2: Delete the compromised service account token
kubectl delete secret COMPROMISED_SECRET -n NAMESPACE

# Step 3: Rotate the service account (delete and recreate)
kubectl delete serviceaccount COMPROMISED_SA -n NAMESPACE
kubectl create serviceaccount COMPROMISED_SA -n NAMESPACE

# Step 4: Restart all pods using that service account so they pick up new tokens
# (assumes the deployments carry a serviceaccount label; otherwise restart the
#  affected deployments by name)
kubectl rollout restart deployment -n NAMESPACE -l serviceaccount=COMPROMISED_SA

# Step 5: If cloud provider credentials were leaked, rotate them immediately
# AWS example:
aws iam create-access-key --user-name SERVICE_USER
aws iam delete-access-key --user-name SERVICE_USER --access-key-id OLD_KEY_ID

# Update the Kubernetes secret with the new credentials
kubectl create secret generic aws-creds -n NAMESPACE \
  --from-literal=AWS_ACCESS_KEY_ID=NEW_KEY \
  --from-literal=AWS_SECRET_ACCESS_KEY=NEW_SECRET \
  --dry-run=client -o yaml | kubectl apply -f -

# Step 6: Check audit logs for any use of the leaked credentials
cat /var/log/kubernetes/audit/audit.log | \
  jq 'select(.user.username == "system:serviceaccount:NAMESPACE:COMPROMISED_SA")'
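
After rotating cloud credentials, it is worth a quick sanity check that the new key is valid and attributed to the expected principal. This assumes the aws CLI and the placeholder values from step 5:

# Confirm the new access key resolves to the intended IAM identity
AWS_ACCESS_KEY_ID=NEW_KEY AWS_SECRET_ACCESS_KEY=NEW_SECRET \
  aws sts get-caller-identity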

Scenario: Unauthorized Access to Cluster#

# Step 1: Identify the unauthorized user/token
# Check API server audit logs for the source IP and identity
cat /var/log/kubernetes/audit/audit.log | \
  jq 'select(.sourceIPs[] == "SUSPICIOUS_IP") | {user: .user.username, verb: .verb, resource: .objectRef.resource, time: .requestReceivedTimestamp}'

# Step 2: Revoke the user's access immediately
# If using RBAC, remove their bindings
kubectl delete clusterrolebinding UNAUTHORIZED_BINDING
kubectl delete rolebinding UNAUTHORIZED_BINDING -n NAMESPACE

# If using OIDC, revoke the token at the identity provider

# Step 3: If a kubeconfig was compromised, regenerate cluster certificates
# WARNING: This is disruptive and affects all cluster users
# Only do this if the compromise is confirmed and severe

# Step 4: Block the source IP at the network level
# Cloud provider firewall or network policy
# AWS Security Group example:
aws ec2 revoke-security-group-ingress --group-id SG_ID \
  --protocol tcp --port 6443 --cidr SUSPICIOUS_IP/32
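
# Security groups only have allow rules, so the revoke above helps only if a rule
# specifically allowed that IP. To actively deny an arbitrary source IP, add a
# network ACL entry instead (ACL_ID and the rule number are placeholders):
aws ec2 create-network-acl-entry --network-acl-id ACL_ID \
  --ingress --rule-number 90 --protocol tcp --port-range From=6443,To=6443 \
  --cidr-block SUSPICIOUS_IP/32 --rule-action deny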

# Step 5: Enable additional audit logging if not already active
cat <<EOF > /etc/kubernetes/audit-policy-emergency.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
      - group: "rbac.authorization.k8s.io"
        resources: ["*"]
EOF
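
The emergency policy only takes effect once the API server loads it; on a kubeadm-style control plane that means pointing the kube-apiserver flags at the file (paths below are typical defaults, adjust for your distribution):

# In /etc/kubernetes/manifests/kube-apiserver.yaml (static pod), set:
#   --audit-policy-file=/etc/kubernetes/audit-policy-emergency.yaml
#   --audit-log-path=/var/log/kubernetes/audit/audit.log
# The kubelet restarts the API server automatically when the manifest changes.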

Phase 4: Eradication#

After containment, remove the root cause. This means eliminating the attacker’s access, patching the vulnerability they exploited, and removing any persistence mechanisms they established.

# Check for backdoor pods or deployments
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | select(.spec.containers[].image | test("(crypto|miner|xmr|monero)"; "i")) | "\(.metadata.namespace)/\(.metadata.name)"'

# Check for unauthorized cron jobs
kubectl get cronjobs --all-namespaces

# Check for modified configmaps or secrets that might contain backdoors
kubectl get configmaps --all-namespaces --sort-by=.metadata.creationTimestamp -o json | \
  jq '.items[] | select(.metadata.creationTimestamp > "INCIDENT_START_TIME") | {name: .metadata.name, namespace: .metadata.namespace}'

# Check for webhook configurations (persistence mechanism)
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
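
# Inspect where each webhook points; a webhook calling an unfamiliar service or
# external URL is a strong persistence indicator (a supplementary check)
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o json | \
  jq '.items[] | {name: .metadata.name, targets: [(.webhooks // [])[].clientConfig | (.url // .service)]}'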

# Scan all running images for known vulnerabilities
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  for img in $(kubectl get pods -n "$ns" -o jsonpath='{.items[*].spec.containers[*].image}' 2>/dev/null | tr ' ' '\n' | sort -u); do
    echo "Scanning $ns: $img"
    trivy image --severity CRITICAL "$img" 2>/dev/null
  done
done

Phase 5: Recovery#

Recovery restores normal operations with confidence that the environment is clean.

# Step 1: Redeploy from known-good images (never reuse compromised images)
kubectl set image deployment/DEPLOYMENT CONTAINER=registry.example.com/app:KNOWN_GOOD_TAG -n NAMESPACE

# Step 2: Verify the deployment is healthy
kubectl rollout status deployment/DEPLOYMENT -n NAMESPACE
kubectl get pods -n NAMESPACE -l app=APP_NAME

# Step 3: Re-enable network access gradually
# First allow internal traffic only
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: recovery-policy
  namespace: NAMESPACE
spec:
  podSelector:
    matchLabels:
      app: APP_NAME
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: NAMESPACE
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: NAMESPACE
    - to:  # Allow DNS
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
EOF

# Step 4: Monitor closely for recurrence
kubectl logs -f deployment/DEPLOYMENT -n NAMESPACE &
kubectl get events -n NAMESPACE --watch &

# Step 5: Once stable, restore normal network policies
kubectl delete networkpolicy isolate-compromised-pod -n NAMESPACE
kubectl delete networkpolicy recovery-policy -n NAMESPACE
kubectl apply -f normal-network-policies/
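
Before closing the incident, a quick final check confirms that only the intended policies remain and nothing from the containment phase is still in place:

# Verify that no temporary isolation or recovery policies are left behind
kubectl get networkpolicy -n NAMESPACE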

Phase 6: Post-Incident Review#

The post-incident review (also called a postmortem or retrospective) happens within 48 hours of incident resolution while details are fresh. The goal is learning, not blame.

Review Document Structure#

# Incident Review: [Title]
Date: YYYY-MM-DD
Severity: SEV-X
Duration: Detection to resolution
Lead responder: [Name]

## Timeline
- HH:MM - First alert triggered (source: Falco/audit log/human)
- HH:MM - Incident confirmed, triage began
- HH:MM - Containment actions taken
- HH:MM - Root cause identified
- HH:MM - Eradication complete
- HH:MM - Recovery complete, normal operations restored

## Root Cause
What vulnerability or misconfiguration was exploited? How did the attacker gain access?

## Detection Gap
How long was the attacker active before detection? What would have detected them sooner?

## What Went Well
- Actions that limited blast radius
- Tools that provided useful data
- Communication that worked

## What Could Be Improved
- Detection gaps
- Runbook gaps
- Tool limitations
- Communication breakdowns

## Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| Patch CVE-XXXX-XXXXX across all clusters | Platform team | YYYY-MM-DD | Open |
| Add Falco rule for [specific behavior] | Security team | YYYY-MM-DD | Open |
| Restrict SA permissions in namespace X | App team | YYYY-MM-DD | Open |
| Update incident response runbook | Security team | YYYY-MM-DD | Open |

Key Metrics to Track#

  • Mean time to detect (MTTD): Time from incident start to first alert.
  • Mean time to contain (MTTC): Time from detection to containment.
  • Mean time to resolve (MTTR): Time from detection to full recovery.
  • Blast radius: Number of affected systems, users, or data records.
  • Action item completion rate: Percentage of post-incident actions completed on schedule.

Track these metrics across incidents over time. Improving MTTD matters more than improving MTTR because earlier detection means smaller blast radius. Invest in detection capabilities before investing in faster response.

Every incident is a forced learning opportunity. Organizations that skip the review repeat the same mistakes. Organizations that complete the review and follow through on action items build genuinely more resilient infrastructure.