Kubernetes Events Debugging#

Kubernetes events are the cluster’s built-in audit trail for what is happening to resources. When a pod fails to schedule, a container crashes, a node runs out of disk, or a volume fails to mount, the system records an event. Events are the first place to look when something goes wrong, and learning to read them efficiently separates quick diagnosis from hours of guessing.

Event Structure#

Every Kubernetes event has these fields:

type: Normal or Warning. Normal events are informational. Warning events indicate problems.
reason: Machine-readable cause (Scheduled, Pulling, Started, BackOff, FailedScheduling, etc.).
message: Human-readable description of what happened.
involvedObject: The resource the event is about (Pod, Node, Deployment, PVC, etc.).
source: The component that generated the event (kubelet, scheduler, controller-manager).
firstTimestamp: When the event first occurred.
lastTimestamp: When the event most recently occurred.
count: How many times the event has been observed. A high count means the problem is repeating.

Events are not persisted indefinitely. By default, the API server keeps events for one hour (controlled by the kube-apiserver --event-ttl flag), after which they are garbage collected. If you need historical events, export them to a logging system.
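
On kubeadm-style clusters, where the API server runs as a static pod in kube-system, you can quickly confirm whether a custom TTL is configured; on managed control planes the flag is not visible:

# Check whether a custom --event-ttl is set (kubeadm-style static pod assumed)
kubectl -n kube-system get pod -l component=kube-apiserver -o yaml | grep event-ttl
# No output means the default of one hour is in effect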

Viewing Events#

All Events in a Namespace#

# Default listing (not sorted chronologically)
kubectl get events -n production

# Sort by last timestamp for chronological order
kubectl get events -n production --sort-by='.lastTimestamp'

# Watch events in real time
kubectl get events -n production --watch

Events for a Specific Resource#

# Events for a specific pod
kubectl describe pod my-pod -n production
# The Events section at the bottom shows all events for this pod

# Events for a specific deployment
kubectl describe deployment web-api -n production

# Events for a specific node
kubectl describe node worker-1

Events Across All Namespaces#

kubectl get events --all-namespaces --sort-by='.lastTimestamp'

Filtering Events#

Raw event output is noisy. Filtering is essential for finding the signal.

Filter by Type (Warning Only)#

# Show only warning events -- these are the ones that indicate problems
kubectl get events -n production --field-selector type=Warning

This is the single most useful filter. Normal events tell you things are working. Warning events tell you things are broken.
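
The field selector also combines with --watch, which is useful for keeping a live feed of problems open in a terminal during a rollout:

# Live feed of warning events only
kubectl get events -n production --field-selector type=Warning --watch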

Filter by Reason#

# Find all scheduling failures
kubectl get events --all-namespaces --field-selector reason=FailedScheduling

# Find all Failed events (this reason covers image pull failures, among others)
kubectl get events --all-namespaces --field-selector reason=Failed

# Find all OOMKilled events (requires searching message text)
kubectl get events --all-namespaces -o json | \
  jq -r '.items[] | select(.message | test("OOM")) |
    "\(.metadata.namespace)/\(.involvedObject.name): \(.message)"'

Filter by Involved Object#

# Events for a specific object type
kubectl get events -n production --field-selector involvedObject.kind=Pod

# Events for a specific named resource
kubectl get events -n production \
  --field-selector involvedObject.name=web-api-7d4f8b6c9-x2k4p

# Events for a specific node
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=worker-1

Combined Filters#

# Warning events for pods in production
kubectl get events -n production \
  --field-selector type=Warning,involvedObject.kind=Pod

Custom Output Columns#

# Compact output with the fields that matter
kubectl get events -n production \
  -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message

Common Event Patterns and What They Mean#

Scheduling Failures#

Event: FailedScheduling with message containing Insufficient cpu or Insufficient memory

Warning  FailedScheduling  pod/web-api-xyz  0/3 nodes are available: 3 Insufficient cpu.

Fix: Reduce resource requests, add nodes, or check current allocation with kubectl describe nodes | grep -A 5 "Allocated resources".

If the message mentions node(s) had taint, all nodes have taints the pod does not tolerate. Add tolerations to the pod spec or untaint the nodes.
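
To see which taints are in play, list them per node and remove any that should not be there; the taint key below is only a placeholder:

# List taints on every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Remove a taint (the trailing "-" deletes it); "dedicated=batch" is a placeholder key
kubectl taint nodes worker-1 dedicated=batch:NoSchedule-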

Image Pull Failures#

Event: Failed with message Failed to pull image or ErrImagePull

Warning  Failed     pod/web-api-xyz  Failed to pull image "registry.example.com/web-api:2.0.0": rpc error: code = NotFound desc = failed to pull and unpack image
Warning  BackOff    pod/web-api-xyz  Back-off pulling image "registry.example.com/web-api:2.0.0"

Cause: The image does not exist, the tag is wrong, or the node cannot authenticate with the registry.

Fix: Verify the image and tag exist. Check image pull secrets:

# Verify the image exists
docker manifest inspect registry.example.com/web-api:2.0.0

# Check if the pod has an imagePullSecret configured
kubectl get pod web-api-xyz -n production -o jsonpath='{.spec.imagePullSecrets}'

# Verify the secret exists and has valid credentials
kubectl get secret regcred -n production -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
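
If the secret is missing or its credentials are stale, recreating it is usually the quickest fix; the server, username, and token below are placeholders:

# Recreate the pull secret (values are placeholders)
kubectl create secret docker-registry regcred -n production \
  --docker-server=registry.example.com \
  --docker-username=deploy-bot \
  --docker-password='<registry-token>' \
  --dry-run=client -o yaml | kubectl apply -f -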

CrashLoopBackOff#

Event: BackOff with message Back-off restarting failed container

Warning  BackOff  pod/worker-abc  Back-off restarting failed container

Cause: The container starts, crashes, and Kubernetes restarts it with exponentially increasing delays. The container’s own logs explain the actual error.

Fix:

# Check the container's logs from the current (or previous crashed) instance
kubectl logs worker-abc -n production
kubectl logs worker-abc -n production --previous

# Check the exit code
kubectl get pod worker-abc -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit code 1: application error. Exit code 137: SIGKILL (usually OOMKilled). Exit code 139: segfault.
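
To survey crash exit codes across the whole namespace rather than one pod at a time, a jq pass over the pod list works; a sketch:

# Last-terminated state for every container in the namespace
kubectl get pods -n production -o json | \
  jq -r '.items[] | .metadata.name as $pod | .status.containerStatuses[]? |
    select(.lastState.terminated != null) |
    "\($pod)/\(.name): exit \(.lastState.terminated.exitCode) (\(.lastState.terminated.reason))"'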

Volume Mount Failures#

Event: FailedMount. The PersistentVolume cannot be mounted: it is still attached to another node, the storage class is unavailable, or the PVC is stuck in Pending.

kubectl get pvc -n production
kubectl describe pvc data-db-0 -n production
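
If the PVC is Bound but mounting still fails, check whether the underlying volume is attached to the right node; VolumeAttachment objects cover CSI-managed volumes:

# Is the volume attached, and to which node? (CSI-managed volumes)
kubectl get volumeattachments

# Inspect the PersistentVolume bound to the claim
kubectl describe pv $(kubectl get pvc data-db-0 -n production -o jsonpath='{.spec.volumeName}')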

Probe Failures#

Event: Unhealthy, with the liveness or readiness probe failure details in the message. Liveness failures restart the container; readiness failures remove the pod from Service endpoints. A common cause is an initialDelaySeconds that is shorter than the application's startup time.

kubectl get pod web-api-xyz -n production -o jsonpath='{.spec.containers[0].livenessProbe}'
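
To verify the endpoint itself, call it from inside the container; the port and path below are placeholders to replace with whatever the probe spec above reports:

# Call the probe endpoint from inside the container (port and path are placeholders;
# use curl -sf instead if the image does not ship wget)
kubectl exec web-api-xyz -n production -- wget -qO- http://localhost:8080/healthz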

OOMKilled#

An OOM kill does not always show up as a distinct event, but it is visible in the pod status: the container exceeded its memory limit and was killed.

kubectl get pods -n production -o json | \
  jq -r '.items[] | .status.containerStatuses[]? |
    select(.lastState.terminated.reason == "OOMKilled") |
    "\(.name) restartCount=\(.restartCount)"'

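For any container that shows up here, compare its configured memory limit with live usage; kubectl top requires metrics-server, and the pod name below is a placeholder:

# Configured memory limits per container (pod name is a placeholder)
kubectl get pod worker-abc -n production \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'

# Live memory usage (requires metrics-server)
kubectl top pod worker-abc -n production --containers
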
Node-Level Events#

Node events reveal infrastructure issues such as NodeNotReady, EvictionThresholdMet, OOMKilling, and KernelDeadlock:

kubectl describe node worker-1 | tail -30

# Find nodes with pressure conditions
kubectl get nodes -o json | \
  jq -r '.items[] | select(any(.status.conditions[]; .type != "Ready" and .status == "True")) |
    "\(.metadata.name): \([.status.conditions[] | select(.type != "Ready" and .status == "True") | .type])"'

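Node-related warning events complement the condition check; events for Node objects are typically recorded in the default namespace, so querying all namespaces is the safe option:

# Warning events for Node objects across the cluster
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Node,type=Warning --sort-by='.lastTimestamp'
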
Event-Based Alerting#

Kubernetes Event Exporter#

Event Exporter watches all cluster events and forwards them to external sinks (Slack, Elasticsearch, webhooks). Configure it to route warning events to alerting channels and all events to a log store for post-incident analysis:

# event-exporter-config.yaml (ConfigMap data)
logLevel: error
route:
  routes:
    - match:
        - receiver: "slack-warnings"
          kind: "Pod|Node|Deployment"
          type: "Warning"
    - match:
        - receiver: "elasticsearch-all"
receivers:
  - name: "slack-warnings"
    webhook:
      endpoint: "https://hooks.slack.com/services/T00/B00/xxx"
      headers:
        Content-Type: application/json
      layout:
        text: "{{ .Type }} {{ .Reason }} in {{ .Namespace }}/{{ .InvolvedObject.Name }}: {{ .Message }}"
  - name: "elasticsearch-all"
    elasticsearch:
      hosts:
        - "http://elasticsearch:9200"
      index: kube-events
      useEventID: true
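
The config above is plain ConfigMap data; one way to ship it, assuming the exporter runs as a Deployment named event-exporter in a monitoring namespace (adjust both names to your install):

# Load the config into the cluster and restart the exporter to pick it up
# (namespace, ConfigMap name, and Deployment name are assumptions -- match your install)
kubectl create configmap event-exporter-cfg -n monitoring \
  --from-file=config.yaml=event-exporter-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment event-exporter -n monitoring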

Kubewatch#

Kubewatch is a simpler tool focused on resource state changes. Install via Helm:

helm install kubewatch kubewatch/kubewatch \
  --set rbac.create=true \
  --set slack.enabled=true \
  --set slack.channel="#k8s-alerts" \
  --set slack.token="xoxb-your-token" \
  --set resourcesToWatch.pod=true \
  --set resourcesToWatch.deployment=true \
  --set namespaceToWatch="production"

Prometheus Event Metrics#

If your monitoring stack exposes Kubernetes event counts as Prometheus metrics (typically via an event exporter rather than kube-state-metrics itself), create alerting rules for recurring problems. The kube_event_count metric below is an example name; adjust it to match what your exporter emits:

groups:
- name: kubernetes-events
  rules:
  - alert: PodSchedulingFailure
    expr: increase(kube_event_count{reason="FailedScheduling",type="Warning"}[15m]) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Repeated scheduling failures detected"

Debugging Workflow Using Events#

When a workload is not behaving as expected, follow this event-driven debugging sequence:

# 1. Get warning events in the namespace, most recent first
kubectl get events -n production --field-selector type=Warning --sort-by='.lastTimestamp'

# 2. If events point to a specific pod, get its full event history
kubectl describe pod <pod-name> -n production

# 3. If events mention scheduling, check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# 4. If events mention image pull, verify the image
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].image}'

# 5. If events mention volume mount, check PVC status
kubectl get pvc -n production

# 6. If events mention probe failures, check application logs
kubectl logs <pod-name> -n production --previous

# 7. If no events exist (event TTL expired), check pod status directly
kubectl get pod <pod-name> -n production -o yaml | grep -A 20 "status:"
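
Before stepping through individual resources, a quick tally of recent warnings grouped by reason can show whether the problem is one noisy pod or something cluster-wide; a small jq sketch:

# Optional: count recent warnings by reason across the cluster
kubectl get events -A --field-selector type=Warning -o json | \
  jq -r '.items | group_by(.reason) | map({reason: .[0].reason, count: length}) |
    sort_by(-.count) | .[] | "\(.count)\t\(.reason)"'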

Events are ephemeral by design. For post-incident analysis, make sure your event exporter or logging pipeline is capturing events before you need them. Discovering that events expired before you could read them is one of the more frustrating Kubernetes debugging experiences.