Kubernetes Events Debugging#
Kubernetes events are the cluster’s built-in audit trail for what is happening to resources. When a pod fails to schedule, a container crashes, a node runs out of disk, or a volume fails to mount, the system records an event. Events are the first place to look when something goes wrong, and learning to read them efficiently separates quick diagnosis from hours of guessing.
Event Structure#
Every Kubernetes event has these fields:
| Field | Description |
|---|---|
| type | Normal or Warning. Normal events are informational; Warning events indicate problems. |
| reason | Machine-readable cause: Scheduled, Pulling, Started, BackOff, FailedScheduling, etc. |
| message | Human-readable description of what happened. |
| involvedObject | The resource the event is about (Pod, Node, Deployment, PVC, etc.). |
| source | The component that generated the event (kubelet, scheduler, controller-manager). |
| firstTimestamp | When the event first occurred. |
| lastTimestamp | When the event most recently occurred. |
| count | How many times the event has been observed. A high count means the problem is repeating. |
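To see these fields on a real object, dump the raw events as YAML; the namespace here is simply the one used throughout the examples below:
# Raw event objects show every field from the table above
kubectl get events -n production -o yaml | head -n 40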
Events are not persisted indefinitely. By default, the API server keeps events for 1 hour. After that, they are garbage collected. If you need historical events, export them to a logging system.
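If no export pipeline is in place yet and an incident is underway, a quick snapshot preserves the evidence past the TTL; a minimal example:
# Save every current event to a file before it is garbage collected
kubectl get events --all-namespaces -o json > events-snapshot.json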
Viewing Events#
All Events in a Namespace#
# Default listing (output order is not guaranteed)
kubectl get events -n production
# Sort by last-seen timestamp for chronological order
kubectl get events -n production --sort-by='.lastTimestamp'
# Watch events in real time
kubectl get events -n production --watch
Events for a Specific Resource#
# Events for a specific pod
kubectl describe pod my-pod -n production
# The Events section at the bottom shows all events for this pod
# Events for a specific deployment
kubectl describe deployment web-api -n production
# Events for a specific node
kubectl describe node worker-1
Events Across All Namespaces#
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
Filtering Events#
Raw event output is noisy. Filtering is essential for finding the signal.
Filter by Type (Warning Only)#
# Show only warning events -- these are the ones that indicate problems
kubectl get events -n production --field-selector type=Warning
This is the single most useful filter. Normal events tell you things are working. Warning events tell you things are broken.
Filter by Reason#
# Find all scheduling failures
kubectl get events --all-namespaces --field-selector reason=FailedScheduling
# Find failed events (includes image pull and container start failures)
kubectl get events --all-namespaces --field-selector reason=Failed
# Find all OOM-related events (requires searching message text)
kubectl get events --all-namespaces -o json | \
jq -r '.items[] | select(.message | test("OOM")) |
"\(.metadata.namespace)/\(.involvedObject.name): \(.message)"'
Filter by Involved Object#
# Events for a specific object type
kubectl get events -n production --field-selector involvedObject.kind=Pod
# Events for a specific named resource
kubectl get events -n production \
--field-selector involvedObject.name=web-api-7d4f8b6c9-x2k4p
# Events for a specific node
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=worker-1
Combined Filters#
# Warning events for pods in production
kubectl get events -n production \
--field-selector type=Warning,involvedObject.kind=Pod
Custom Output Columns#
# Compact output with the fields that matter
kubectl get events -n production \
-o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message
Common Event Patterns and What They Mean#
Scheduling Failures#
Event: FailedScheduling with message containing Insufficient cpu or Insufficient memory
Warning FailedScheduling pod/web-api-xyz 0/3 nodes are available: 3 Insufficient cpu.
Fix: Reduce resource requests, add nodes, or check current allocation with kubectl describe nodes | grep -A 5 "Allocated resources".
If the message mentions node(s) had taint, all nodes have taints the pod does not tolerate. Add tolerations to the pod spec or untaint the nodes.
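To see what is tainted, and to untaint a node if that is the right call, something like the following works; the taint key dedicated=gpu is only an example:
# List every node's taints to see what the pod would need to tolerate
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Remove a taint by key and effect (the trailing "-" deletes it)
kubectl taint nodes worker-1 dedicated=gpu:NoSchedule-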
Image Pull Failures#
Event: Failed with message Failed to pull image or ErrImagePull
Warning Failed pod/web-api-xyz Failed to pull image "registry.example.com/web-api:2.0.0": rpc error: code = NotFound desc = failed to pull and unpack image
Warning BackOff pod/web-api-xyz Back-off pulling image "registry.example.com/web-api:2.0.0"
Cause: The image does not exist, the tag is wrong, or the node cannot authenticate with the registry.
Fix: Verify the image and tag exist. Check image pull secrets:
# Verify the image exists
docker manifest inspect registry.example.com/web-api:2.0.0
# Check if the pod has an imagePullSecret configured
kubectl get pod web-api-xyz -n production -o jsonpath='{.spec.imagePullSecrets}'
# Verify the secret exists and has valid credentials
kubectl get secret regcred -n production -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
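If the secret is missing or its credentials are stale, recreating it and attaching it to the service account is usually the fix. A sketch with placeholder server, username, and token values:
# Recreate the registry credential secret (all values are placeholders)
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=ci-bot \
--docker-password='<registry-token>' \
-n production
# Let new pods in the namespace pick it up automatically
kubectl patch serviceaccount default -n production \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'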
CrashLoopBackOff#
Event: BackOff with message Back-off restarting failed container
Warning BackOff pod/worker-abc Back-off restarting failed container
Cause: The container starts, crashes, and Kubernetes restarts it with exponentially increasing delays. The container's own logs explain the actual error.
Fix:
# Check the container's logs from the current (or previous crashed) instance
kubectl logs worker-abc -n production
kubectl logs worker-abc -n production --previous
# Check the exit code
kubectl get pod worker-abc -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit code 1: application error. Exit code 137: killed by SIGKILL (usually OOMKilled). Exit code 139: segfault.
Volume Mount Failures#
Event: FailedMount – the PersistentVolume cannot be mounted (attached to another node, storage class unavailable, PVC pending).
kubectl get pvc -n production
kubectl describe pvc data-db-0 -n production
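If the PVC is bound but the mount still fails, the volume may still be attached to another node. Assuming a CSI driver that tracks attachments with VolumeAttachment objects, this shows where each volume currently sits:
# Cluster-scoped view of which node each volume is attached to
kubectl get volumeattachments
# PVC-scoped events often contain the exact attach or mount error
kubectl get events -n production --field-selector involvedObject.kind=PersistentVolumeClaim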
Probe Failures#
Event: Unhealthy with liveness or readiness probe details. Liveness failures restart the container. Readiness failures remove the pod from service endpoints. Common cause: initialDelaySeconds too short.
kubectl get pod web-api-xyz -n production -o jsonpath='{.spec.containers[0].livenessProbe}'
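For the readiness side, check whether the pod is still listed behind its Service; the Service name web-api here is an assumption:
# A pod failing its readiness probe is removed from the Service's endpoints
kubectl get endpoints web-api -n production -o wide
# Inspect the readiness probe timing -- a too-short initialDelaySeconds is a frequent culprit
kubectl get pod web-api-xyz -n production -o jsonpath='{.spec.containers[0].readinessProbe}'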
OOMKilled#
Not always a direct event, but visible in pod status. The container exceeded its memory limit.
kubectl get pods -n production -o json | \
jq -r '.items[] | .status.containerStatuses[]? |
select(.lastState.terminated.reason == "OOMKilled") |
"\(.name) restartCount=\(.restartCount)"'Node-Level Events#
Node-Level Events#
Node events reveal infrastructure issues – NodeNotReady, EvictionThreshold, OOMKilling, KernelDeadlock:
kubectl describe node worker-1 | tail -30
# Find nodes with pressure conditions
kubectl get nodes -o json | \
jq -r '.items[] | select(.status.conditions[] | select(.type != "Ready" and .status == "True")) |
"\(.metadata.name): \([.status.conditions[] | select(.status == "True") | .type])"'Event-Based Alerting#
Kubernetes Event Exporter#
Event Exporter watches all cluster events and forwards them to external sinks (Slack, Elasticsearch, webhooks). Configure it to route warning events to alerting channels and all events to a log store for post-incident analysis:
# event-exporter-config.yaml (ConfigMap data)
logLevel: error
route:
  routes:
    - match:
        - receiver: "slack-warnings"
          kind: "Pod|Node|Deployment"
          type: "Warning"
    - match:
        - receiver: "elasticsearch-all"
receivers:
  - name: "slack-warnings"
    webhook:
      endpoint: "https://hooks.slack.com/services/T00/B00/xxx"
      headers:
        Content-Type: application/json
      layout:
        text: "{{ .Type }} {{ .Reason }} in {{ .Namespace }}/{{ .InvolvedObject.Name }}: {{ .Message }}"
  - name: "elasticsearch-all"
    elasticsearch:
      hosts:
        - "http://elasticsearch:9200"
      index: kube-events
      useEventID: true
Kubewatch#
Kubewatch is a simpler tool focused on resource state changes. Install via Helm:
helm install kubewatch kubewatch/kubewatch \
--set rbac.create=true \
--set slack.enabled=true \
--set slack.channel="#k8s-alerts" \
--set slack.token="xoxb-your-token" \
--set resourcesToWatch.pod=true \
--set resourcesToWatch.deployment=true \
--set namespaceToWatch="production"Prometheus Event Metrics#
If your monitoring pipeline exposes Kubernetes event counts as Prometheus metrics (kube-state-metrics does not export events itself, so this typically comes from an event exporter or a custom collector), you can create alerting rules for recurring problems:
groups:
  - name: kubernetes-events
    rules:
      - alert: PodSchedulingFailure
        expr: increase(kube_event_count{reason="FailedScheduling",type="Warning"}[15m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Repeated scheduling failures detected"
Debugging Workflow Using Events#
When a workload is not behaving as expected, follow this event-driven debugging sequence:
# 1. Get warning events in the namespace, most recent first
kubectl get events -n production --field-selector type=Warning --sort-by='.lastTimestamp'
# 2. If events point to a specific pod, get its full event history
kubectl describe pod <pod-name> -n production
# 3. If events mention scheduling, check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# 4. If events mention image pull, verify the image
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].image}'
# 5. If events mention volume mount, check PVC status
kubectl get pvc -n production
# 6. If events mention probe failures, check application logs
kubectl logs <pod-name> -n production --previous
# 7. If no events exist (event TTL expired), check pod status directly
kubectl get pod <pod-name> -n production -o yaml | grep -A 20 "status:"
Events are ephemeral by design. For post-incident analysis, make sure your event exporter or logging pipeline is capturing events before you need them. Discovering that events expired before you could read them is one of the more frustrating Kubernetes debugging experiences.