Pod Lifecycle and Probes#
Understanding how Kubernetes starts, monitors, and stops pods is essential for running reliable services. Misconfigurations here cause cascading failures, dropped requests, and restart loops that are difficult to diagnose.
Pod Startup Sequence#
When a pod is scheduled, this is the exact order of operations:
- Init containers run sequentially. Each must exit 0 before the next starts.
- Regular containers start. The kubelet starts them in the order listed but does not wait for one to become ready before starting the next.
- postStart hooks fire (in parallel with the container’s main process).
- Startup probe begins checking (if defined).
- Once the startup probe passes, liveness and readiness probes begin.
Init Containers#
Init containers run before your application containers and are used for setup tasks: waiting for a dependency, running database migrations, cloning config from a remote source.
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo "waiting for db"; sleep 2; done']
  - name: run-migrations
    image: web-api:2.1.0
    command: ['./migrate', '--up']
    env:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: url
  containers:
  - name: web-api
    image: web-api:2.1.0

Init containers share the pod’s volumes but have their own image and resource requests. If any init container fails, Kubernetes restarts the pod (subject to restartPolicy). They run to completion every time a pod starts – including restarts.
The Three Probes#
Startup Probe#
The startup probe protects slow-starting containers. While it is running, liveness and readiness probes are disabled. Once it passes once, it never runs again for that container.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 2

This gives the application 60 seconds (30 attempts x 2 seconds) to start. Use this for Java apps, apps that load large models, or anything with variable startup time. Without it, a liveness probe with a short timeout will kill the container before it finishes starting.
Liveness Probe#
The liveness probe tells Kubernetes whether the container is alive. If it fails, Kubernetes kills and restarts the container. It answers: “Is this process deadlocked or broken beyond recovery?”
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

Critical mistake: checking downstream dependencies in your liveness probe. If your liveness probe checks whether the database is reachable, and the database goes down, Kubernetes will restart all your application pods – making the outage worse. The liveness probe should only check whether your process is functioning, not whether its dependencies are up.
// GOOD: liveness checks the process itself
func healthz(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// BAD: liveness checks the database -- causes cascading restarts
func healthz(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

Readiness Probe#
The readiness probe controls whether the pod receives traffic from Services. If it fails, the pod is removed from Service endpoints but not restarted. It answers: “Can this pod handle requests right now?”
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2

This is the right place to check dependencies. If the database is down, the readiness probe should fail so traffic stops flowing to this pod. The pod stays running and can recover when the dependency comes back.
Design pattern: Use /healthz for liveness (process check only) and /ready for readiness (process + critical dependencies).
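A minimal sketch of this split in Go, assuming a package-level db *sql.DB that the application opens during startup, and the port 8080 used in the probe configs above; the 1.5-second check timeout is an arbitrary choice kept below the readiness probe’s timeoutSeconds:

package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

var db *sql.DB // assumed to be initialized elsewhere at startup

func main() {
	mux := http.NewServeMux()

	// /healthz: liveness -- only proves the process can serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /ready: readiness -- also checks critical dependencies (here, the database).
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		// Keep the dependency check shorter than the probe's timeoutSeconds (2s above).
		ctx, cancel := context.WithTimeout(r.Context(), 1500*time.Millisecond)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", mux)
}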
Lifecycle Hooks#
postStart#
Runs immediately after the container starts, in parallel with the main process. The container is not marked Running until postStart completes. If it fails, the container is killed.
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "echo 'started' > /tmp/started"]

Avoid using postStart for anything slow – it blocks the pod from becoming Ready.
preStop#
Runs when Kubernetes decides to terminate the pod (scale-down, node drain, deployment rollout). This is where you implement graceful shutdown.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]

Graceful Shutdown#
When Kubernetes terminates a pod, this happens:
- Pod is marked Terminating, and removal from Service endpoints begins (in parallel with the steps below).
- The preStop hook runs; when it completes, SIGTERM is sent to PID 1.
- Kubernetes waits up to terminationGracePeriodSeconds (default: 30s), counted from the start of termination – including time spent in preStop.
- If the process is still running, SIGKILL is sent.
The problem: endpoint removal is asynchronous. The kube-proxy and ingress controllers may still route traffic to your pod for a few seconds after SIGTERM. The fix is a preStop sleep:
spec:
  terminationGracePeriodSeconds: 45
  containers:
  - name: web-api
    image: web-api:2.1.0
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]

The 5-second sleep in preStop gives the network time to drain connections. Your application should also handle SIGTERM by stopping acceptance of new connections and finishing in-flight requests. Set terminationGracePeriodSeconds high enough to cover the preStop delay plus your application’s drain time.
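A sketch of the application side in Go, assuming an http.Server on port 8080; the 25-second drain window is an assumption chosen so that drain plus the 5-second preStop sleep stays under terminationGracePeriodSeconds: 45:

package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	// ctx is cancelled when Kubernetes sends SIGTERM (after the preStop hook completes).
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-ctx.Done() // block until SIGTERM arrives

	// Stop accepting new connections and wait for in-flight requests to finish.
	// 25s drain + 5s preStop sleep stays under terminationGracePeriodSeconds: 45.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}

Shutdown closes the listener immediately and returns once idle connections are closed or the context expires, which maps directly onto the drain behavior described above.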
Complete Example#
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      terminationGracePeriodSeconds: 45
      initContainers:
      - name: wait-for-db
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z postgres-svc 5432; do sleep 2; done']
      containers:
      - name: web-api
        image: web-api:2.1.0
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30
          periodSeconds: 2
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 2
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]

Debugging Probes#
# See probe failures in events
kubectl describe pod web-api-6d4f8b7c9-x2k4m
# Common event messages:
# "Liveness probe failed: HTTP probe failed with statuscode: 503"
# "Readiness probe failed: connection refused"
# Check if a pod is in a restart loop
kubectl get pods -w
# RESTARTS column incrementing = liveness probe killing the container
# Test the probe endpoint manually
kubectl exec web-api-6d4f8b7c9-x2k4m -- curl -s localhost:8080/healthz

If RESTARTS keeps climbing, check whether your liveness probe is too aggressive (low timeout, low failure threshold) or is checking something it should not be (downstream dependencies).