Scenario: Recovering from a Failed Deployment#
You are helping when someone reports: “we deployed a new version and it is causing errors,” “pods are not starting,” or “the service is down after a deploy.” The goal is to restore service as quickly as possible, then prevent recurrence.
Time matters here. Every minute of diagnosis while the service is degraded is a minute of user impact. The bias should be toward fast rollback first, then root cause analysis second.
Step 1 – Assess the Damage#
Before doing anything, understand the current state. Is the service fully down, or is it partially degraded with some old pods still serving traffic?
# Check the rollout status -- is it stuck or progressing?
kubectl rollout status deployment/my-app -n production
# If stuck, you will see something like:
# "Waiting for deployment "my-app" rollout to finish: 2 out of 5 new replicas have been updated..."
# This means the rollout is stalled -- old pods may still be running
# Get the current pod state
kubectl get pods -l app=my-app -n production -o wide
# Check which ReplicaSets exist (old vs new)
kubectl get rs -l app=my-app -n production
# The ReplicaSet with DESIRED > 0 and READY < DESIRED is the new (failing) one
# The ReplicaSet with READY > 0 from the previous revision may still be serving traffic
Key questions to answer:
- How many old pods are still running? If the rolling update has not completed, old pods are still serving traffic, meaning the service is degraded but not down.
- How many new pods exist, and what state are they in? CrashLoopBackOff, Pending, ImagePullBackOff, and Running-but-erroring each point to different root causes.
- Is the deployment strategy set to RollingUpdate or Recreate? If Recreate, all old pods were killed before new pods started – the service is fully down.
# Check the deployment strategy
kubectl get deployment my-app -n production -o jsonpath='{.spec.strategy.type}'
# Check error rates if you have monitoring
# Prometheus: rate of 5xx responses
# rate(http_requests_total{status=~"5.."}[5m])
Step 2 – Quick Triage: Identify the Failure Mode#
The pod status tells you what category of failure you are dealing with:
CrashLoopBackOff – Application Crash#
The container starts and immediately exits. Kubernetes restarts it with exponential backoff.
# Get logs from the crashing container
kubectl logs deployment/my-app -n production --previous
# --previous shows logs from the last terminated container
# If there are multiple containers:
kubectl logs deployment/my-app -n production -c my-container --previous
# Check the exit code
kubectl get pods -l app=my-app -n production -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}'
# Exit code 1: application error
# Exit code 137: OOMKilled (container exceeded memory limit)
# Exit code 139: segfault
Common causes: missing environment variable, bad config, a failed database migration, an incompatible dependency version, a memory limit too low for the new version.
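Two quick follow-up checks for the configuration and memory cases, using the same single-container deployment as the rest of this guide:
# List the environment variables defined in the pod template -- compare against what the app expects
kubectl set env deployment/my-app -n production --list
# Inspect the resource limits the new version runs with (relevant for exit code 137)
kubectl get deployment my-app -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'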
ImagePullBackOff – Wrong Image or Auth Failure#
kubectl describe pod <pod-name> -n production | grep -A 5 "Events:"
# Look for: "Failed to pull image" or "unauthorized"
Common causes: image tag does not exist (typo, build not pushed), registry credentials expired, private registry without imagePullSecrets configured.
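To narrow these down, confirm exactly which image the deployment references and whether pull credentials are wired up (again assuming the single-container spec used throughout):
# The exact image reference the new pods try to pull -- check for tag typos
kubectl get deployment my-app -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# Which imagePullSecrets (if any) the pod template references
kubectl get deployment my-app -n production \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}'
# Verify a registry credential secret actually exists in the namespace
kubectl get secrets -n production --field-selector type=kubernetes.io/dockerconfigjson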
Pending – Resource Shortage#
kubectl describe pod <pod-name> -n production | grep -A 10 "Events:"
# Look for: "Insufficient cpu", "Insufficient memory", "0/5 nodes are available"
Common causes: new version requests more resources than available, node pool at maximum, resource quota exceeded.
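To check whether the cluster actually has room for the new pods (kubectl top requires metrics-server to be installed):
# Requested vs allocatable resources on each node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Current node utilization
kubectl top nodes
# Any namespace quota that could be blocking the new pods
kubectl describe resourcequota -n production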
Running but Errors in Logs#
The pods are up but the application is returning errors.
kubectl logs deployment/my-app -n production --tail=100 -f
# Look for connection refused, timeout, authentication failure, nil pointer, etc.
Common causes: new code has a bug, configuration change broke connectivity to a dependency (database, external API), feature flag misconfiguration.
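If the logs point at a dependency rather than the code itself, test connectivity from inside the cluster. The hostname, port, and ConfigMap name below are placeholders for whatever the errors mention:
# Probe the dependency from a running pod (assumes nc is available in the image)
kubectl exec deploy/my-app -n production -- nc -zv db.internal.example.com 5432
# Or from a throwaway pod if the app image has no shell or nc
kubectl run net-check --rm -it --restart=Never -n production \
  --image=busybox --command -- nc -zv db.internal.example.com 5432
# Review the configuration the new pods actually mounted
kubectl get configmap my-app-config -n production -o yaml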
Step 3 – Decision: Rollback or Fix Forward?#
This is the critical decision point. Make it quickly.
Rollback when:
- The service is down or significantly degraded
- Users are actively impacted
- The fix is not immediately obvious
- The deployment introduced multiple changes (hard to isolate the problem)
- You are outside business hours and the on-call engineer is not the author of the change
Fix forward when:
- The issue is minor (e.g., a non-critical endpoint is broken, but core functionality works)
- The fix is obvious and can be applied in minutes (e.g., a missing environment variable)
- Rolling back would lose other critical changes that were bundled in the same deployment
- The broken version has already run a database migration that is not backward-compatible
Cost comparison:
- Rollback: near zero risk. Kubernetes keeps the old ReplicaSet with the previous pod spec. The rollback creates pods identical to what was running before.
- Fix forward: variable risk. You are debugging under pressure while the service is degraded. The fix might introduce new issues.
When in doubt, roll back. You can always deploy the fix later after proper analysis.
Step 4 – Execute the Rollback#
Standard Kubernetes Rollback#
# Roll back to the previous version
kubectl rollout undo deployment/my-app -n production
# Verify the rollback is progressing
kubectl rollout status deployment/my-app -n production
# Check pods are healthy
kubectl get pods -l app=my-app -n production
If you need to go back further than one revision:
# View rollout history
kubectl rollout history deployment/my-app -n production
# Roll back to a specific revision
kubectl rollout undo deployment/my-app -n production --to-revision=5
# View what a specific revision contained
kubectl rollout history deployment/my-app -n production --revision=5
Rollback with ArgoCD#
ArgoCD tracks the desired state in git. Rolling back means syncing to a previous commit.
# Option A: Sync to a specific git commit
argocd app sync my-app --revision <previous-commit-sha>
# Option B: Revert the commit in git (preferred -- keeps git history clean)
git revert HEAD
git push origin main
# ArgoCD auto-syncs to the reverted state within its polling interval (default 3 minutes)
# Option C: If auto-sync is disabled, manually sync
argocd app sync my-app
Rollback with Flux#
Flux reconciles from git. Revert the commit and Flux handles the rest.
git revert HEAD
git push origin main
# Flux reconciles within its interval (default 1 minute for source, 10 minutes for kustomization)
# Force immediate reconciliation
flux reconcile kustomization my-app --with-source
Post-Rollback Verification#
# Confirm all pods are running the old version
kubectl get pods -l app=my-app -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Check rollout completed successfully
kubectl rollout status deployment/my-app -n production
# Should say: "deployment "my-app" successfully rolled out"
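# Optional in-cluster smoke test (assumes a Service named my-app exposing /healthz on port 80)
kubectl run smoke-test --rm -it --restart=Never -n production \
  --image=curlimages/curl --command -- curl -sf http://my-app/healthz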
# Monitor error rates for 5-10 minutes to confirm recovery
# Prometheus: rate(http_requests_total{status=~"5.."}[5m]) should drop to baseline
Step 5 – If Rollback Does Not Work#
Sometimes the rollback itself fails. This happens when:
- The old ReplicaSet has been garbage collected (Kubernetes keeps revisionHistoryLimit revisions, default 10; see the check sketched after this list)
- The old version depended on a resource that no longer exists (deleted ConfigMap, expired secret, removed database table)
- A forward-only database migration makes the old code incompatible
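For the first cause, check how many old ReplicaSets the deployment is configured to retain; empty output means the default of 10 applies. Raising the limit is cheap and keeps more rollback targets available for future incidents:
kubectl get deployment my-app -n production -o jsonpath='{.spec.revisionHistoryLimit}'
kubectl patch deployment my-app -n production -p '{"spec":{"revisionHistoryLimit":20}}'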
When Revision History is Lost#
# Check if any revision history remains
kubectl rollout history deployment/my-app -n production
# If empty or only shows the current broken revision:
# Manually set the image to the known-good version
kubectl set image deployment/my-app -n production \
my-container=registry.example.com/my-app:1.2.3
# Or patch the full deployment spec
kubectl patch deployment my-app -n production --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "registry.example.com/my-app:1.2.3"}
]'
When the Old Version Cannot Run#
If a database migration makes rollback impossible, you must fix forward:
# Scale the broken deployment to zero to stop error noise
kubectl scale deployment/my-app -n production --replicas=0
# Put up a maintenance page if you have one
kubectl scale deployment/maintenance-page -n production --replicas=1
# Fix the issue, build a new image, and deploy
kubectl set image deployment/my-app -n production \
my-container=registry.example.com/my-app:1.2.4-hotfix
kubectl scale deployment/my-app -n production --replicas=5
Nuclear Option#
If nothing else works and you need the service running immediately:
# Delete the deployment entirely
kubectl delete deployment my-app -n production
# Recreate from a known-good manifest
kubectl apply -f /path/to/known-good-manifest.yaml
# Or from git
kubectl apply -f https://raw.githubusercontent.com/org/repo/last-known-good-commit/k8s/deployment.yaml
This loses rollout history but gets the service back. Operational priority is service recovery; historical cleanliness is secondary.
Step 6 – Post-Incident Analysis#
Once the service is stable, investigate what went wrong. Do not skip this step – without it, the same failure will recur.
Root Cause Investigation#
# Get the events from the failed rollout period
kubectl get events -n production --sort-by='.lastTimestamp' | tail -50
# Examine the failed pod's full description
kubectl describe pod <failed-pod-name> -n production
# Check if the issue was in the container image itself
# Pull and inspect locally
docker pull registry.example.com/my-app:1.2.4
docker run --rm -it registry.example.com/my-app:1.2.4 /bin/sh
# Check configuration files, binary version, dependencies
Questions to Answer#
- What was wrong with the new version? Code bug, configuration error, missing dependency, resource issue?
- Did the deployment strategy catch the issue? With a rolling update, old pods should have continued serving traffic. Did that happen?
- Could pre-deployment tests have caught this? Was there a readiness probe? Did it test the right thing?
- How long was the service degraded? From deployment start to rollback completion; one way to estimate this from events is sketched after this list.
- Were there alerts? Did monitoring detect the problem before a human noticed?
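One way to bound the degradation window is from the deployment's own scaling events: the first scale-up of the bad ReplicaSet marks deployment start, and the scale-up of the old ReplicaSet marks rollback completion. Events are retained only briefly (one hour by default), so capture them early:
kubectl get events -n production \
  --field-selector involvedObject.kind=Deployment,involvedObject.name=my-app \
  --sort-by='.lastTimestamp' \
  -o custom-columns='TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message'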
Step 7 – Preventing Future Failed Deployments#
Add Readiness and Startup Probes#
If the new version crashes on startup, a readiness probe prevents it from receiving traffic:
spec:
containers:
- name: my-app
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
      failureThreshold: 30 # 30 * 5s = 150s max startup time
Configure Progressive Rollouts#
Use maxUnavailable: 0 so old pods are never killed until new pods are ready:
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
  minReadySeconds: 30 # Pod must be stable for 30s before considered available
Implement Canary Deployments#
For critical services, use Argo Rollouts or Flagger to automatically roll back if error rates increase:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: my-app
spec:
strategy:
canary:
steps:
- setWeight: 5
- pause: {duration: 2m}
- setWeight: 20
- pause: {duration: 5m}
- setWeight: 50
- pause: {duration: 5m}
analysis:
templates:
- templateName: error-rate-check
startingStep: 1
args:
- name: service-name
value: my-app
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: error-rate-check
spec:
  args:
  - name: service-name
  metrics:
- name: error-rate
interval: 60s
successCondition: result[0] < 0.05
provider:
prometheus:
address: http://prometheus.monitoring:9090
query: |
sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
/
          sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
This automatically aborts the rollout if the 5xx error rate exceeds 5%, rolling back to the stable version without human intervention.
CI Integration Tests#
Run integration tests against a staging environment that mirrors production before deploying. The CI pipeline should block the production deploy if staging tests fail. This catches configuration mismatches, dependency issues, and obvious code bugs before they reach production.
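A minimal sketch of such a gate, assuming kubectl-based deploys, one kube context per environment, a GIT_SHA variable provided by CI, and a hypothetical ./run-integration-tests.sh test runner:
#!/usr/bin/env bash
set -euo pipefail
# Deploy the candidate image to staging and wait for it to become ready
kubectl --context staging -n staging set image deployment/my-app \
  my-container="registry.example.com/my-app:${GIT_SHA}"
kubectl --context staging -n staging rollout status deployment/my-app --timeout=300s
# Run integration tests against staging; a non-zero exit blocks the production deploy
./run-integration-tests.sh --base-url https://staging.example.com
# Only reached when tests pass -- promote the same image to production
kubectl --context production -n production set image deployment/my-app \
  my-container="registry.example.com/my-app:${GIT_SHA}"
kubectl --context production -n production rollout status deployment/my-app --timeout=300s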
Feature Flags#
Decouple deployment from release. Deploy the code with the feature behind a flag, then enable the flag gradually. If the feature causes problems, disable the flag without a deployment. This removes the deployment itself as a failure vector for new features.
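The mechanics depend on your flag system. As one minimal sketch, if flags live in a ConfigMap named feature-flags (a hypothetical name) that the application re-reads at runtime, disabling a misbehaving feature is a one-line change:
# Turn the flag off without redeploying
kubectl patch configmap feature-flags -n production \
  --type merge -p '{"data":{"new-checkout-flow":"false"}}'
# If the app only reads config at startup, restart the pods to pick up the change
kubectl rollout restart deployment/my-app -n production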