Why Break Things on Purpose#

Production systems fail in ways that test environments never reveal. Database connection pool exhaustion under load, a cascading timeout across three services, a DNS cache that masks a routing change until it expires – these failures only surface when real conditions collide in ways nobody predicted. Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they cause outages.

The key principle: chaos engineering is not random destruction. It is hypothesis-driven experimentation. You predict what should happen when a specific failure occurs, inject that failure in a controlled way, observe the actual behavior, and fix any gap between prediction and reality.

Step 1: Establish Your Steady State#

Before breaking anything, you must define what “working” looks like in measurable terms. The steady state is a set of metrics that indicate normal system behavior. These should be the same metrics backing your SLOs:

  • Request success rate (e.g., 99.9% of requests return 2xx)
  • p99 latency (e.g., under 500ms)
  • Queue depth or processing lag (e.g., consumer lag under 1000 messages)
  • Error rate by type (e.g., fewer than five 5xx errors per minute)

Document the steady state before the experiment. If you cannot define what “normal” looks like, you are not ready for chaos engineering – invest in observability first.

Steady State Hypothesis for payment-api:
- Availability: > 99.9% success rate (measured at load balancer)
- Latency: p99 < 500ms (measured at application)
- Downstream errors: < 0.1% of requests fail due to payment-provider timeout
- Recovery: Service returns to steady state within 60 seconds of fault removal
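
If you run the Prometheus Operator, the availability and latency SLIs above can be encoded as recording rules so the dashboards you watch during an experiment and any automated abort checks read the same numbers. A minimal sketch – the metric names (http_requests_total, http_request_duration_seconds_bucket) and the payment-api job label are assumptions about your instrumentation:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-api-steady-state
  namespace: staging
spec:
  groups:
    - name: payment-api-steady-state
      rules:
        # Success rate over the last 5 minutes (steady state: > 99.9%)
        - record: payment_api:success_ratio_5m
          expr: |
            sum(rate(http_requests_total{job="payment-api",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="payment-api"}[5m]))
        # p99 latency over the last 5 minutes (steady state: < 0.5s)
        - record: payment_api:latency_p99_5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="payment-api"}[5m])) by (le))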

Step 2: Design the Experiment#

A chaos experiment has five components:

  1. Hypothesis: “When we kill one of three payment-api pods, the remaining two pods absorb traffic with no user-visible errors and latency stays under 500ms p99.”
  2. Method: Kill one pod using kubectl delete pod.
  3. Blast radius: Only the payment-api service in the staging environment. No other services affected.
  4. Abort conditions: If error rate exceeds 1% or p99 latency exceeds 2 seconds, stop immediately.
  5. Rollback: The killed pod will be recreated by the Deployment controller. If it is not recreated within 60 seconds, manually scale the deployment.
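
The method and rollback above map directly onto a few kubectl commands; a minimal sketch, assuming the staging Deployment is named payment-api:

# Method: kill exactly one payment-api pod
POD=$(kubectl get pods -n staging -l app=payment-api -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" -n staging

# Rollback check: the Deployment controller should replace the pod within 60 seconds
kubectl rollout status deployment/payment-api -n staging --timeout=60s
kubectl get pods -n staging -l app=payment-api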

Start with the smallest possible experiment. Your first chaos experiment should not be “take down an availability zone.” It should be “kill one pod of a non-critical service in staging.”

Progression of experiment severity:

Level 1: Kill a single pod (tests Kubernetes self-healing)
Level 2: Inject network latency to a dependency (tests timeout configuration)
Level 3: Block network access to a dependency (tests circuit breakers)
Level 4: Fill a disk or exhaust memory (tests resource limits)
Level 5: Kill an entire node (tests pod anti-affinity and rescheduling)
Level 6: Simulate AZ failure (tests multi-AZ redundancy)

Step 3: Choose Your Tools#

Chaos Mesh (Kubernetes-native)#

Chaos Mesh runs as a set of controllers in your Kubernetes cluster. It uses Custom Resource Definitions to define experiments, making them declarative and version-controllable.

Install with Helm:

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock
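
Before defining any experiments, confirm the control plane is healthy – the chaos-controller-manager and chaos-daemon pods should all be Running:

kubectl get pods -n chaos-mesh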

Kill a pod experiment:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-payment-api
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-api
  duration: "30s"
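
Assuming the manifest is saved as pod-kill-payment-api.yaml, apply it and use describe to follow the experiment's events and status:

kubectl apply -f pod-kill-payment-api.yaml
kubectl describe podchaos pod-kill-payment-api -n chaos-mesh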

Inject network latency:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-to-database
  namespace: chaos-mesh
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-api
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: postgres
  duration: "2m"

Simulate memory stress:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-payment-api
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-api
  stressors:
    memory:
      workers: 1
      size: "256MB"
  duration: "1m"
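
While the stressor runs, watch memory consumption against the container's limit. Assuming metrics-server is installed, kubectl top reports it per container:

kubectl top pod -n staging -l app=payment-api --containers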

Litmus Chaos#

Litmus uses ChaosEngine and ChaosExperiment resources. It includes a hub of pre-built experiments and supports automated steady-state validation.

Install Litmus with Helm:

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace

Run a pod-delete experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-api-chaos
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
        probe:
          - name: check-availability
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-api.staging/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 2s

The httpProbe automatically validates steady state during the experiment. If the health check fails, Litmus records the experiment as failed.
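
The verdict and individual probe outcomes land in a ChaosResult resource, conventionally named after the engine and the experiment, so checking this run would look roughly like:

kubectl get chaosresult -n staging
kubectl describe chaosresult payment-api-chaos-pod-delete -n staging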

Chaos Monkey (Netflix)#

Chaos Monkey randomly terminates virtual machine instances in production. It originated as part of the Netflix Simian Army. For Kubernetes environments, kube-monkey is the closest equivalent:

helm repo add kubemonkey https://asobti.github.io/kube-monkey/charts/repo
# Kills are scheduled at runHour and executed at random between startHour and endHour
helm install kube-monkey kubemonkey/kube-monkey \
  --namespace kube-system \
  --set config.dryRun=false \
  --set config.runHour=8 \
  --set config.startHour=10 \
  --set config.endHour=16 \
  --set config.timeZone=America/New_York

Opt deployments in with labels:

metadata:
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/identifier: "payment-api"
    kube-monkey/mtbf: "3"           # mean time between failures in days
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"     # kill 1 pod at a time
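
kube-monkey looks for these labels on the Deployment itself, and the upstream examples also repeat kube-monkey/enabled and kube-monkey/identifier on the pod template so the victim pods can be matched. A sketch of where they sit – namespace, replica count, and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: staging
  labels:
    kube-monkey/enabled: "enabled"
    kube-monkey/identifier: "payment-api"
    kube-monkey/mtbf: "3"
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: "1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
        # repeated here so kube-monkey can identify this Deployment's pods
        kube-monkey/enabled: "enabled"
        kube-monkey/identifier: "payment-api"
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:1.0.0  # placeholder image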

Kube-monkey is less precise than Chaos Mesh or Litmus – it is random by design. Use it for ongoing resilience validation after you have fixed the issues found by targeted experiments.

Step 4: Control the Blast Radius#

Blast radius control is what separates chaos engineering from chaos. Every experiment must have boundaries.

Namespace isolation: Run experiments in staging or a dedicated chaos namespace first. Never start with production.

Label targeting: Target specific services, not entire namespaces. Use label selectors to restrict which pods are affected.

Duration limits: Always set a duration. An experiment without a timeout is a production incident.

Abort conditions: Define what triggers an immediate halt. Monitor SLIs during the experiment and abort if they cross thresholds:

# Chaos Mesh Schedule: recurring pod-kill (can be paused via annotation, shown below)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-chaos
spec:
  schedule: "0 14 * * 3"  # Wednesday at 2 PM
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-api
    duration: "30s"
  concurrencyPolicy: Forbid
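
Chaos Mesh also honors a pause annotation on its resources, which doubles as a manual abort switch that keeps the experiment definition intact. A sketch against the Schedule above (add -n if it lives in a dedicated namespace):

# Abort: pause the schedule without deleting it
kubectl annotate schedule weekly-pod-chaos experiment.chaos-mesh.org/pause=true

# Resume once the system is back at steady state
kubectl annotate schedule weekly-pod-chaos experiment.chaos-mesh.org/pause-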

Progressive scope: Start narrow and expand only after confidence builds:

  1. One pod of one service in staging
  2. Multiple pods of one service in staging
  3. One pod of one service in production (during business hours with full team awareness)
  4. Cross-service experiments in staging
  5. Node-level experiments in staging
  6. Production experiments with automated abort conditions

Step 5: Run the Experiment#

Before the experiment:

  • Notify the team. Everyone should know chaos is happening.
  • Open your monitoring dashboards. Watch SLIs in real time.
  • Confirm the abort procedure. Who hits the stop button, and how?
  • Verify the experiment targets the correct resources.
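
That last check is effectively a dry run of the selector – list exactly what the experiment's labels match and confirm nothing unexpected is in the set:

kubectl get pods -n staging -l app=payment-api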

During the experiment:

  • Watch the steady state metrics continuously.
  • Note any unexpected behavior, even if metrics stay within bounds.
  • If abort conditions are met, stop immediately. Do not push through.

After the experiment:

  • Document the results: did the system behave as hypothesized?
  • If the hypothesis held, increase scope or severity next time.
  • If the hypothesis failed, you found a real weakness. File a bug, fix it, and re-run the experiment after the fix.

Step 6: Build a GameDay Practice#

GameDays are scheduled sessions where the team runs chaos experiments together. They serve two purposes: validating system resilience and training the team to respond to failures.

A GameDay agenda:

09:00 - Brief: Today's experiments, hypotheses, abort conditions
09:15 - Experiment 1: Kill payment-api pod (expected: no user impact)
09:30 - Experiment 2: 500ms latency to database (expected: p99 rises, no errors)
09:45 - Experiment 3: Block network to cache layer (expected: fallback to database, slower responses)
10:00 - Experiment 4: Exhaust payment-api memory (expected: OOM kill, pod restart, brief error spike)
10:30 - Debrief: Results, surprises, action items

Run GameDays quarterly at minimum. Monthly is better. Rotate who designs the experiments so different team members develop chaos engineering skills.

Maturity Model#

Level 1 – Ad hoc: Manual kubectl delete pod in staging. No steady state definition. Results communicated verbally.

Level 2 – Defined: Chaos Mesh or Litmus installed. Experiments are YAML files in version control. Steady state documented. Results tracked in a shared doc.

Level 3 – Automated: Experiments run on a schedule in staging. Automated abort conditions based on SLI thresholds. Results feed into reliability dashboards.

Level 4 – Production: Targeted experiments run in production during business hours. Blast radius is tightly controlled. The team trusts the abort mechanisms.

Level 5 – Continuous: Chaos experiments run continuously in production. Kube-monkey or equivalent randomizes failures. New services are automatically included after passing production readiness review. Experiment results drive SLO adjustments.

Most teams should aim for Level 3 within six months and Level 4 within a year. Level 5 requires deep organizational trust in the tooling and process.