Scenario: Preparing for and Handling a Traffic Spike#
You are helping when someone says: “we have a big launch next week,” “Black Friday is coming,” or “traffic is suddenly 3x normal and climbing.” These are two distinct problems – proactive preparation for a known event and reactive response to an unexpected surge – but they share the same infrastructure mechanics.
The key principle: Kubernetes autoscaling has latency. HPA takes 15-30 seconds to detect increased load and scale pods. Cluster Autoscaler takes 3-7 minutes to provision new nodes. If your traffic spike is faster than your scaling speed, users hit errors during the gap. Proactive preparation eliminates this gap. Reactive response minimizes it.
Part A – Proactive Preparation (Known Upcoming Spike)#
You know the traffic is coming. Maybe it is a product launch, a marketing campaign, a seasonal event, or a planned load test. You have days or weeks to prepare.
Step 1 – Capacity Assessment#
Start by understanding where you are and where you need to be.
# Current pod counts and resource usage
kubectl top pods -n production --sort-by=cpu
kubectl get hpa -n production
# Current node capacity and usage
kubectl top nodes
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_CAPACITY:.status.capacity.cpu,\
MEM_CAPACITY:.status.capacity.memory,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory
# Current requests per second (baseline)
sum(rate(http_requests_total[5m]))
# Current p99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# CPU headroom: how much of requests are actually used
1 - (
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))
/
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"})
)
Now estimate the spike. A 2x spike is manageable with autoscaling alone. A 5x spike needs pre-scaling. A 10x spike needs pre-scaling plus architectural changes (caching, CDN, read replicas).
Identify Bottlenecks#
The application tier is rarely the only bottleneck. Walk through the full request path:
- Ingress controller: does it have enough replicas? NGINX workers per pod?
- Application pods: can they scale fast enough? What is the startup time?
- Database: connection pool exhaustion is the most common failure point. If you have 10 pods with a pool of 20 connections each and you scale to 50 pods, that is 1,000 connections against a database configured for 200. (A quick headroom check is sketched after this list.)
- External APIs: do they have rate limits you will hit at higher traffic?
- Cache: if cache is cold after scaling, the database takes the full load during warm-up.
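A quick pre-event check for the database bullet is to compare current connections against the server's limit. A minimal sketch, assuming PostgreSQL scraped by postgres_exporter (the metric names pg_stat_activity_count and pg_settings_max_connections come from that exporter; adjust for your own setup):
# Fraction of the PostgreSQL connection limit currently in use
sum(pg_stat_activity_count)
/
max(pg_settings_max_connections)
# Then check: planned pod count x per-pod pool size must stay well under
# max_connections, leaving room for admin and replication connections.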
Step 2 – Pre-Scale#
Do not rely on autoscaling to respond in time. Pre-warm the infrastructure before the event.
# Increase HPA minimums to pre-warm pod count
# Current: minReplicas=3, you expect 5x traffic, so set minReplicas to 15
kubectl patch hpa my-app -n production \
-p '{"spec":{"minReplicas":15}}'
# Pre-scale the ingress controller too
kubectl patch hpa ingress-nginx-controller -n ingress-system \
-p '{"spec":{"minReplicas":4}}'
# Pre-scale the node pool (cloud-specific)
# AWS EKS:
aws eks update-nodegroup-config \
--cluster-name production \
--nodegroup-name main-pool \
--scaling-config minSize=10,maxSize=30,desiredSize=15
# GKE:
gcloud container clusters resize production \
--node-pool main-pool \
--num-nodes 15
# AKS:
az aks nodepool scale \
--resource-group my-rg \
--cluster-name production \
--name mainpool \
--node-count 15
Pre-scale dependencies too:
# Database: increase connection pool or add read replicas
# Redis: scale up if using Redis for sessions or caching
kubectl scale statefulset redis -n production --replicas=3
# If using a managed database, increase instance size or add read replicas
# AWS RDS example:
aws rds create-db-instance-read-replica \
--db-instance-identifier production-read-1 \
--source-db-instance-identifier production-primary
Step 3 – Verify Scaling Works#
Never trust autoscaling configuration without testing it. Run a load test at the expected traffic level.
# Using k6 for load testing
cat > loadtest.js << 'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 100 }, // ramp up
{ duration: '5m', target: 500 }, // hold at 5x
{ duration: '2m', target: 1000 }, // push to 10x
{ duration: '5m', target: 1000 }, // hold at 10x
{ duration: '3m', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p(99)<500'], // 99th percentile under 500ms
http_req_failed: ['rate<0.01'], // less than 1% errors
},
};
export default function () {
const res = http.get('https://my-app.example.com/api/health');
check(res, { 'status 200': (r) => r.status === 200 });
sleep(0.1);
}
EOF
k6 run loadtest.js
While the test runs, watch scaling behavior in a separate terminal:
# Watch pod scaling
kubectl get pods -l app=my-app -n production -w
# Watch HPA decisions
kubectl describe hpa my-app -n production
# Watch node provisioning
kubectl get nodes -w
# Watch for scheduling failures
kubectl get events -n production --field-selector reason=FailedScheduling -w
What to look for:
- Pod startup time: if pods take 60 seconds to start and pass readiness probes, there is a 60-second window where scaling cannot keep up with a sudden spike (a quick way to measure this is sketched after this list).
- Node provisioning time: Cluster Autoscaler typically takes 3-5 minutes to provision new nodes. During this window, pods sit in Pending state.
- Cascade failures: scaling the app may overload the database, which causes app errors, which causes retries, which makes everything worse.
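To put a number on the first point, compare each pod's creation time to its Ready transition. A rough sketch for recently started pods, assuming the app=my-app label used elsewhere in this doc and jq installed locally:
# Seconds from pod creation to the Ready condition (meaningful for freshly scaled pods;
# for long-running pods the Ready transition may reflect a later restart instead)
kubectl get pods -l app=my-app -n production -o json | jq -r '
  .items[] |
  (.status.conditions[]? | select(.type == "Ready") | .lastTransitionTime) as $ready |
  "\(.metadata.name): \(($ready | fromdateiso8601) - (.metadata.creationTimestamp | fromdateiso8601))s to Ready"'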
Step 4 – During the Spike#
Even with preparation, monitor actively during the event.
# Key metrics to watch
# Pod count and HPA status
kubectl get hpa -n production -w
# Error rate (should stay below threshold)
# Prometheus: sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m]))
# Latency (should stay within SLA)
# Prometheus: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))
# Node utilization
kubectl top nodes
Be ready for manual intervention:
# If HPA is scaling too slowly, manually increase replicas
kubectl scale deployment my-app -n production --replicas=30
# If nodes are full and autoscaler is slow, add nodes manually
# AWS:
aws autoscaling set-desired-capacity \
--auto-scaling-group-name eks-main-pool-xxxx \
--desired-capacity 20
# If database connections are being exhausted, scale down non-critical workloads
kubectl scale deployment analytics-worker -n production --replicas=0
Watch for these failure modes during the spike:
- OOMKilled pods: the memory limit is too low for the traffic volume (more connections = more memory). Increase the limit (a detection sketch follows this list).
- Database connection exhaustion: more pods than the database can handle. Solution: reduce per-pod connection pool size, add PgBouncer or ProxySQL, or add read replicas.
- Rate limiting from external services: third-party APIs returning 429s. Solution: implement circuit breakers, add caching, or request rate limit increases in advance.
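For the OOMKilled case, here is one way to spot affected pods and raise the limit; a sketch with illustrative values (my-app and the memory sizes are examples, not recommendations):
# List pods whose last container termination was an OOM kill
kubectl get pods -n production -o json | jq -r '
  .items[]
  | select([.status.containerStatuses[]?.lastState.terminated.reason] | index("OOMKilled"))
  | .metadata.name'
# Raise the memory limit on the affected deployment (example values; this triggers a rolling restart)
kubectl set resources deployment/my-app -n production \
  --requests=memory=512Mi --limits=memory=1Gi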
Step 5 – After the Spike#
Do not scale down immediately. Traffic often has a long tail, and aggressive scale-down followed by another surge causes worse instability than maintaining extra capacity for a few hours.
# Reduce HPA minimums gradually over hours/days
# Day of event: minReplicas=15
# Day after: minReplicas=8
# Two days after: minReplicas=3 (back to normal)
kubectl patch hpa my-app -n production \
-p '{"spec":{"minReplicas":8}}'
# Reduce node pool minimum
aws eks update-nodegroup-config \
--cluster-name production \
--nodegroup-name main-pool \
--scaling-config minSize=5,maxSize=30,desiredSize=8
Post-event analysis:
- What was the actual peak traffic? How did it compare to the estimate? (See the query sketch after this list.)
- Did autoscaling engage? At what point?
- Were there any errors during the ramp-up period?
- What was the bottleneck? (usually the database)
- How much did the event cost in extra infrastructure?
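For the first question, a Prometheus subquery recovers the actual peak; a sketch assuming the same http_requests_total metric used above and an event that fits within the last 24 hours:
# Peak request rate over the event window (here: the last 24 hours, at 1m resolution)
max_over_time(sum(rate(http_requests_total[5m]))[24h:1m])
# Divide by your pre-event estimate to see how far off the forecast was.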
Part B – Reactive Response (Unexpected Spike)#
Traffic is surging right now and you did not plan for it. Time is critical.
Step 1 – Immediate Assessment (First 2 Minutes)#
# Is HPA already scaling?
kubectl describe hpa my-app -n production
# Look for "ScalingActive" condition and recent events
# "New size: 8; reason: cpu resource utilization above target" = HPA is working
# "ScalingLimited" = HPA hit maxReplicas and cannot scale further
# Are pods in Pending state (node capacity exhausted)?
kubectl get pods -n production --field-selector=status.phase=Pending
# What is the current error rate? (the Prometheus queries later in this doc give the number; eyeballing recent logs is a quick proxy)
kubectl logs deployment/my-app -n production --tail=20
Step 2 – Immediate Scaling Actions#
# If HPA exists but maxReplicas is too low, increase it
kubectl patch hpa my-app -n production \
-p '{"spec":{"maxReplicas":50}}'
# If no HPA exists, manually scale
kubectl scale deployment my-app -n production --replicas=20
# If nodes are full, increase node pool size
# AWS:
aws autoscaling set-desired-capacity \
--auto-scaling-group-name eks-main-pool-xxxx \
--desired-capacity 15
# GKE:
gcloud container clusters resize production \
--node-pool main-pool \
--num-nodes 15 --quiet
Step 3 – Protect the System (Graceful Degradation)#
If the system cannot handle the full load even after scaling, shed non-critical load to protect core functionality.
# Enable rate limiting at ingress level
# If using NGINX Ingress, apply rate limiting annotation
kubectl annotate ingress my-app -n production \
nginx.ingress.kubernetes.io/limit-rps="100" \
nginx.ingress.kubernetes.io/limit-connections="50" \
--overwrite
# Scale down non-critical workloads to free resources
kubectl scale deployment analytics-worker -n production --replicas=0
kubectl scale deployment report-generator -n production --replicas=0
kubectl scale deployment email-sender -n production --replicas=1
If the database is the bottleneck (the most common case during unexpected spikes):
# Reduce per-pod connection pool size to stop connection exhaustion
# This requires a config change or environment variable update
kubectl set env deployment/my-app -n production \
DB_POOL_SIZE=5 # Down from default of 20
# If you have a connection pooler like PgBouncer, scale it
kubectl scale deployment pgbouncer -n production --replicas=3
# Enable read replicas if available
kubectl set env deployment/my-app -n production \
DB_READ_HOST=db-read-replica.internal
Step 4 – Scale Infrastructure Layer#
If the Cluster Autoscaler is not fast enough (it takes 3-7 minutes), manually provision capacity:
# Check Cluster Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# If the autoscaler is active but slow, increase the node pool's desired size directly
# (same cloud commands as in Step 2 above); this bypasses the autoscaler's deliberation time
# Check pending pods to understand how much capacity is needed
kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o json | \
  jq '{
    pending_pods: (.items | length),
    cpu_requests: [.items[].spec.containers[].resources.requests.cpu // "0"]
  }'
HPA Configuration for Spike Readiness#
For services that regularly face traffic spikes, configure HPA to scale up aggressively and scale down conservatively:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60 # Lower target = earlier scaling
behavior:
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately, no waiting
policies:
- type: Percent
value: 100 # Double pod count per 60s if needed
periodSeconds: 60
- type: Pods
value: 10 # Or add up to 10 pods per 60s
periodSeconds: 60
selectPolicy: Max # Use whichever allows faster scaling
scaleDown:
stabilizationWindowSeconds: 600 # Wait 10 minutes before scaling down
policies:
- type: Percent
value: 10 # Remove at most 10% per 60s
periodSeconds: 60
selectPolicy: Min # Use whichever is more conservative
Key choices explained:
- 60% CPU target instead of 70-80%: leaves headroom for traffic bursts within the HPA check interval. If you target 80% and traffic doubles between checks, pods hit 160% (throttling) before HPA reacts.
- stabilizationWindowSeconds: 0 for scale-up: react immediately to increased load. Never delay scaling up.
- stabilizationWindowSeconds: 600 for scale-down: avoid thrashing. A 10-minute window means the HPA waits until the load has been consistently lower for 10 minutes before removing pods. This handles bursty traffic patterns.
- selectPolicy: Max for scale-up, Min for scale-down: asymmetric aggressiveness. Scale up as fast as possible, scale down as slowly as reasonable.
Monitoring Queries for Traffic Spikes#
These Prometheus queries give you the critical signals during a traffic event:
# Current requests per second vs 1 hour ago
sum(rate(http_requests_total[1m]))
/
sum(rate(http_requests_total[1m] offset 1h))
# Result > 2 means traffic has doubled
# Error rate (should stay below 1%)
sum(rate(http_requests_total{status=~"5.."}[1m]))
/
sum(rate(http_requests_total[1m]))
# p99 latency trend
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))
# Pod count vs HPA max (how close to ceiling)
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="my-app"}
/
kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="my-app"}
# Result > 0.8 means you are approaching the HPA ceiling
# Node allocatable vs requested (cluster headroom)
1 - (
sum(kube_pod_container_resource_requests{resource="cpu"})
/
sum(kube_node_status_allocatable{resource="cpu"})
)
# Result < 0.1 means less than 10% cluster headroom -- new pods may go Pending
Set up alerts on these metrics before any planned event:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: traffic-spike-alerts
namespace: monitoring
spec:
groups:
- name: traffic-spike
rules:
- alert: HPANearMaxReplicas
expr: |
kube_horizontalpodautoscaler_status_current_replicas
/
kube_horizontalpodautoscaler_spec_max_replicas > 0.8
for: 2m
labels:
severity: warning
annotations:
summary: "HPA {{ $labels.horizontalpodautoscaler }} is at {{ $value | humanizePercentage }} of max replicas"
- alert: HighErrorRateDuringSpike
expr: |
sum(rate(http_requests_total{status=~"5.."}[2m]))
/
sum(rate(http_requests_total[2m])) > 0.02
for: 1m
labels:
severity: critical
annotations:
summary: "Error rate is {{ $value | humanizePercentage }} during traffic spike"
- alert: ClusterCapacityLow
expr: |
1 - (
sum(kube_pod_container_resource_requests{resource="cpu"})
/
sum(kube_node_status_allocatable{resource="cpu"})
) < 0.15
for: 5m
labels:
severity: warning
annotations:
summary: "Cluster CPU headroom is only {{ $value | humanizePercentage }}"The difference between surviving a traffic spike and going down during one usually comes down to preparation. Load test your scaling behavior before you need it. Know your bottleneck before traffic finds it for you.