Blue-Green Deployments#

A blue-green deployment runs two identical production environments. One (blue) serves live traffic. The other (green) is idle or running the new version. When the green environment passes validation, you switch traffic from blue to green. If something goes wrong, you switch back. The old environment stays running until you are confident the new version is stable.

The fundamental advantage over rolling updates is atomicity. Traffic switches from 100% old to 100% new in a single operation, so there is no extended period where some users see the old version and others see the new one (with the caveat that DNS-based switching, covered below, is not truly atomic).

Traffic Switching Mechanisms#

The mechanism you use to switch traffic determines your rollback speed, blast radius during the switch, and operational complexity.

DNS-Based Switching#

The simplest approach: update a DNS record to point from the blue environment’s IP to the green environment’s IP.

# Switch from blue (10.0.1.100) to green (10.0.2.100)
aws route53 change-resource-record-sets --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "10.0.2.100"}]
      }
    }]
  }'

DNS switching is simple but slow. Even with a 60-second TTL, DNS caches across the internet do not always respect TTL values. Some resolvers cache aggressively, some clients cache their own DNS lookups, and Java applications infamously cache DNS indefinitely unless explicitly configured otherwise. Expect a full traffic drain to take 5-15 minutes, sometimes longer. This makes DNS unsuitable when you need fast rollback.
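Because of this cache behavior, it is worth verifying what resolvers actually return before declaring the switch complete. A minimal sketch, assuming `dig` is available and using the green IP from the example above (the public resolvers queried here are illustrative):

```shell
# One-shot check of what two public resolvers currently return for the
# record. This only observes resolver-side caches; client-side caches
# (including the JVM's, which defaults to very aggressive caching) can
# still hold the old answer.
GREEN_IP="10.0.2.100"
for resolver in 8.8.8.8 1.1.1.1; do
  current=$(dig +short @"$resolver" api.example.com A | head -n1)
  if [ "$current" = "$GREEN_IP" ]; then
    echo "$resolver: switched ($current)"
  else
    echo "$resolver: still propagating ($current)"
  fi
done
```

Run this periodically after the record change; even once resolvers agree, budget extra time for clients with their own caches.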

Load Balancer Target Group Switching (AWS ALB)#

AWS ALB lets you swap target groups atomically. Both environments register with separate target groups, and you modify the listener rule to forward traffic to the new target group:

# Get current listener rule
aws elbv2 describe-rules --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456

# Switch to green target group
aws elbv2 modify-rule --rule-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/app/my-alb/abc123/def456/rule123 \
  --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green/xyz789

This switch is effectively instant for new connections. Existing connections continue to the old target group until they close naturally or the deregistration delay expires (default 300 seconds). Set the deregistration delay to match your application’s longest expected request duration.
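Tuning the deregistration delay is a one-line attribute change; a sketch, assuming the green target group ARN from the example above:

```shell
# Set the deregistration delay to 120 seconds so the ALB waits up to two
# minutes for in-flight requests to the old target group to finish before
# forcibly closing connections.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/green/xyz789 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120
```

Apply the same setting to both target groups so draining behaves identically in either switch direction.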

Kubernetes Service Selector Switching#

In Kubernetes, the simplest blue-green switch updates the Service selector to point to a different set of pods:

# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.0.0

---
# Green deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v2.0.0

---
# Service - switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Change to "green" to switch
  ports:
    - port: 80
      targetPort: 8080

Switch traffic with a single patch:

# Switch to green
kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}'

# Rollback to blue
kubectl patch service api -p '{"spec":{"selector":{"version":"blue"}}}'

The switch takes effect within seconds as kube-proxy reprograms iptables/IPVS rules on each node. New connections go to the green pods immediately, but there is no connection draining: established and keepalive connections continue to the blue pods, and if you scale blue down while those connections are open, clients may receive connection resets.
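It is worth confirming the switch actually landed by comparing the pod IPs behind the Service against the green deployment's pods; a sketch using standard kubectl queries:

```shell
# Pod IPs currently backing the Service (after the patch, these should
# match the green pods)
kubectl get endpoints api -o jsonpath='{.subsets[*].addresses[*].ip}'; echo

# Pod IPs of the green deployment, for comparison
kubectl get pods -l app=api,version=green \
  -o jsonpath='{.items[*].status.podIP}'; echo
```

If the two lists differ, the selector patch has not propagated (or the green pods are not yet Ready and so are excluded from the endpoints).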

Ingress Controller Switching#

NGINX Ingress Controller's canary annotations, designed for gradual rollouts, can also implement blue-green: set the canary weight to 0 or 100 so the switch is all-or-nothing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "100"  # 100 = all traffic to green
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-green
                port:
                  number: 80
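With this setup, switch and rollback are annotation changes. A sketch, assuming a second, non-canary Ingress for the same host points at `api-blue` (the canary Ingress above only overrides it):

```shell
# Send all traffic to green
kubectl annotate ingress api-canary \
  nginx.ingress.kubernetes.io/canary-weight="100" --overwrite

# Roll back: all traffic returns to the primary (blue) Ingress
kubectl annotate ingress api-canary \
  nginx.ingress.kubernetes.io/canary-weight="0" --overwrite
```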

Service Mesh (Istio)#

Istio VirtualService provides the most control. You can shift traffic with percentage-based weights and add header-based routing for pre-switch testing:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.example.com
  http:
    - match:
        - headers:
            x-preview:
              exact: "green"
      route:
        - destination:
            host: api-green
            port:
              number: 80
    - route:
        - destination:
            host: api-blue
            port:
              number: 80
          weight: 0
        - destination:
            host: api-green
            port:
              number: 80
          weight: 100

The header match lets you test the green environment with specific requests before switching all traffic. Istio also handles connection draining properly during the switch.
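Before flipping the weights, the header route lets you smoke-test green with ordinary requests; a sketch (the `/healthz` path is an assumed health endpoint, not part of the example above):

```shell
# Requests carrying the x-preview header are routed to api-green
# regardless of the current weight split, so green can be exercised
# end-to-end while blue still serves all regular traffic.
curl -fsS -H "x-preview: green" https://api.example.com/healthz
```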

Database Compatibility#

Database schema changes are the hardest part of blue-green deployments. Both environments must be able to work with the same database during and after the switch.

The expand-contract pattern is mandatory. Never ship a schema change that breaks the running blue environment, and never couple such a change to the same release as the code that depends on it.

Phase 1 – Expand: Add new columns, tables, or indexes. Do not remove or rename anything. Both old and new code work with the expanded schema.

-- Safe: additive change
ALTER TABLE orders ADD COLUMN shipping_method varchar(50) DEFAULT 'standard';

Phase 2 – Deploy new code: Switch traffic to green. The new code uses the new columns. The old code ignores them.

Phase 3 – Contract: After the rollback window has passed and the blue environment is decommissioned, remove the old columns.

-- Only after blue is fully retired
ALTER TABLE orders DROP COLUMN legacy_shipping_flag;

Destructive changes (dropping columns, renaming tables, changing column types) in the expand phase will break the blue environment and eliminate your rollback path.

Session Draining#

When switching traffic, in-flight requests need to complete gracefully. Without draining, users mid-transaction receive errors.

For Kubernetes, configure terminationGracePeriodSeconds on the pod spec and implement a preStop hook:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]

For AWS ALB, the deregistration delay setting controls how long the load balancer waits for in-flight requests to complete before forcefully closing connections. Set this to at least the duration of your longest expected request.

Sticky sessions (session affinity) add complexity. If users have sessions pinned to the blue environment, switching to green drops their sessions. Options: store sessions externally (Redis, DynamoDB), use cookie-based sessions that both environments can decrypt, or accept a brief session disruption during the switch.

Rollback Speed Comparison#

| Mechanism | Switch Time | Rollback Time | Connection Draining |
| --- | --- | --- | --- |
| DNS | 5-15 min | 5-15 min | None |
| ALB Target Group | Instant (new connections) | Instant (new connections) | Built-in |
| K8s Service Selector | 2-5 sec | 2-5 sec | None |
| Istio VirtualService | Sub-second | Sub-second | Built-in |

Blue-Green vs. Canary#

Blue-green switches all traffic at once. Canary gradually shifts traffic over time. The right choice depends on your priorities:

Choose blue-green when: you need instant rollback, your testing is thorough enough to validate before switching, you cannot tolerate inconsistent user experiences (some users on old, some on new), or you are deploying database changes that require all-or-nothing application code changes.

Choose canary when: you want to detect issues with real production traffic before full rollout, you have good observability and automated metrics analysis, your changes are incremental and do not require schema changes, or your user base is large enough that even 5% canary traffic provides meaningful signal.

Choose both together: deploy blue-green for the infrastructure switch, but use a canary-like phased approach within the green environment. Start by routing internal traffic to green for smoke testing, then switch external traffic. This gives you the atomicity of blue-green with the safety of pre-validation.

Cost Consideration#

Blue-green requires double the infrastructure during deployments. For cloud environments with on-demand pricing, this is a temporary cost spike. For on-premises infrastructure, it means provisioning and maintaining twice the capacity permanently. Many teams mitigate this by using the idle environment for pre-production testing, batch processing, or disaster recovery standby.