Canary Deployments Deep Dive#

A canary deployment sends a small percentage of traffic to a new version of your application while the majority continues hitting the stable version. You monitor the canary for errors, latency regressions, and business metric anomalies. If the canary is healthy, you gradually increase its traffic share until it handles 100%. If something is wrong, you roll back with minimal user impact.

Why Canary Over Rolling Update#

A standard Kubernetes rolling update replaces pods one by one until all pods run the new version. The problem is timing. By the time you notice a bug in your monitoring dashboards, the rolling update may have already replaced most or all pods. Every user is now hitting the broken version.

A canary deployment starts at 5% or 10% of traffic. If the new version has a memory leak, crashes under load, or produces incorrect responses, only that small slice of users is affected. You have time to observe the canary before deciding to proceed or abort. This is the fundamental difference: with a rolling update you usually discover problems only after the new version has taken over, while a canary surfaces them while most traffic is still on the stable version.
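
The traffic-splitting idea itself does not require special tooling. As a rough sketch, you can approximate a canary with two plain Deployments behind one Service, splitting traffic by replica ratio. The names below are illustrative and assume a separate api-server-stable Deployment running 9 replicas with the same app label; you get coarse percentages only, with no automated analysis or rollback, which is exactly what the tools in the rest of this section add.

# Coarse canary with plain Kubernetes objects (illustrative names).
# Both Deployments carry the label app: api-server, so the Service
# spreads traffic across stable and canary pods by replica count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-canary
spec:
  replicas: 1                      # 1 canary pod next to 9 stable pods ≈ 10% of traffic
  selector:
    matchLabels:
      app: api-server
      track: canary
  template:
    metadata:
      labels:
        app: api-server
        track: canary
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:v2.1.0
---
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server                # matches stable and canary pods alike
  ports:
    - port: 80
      targetPort: 8080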

Argo Rollouts#

Argo Rollouts replaces the standard Kubernetes Deployment resource with a Rollout resource that natively understands canary and blue-green deployment strategies.

Basic Rollout Resource#

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:v2.1.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: api-server-canary
      stableService: api-server-stable
      trafficRouting:
        nginx:
          stableIngress: api-server-ingress
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

This configuration routes 10% of traffic to the canary, waits five minutes, increases to 25%, waits again, increases to 50%, waits ten minutes, and then promotes to 100%. During each pause, you can observe metrics manually or use automated analysis.
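
A pause step with no duration halts the rollout indefinitely at the current weight until someone promotes it manually, which is useful when you want an explicit human sign-off before the final jump. A minimal variation of the steps above:

steps:
  - setWeight: 10
  - pause: {}          # no duration: wait here until the rollout is promoted manually
  - setWeight: 100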

Argo Rollouts works with nginx ingress, Istio, AWS ALB, Traefik, and the Kubernetes Gateway API for traffic splitting. Each integration has its own trafficRouting configuration, but the step definitions remain the same.
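
For example, switching the Rollout above from nginx to Istio only changes the trafficRouting block. The names below are assumptions: the VirtualService and its named route must already exist, and Argo Rollouts adjusts their weights as the rollout progresses.

trafficRouting:
  istio:
    virtualService:
      name: api-server-vsvc        # existing VirtualService managed by Argo Rollouts
      routes:
        - primary                  # named HTTP route whose canary/stable weights are adjusted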

AnalysisTemplate for Automated Decisions#

Manual observation does not scale. AnalysisTemplates define success criteria that Argo Rollouts evaluates automatically:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.5
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m]))
              by (le)
            )

This template checks two conditions every 60 seconds for 5 iterations: the error rate must be below 1%, and p99 latency must be below 500ms. If either metric fails more than twice, the rollout aborts automatically.

Wire the analysis into your rollout steps:

steps:
  - setWeight: 10
  - analysis:
      templates:
        - templateName: success-rate
      args:
        - name: service-name
          value: api-server-canary
  - setWeight: 50
  - pause: { duration: 10m }
  - setWeight: 100

After setting traffic to 10%, the analysis runs. If it passes, traffic increases to 50%. If it fails, the rollout aborts and traffic shifts back to the stable version automatically.
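
Analysis can also be attached at the canary level instead of as a discrete step, so it runs continuously in the background for the whole rollout and aborts it on failure at any point. A sketch reusing the same template:

strategy:
  canary:
    analysis:
      templates:
        - templateName: success-rate
      startingStep: 1              # start background analysis after the first setWeight
      args:
        - name: service-name
          value: api-server-canary
    steps:
      - setWeight: 10
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 10m }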

Experiments#

Argo Rollouts supports experiments that run the new version alongside stable without shifting production traffic. An experiment creates temporary ReplicaSets and AnalysisRuns, compares metrics between the two versions, and reports whether the new version is safe to promote. This is useful for testing changes that are high-risk or difficult to revert.
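
An experiment can also run as a step inside the canary strategy. The sketch below launches a baseline and a canary ReplicaSet for ten minutes and evaluates them with the success-rate template from earlier; treat the argument value as an assumption, since in practice you would point the analysis at whatever service or labels the experiment pods expose.

steps:
  - experiment:
      duration: 10m
      templates:
        - name: baseline
          specRef: stable          # pod spec copied from the stable version
        - name: canary
          specRef: canary          # pod spec copied from the new version
      analyses:
        - name: success-rate
          templateName: success-rate
          args:
            - name: service-name
              value: api-server-canary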

Flagger#

Flagger takes a different approach. Instead of replacing the Deployment resource, Flagger works alongside existing Deployments. You create a Canary custom resource that references your Deployment, and Flagger manages the canary process automatically:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://api-server-canary:8080/health | grep ok"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://api-server-canary:8080/"

When you update the Deployment’s image tag (or any other field in its pod template), Flagger detects the change and begins the canary process. It starts at 0% canary traffic, runs the pre-rollout webhook (smoke test), then increments traffic by 10% every minute until reaching 50%. At each step, it checks that the request success rate stays above 99% and request duration stays below 500ms.

Flagger supports Istio, Linkerd, nginx, Contour, Gloo, and the Gateway API as traffic routing providers. The built-in metrics (request-success-rate, request-duration) work automatically with Istio and Linkerd. For nginx ingress, you configure custom Prometheus queries.
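
Custom queries live in a MetricTemplate resource that the canary references by templateRef. A sketch for an error-rate check against nginx ingress controller metrics, where the namespace, Prometheus address, ingress name, and 1% threshold are all assumptions to adapt:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: nginx-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    sum(rate(nginx_ingress_controller_requests{ingress="api-server",status=~"5.."}[1m]))
    /
    sum(rate(nginx_ingress_controller_requests{ingress="api-server"}[1m]))
    * 100

Reference it from the Canary resource in place of the built-in metrics:

metrics:
  - name: error-rate
    templateRef:
      name: nginx-error-rate
      namespace: ingress-nginx
    thresholdRange:
      max: 1
    interval: 1m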

Webhooks#

Flagger webhooks enable pre-rollout smoke tests, rollout load tests, and post-rollout notifications. The pre-rollout webhook runs before any traffic is shifted – if it fails, the canary never receives traffic. The rollout webhook runs during canary analysis, which is the right place for load testing to ensure the canary has enough traffic for meaningful metrics.
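
A post-rollout hook fires after the canary has been promoted or rolled back, which makes it a convenient place for notifications or follow-up jobs. The URL below is a placeholder:

webhooks:
  - name: notify-team
    type: post-rollout
    url: http://hooks.example.com/canary-finished    # hypothetical notification endpoint
    timeout: 10s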

Choosing Between Argo Rollouts and Flagger#

Argo Rollouts offers more features. It supports both canary and blue-green strategies, has a built-in UI that integrates with ArgoCD, and provides fine-grained control over each step with pause, analysis, and experiment blocks. The tradeoff is that it replaces the Deployment resource with its own CRD, which means existing tooling that targets Deployments needs adjustment.

Flagger works with existing Deployments without modification. It integrates tightly with Flux for GitOps workflows and supports more ingress providers out of the box. It is a better fit if you want canary deployments without changing your resource definitions or if your GitOps pipeline is built around Flux.

Both tools use Prometheus for metrics analysis. Both support automated rollback. The decision usually comes down to your existing toolchain: ArgoCD shops lean toward Argo Rollouts, Flux shops lean toward Flagger.

Metrics for Canary Analysis#

Effective canary analysis requires the right metrics:

Error rate is the primary signal. Calculate it as HTTP 5xx responses divided by total responses. An error rate threshold of 1% is common, though critical services may use 0.1%.

Latency catches performance regressions. Use p99 (99th percentile) rather than average, because averages hide tail latency spikes that affect real users. A p99 threshold of 500ms is a reasonable starting point, adjusted to your service’s SLA.

Saturation (CPU and memory usage) detects resource leaks that may not immediately manifest as errors or latency. If the canary’s memory usage is climbing steadily while stable is flat, something is wrong even if requests are succeeding.
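
Saturation checks plug into the same analysis machinery as error rate and latency. A sketch of an extra Argo Rollouts metric that fails if the canary's working-set memory climbs past a ceiling; the pod regex, container name, and 512Mi limit are placeholders, and in practice you would scope the selector to the canary ReplicaSet's pods:

- name: canary-memory
  interval: 60s
  count: 10
  successCondition: result[0] < 536870912       # 512Mi ceiling
  failureLimit: 2
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        max(container_memory_working_set_bytes{pod=~"api-server-.*", container="api-server"})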

Business metrics are the ultimate measure. Order completion rate, payment success rate, search result quality – these catch bugs that pass infrastructure-level metrics. If the canary returns HTTP 200 but with incorrect data, error rate and latency look fine while users see broken functionality.
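
Business metrics fit the same pattern when your application exports them to Prometheus. A sketch with hypothetical counters (orders_completed_total and orders_created_total stand in for whatever your service actually emits):

- name: order-completion-rate
  interval: 120s
  count: 5
  successCondition: result[0] > 0.95
  failureLimit: 1
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(orders_completed_total{service="api-server-canary"}[5m]))
        /
        sum(rate(orders_created_total{service="api-server-canary"}[5m]))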

Common Gotchas#

Database migrations and canary deployments conflict. During a canary deployment, both the old and new versions of your application run simultaneously. If the new version requires a schema change, you must use the expand-contract pattern: first deploy a migration that adds new columns or tables without removing old ones (expand), then deploy the new application code, then remove the old columns after the old version is fully retired (contract). Running a destructive migration while the old version still needs the old schema causes failures.

Session affinity creates inconsistent user experiences. If a user’s requests are not pinned to either canary or stable, they may see the new version on one request and the old version on the next. This is especially problematic for UI changes where the page flickers between old and new designs. Use consistent hashing on user ID or session cookie to ensure each user sees only one version.
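
Recent Flagger versions expose a sessionAffinity setting for some routing providers: Flagger sets a cookie on responses served during the canary run so that returning clients keep hitting the same side. Treat the field names and provider support as something to verify against your Flagger version before relying on it:

analysis:
  interval: 1m
  maxWeight: 50
  stepWeight: 10
  sessionAffinity:
    cookieName: flagger-cookie     # cookie used to pin a client to one side of the split
    maxAge: 21600                  # how long the pin lasts, in seconds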

Insufficient traffic invalidates metrics. A canary receiving 10% of traffic for a service that handles 10 requests per minute means 1 request per minute to the canary. A single failed request looks like a 100% error rate. Canary analysis needs statistical significance. If your service has low traffic, extend the analysis interval, lower the canary percentage, or use synthetic load testing to supplement real traffic.
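
As a rough illustration, a low-traffic service might stretch the earlier error-rate metric so each measurement covers enough requests to be meaningful; the numbers are illustrative, not a recommendation:

- name: error-rate
  interval: 5m           # each measurement spans a longer window of real traffic
  count: 6               # 30 minutes of analysis instead of 5
  successCondition: result[0] < 0.01
  failureLimit: 1
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{service="api-server-canary",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{service="api-server-canary"}[5m]))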

Rollback does not mean zero impact. Even with fast automated rollback, users who hit the canary during the bad period experienced errors. Canary deployments reduce blast radius, they do not eliminate it. Pair canary deployments with feature flags for an additional layer of safety – deploy the code behind a flag, then enable the flag through the canary process.