Pipeline Observability#
You cannot improve what you do not measure. Most teams have detailed monitoring for their production applications but treat their CI/CD pipelines as black boxes. When builds are slow, flaky, or failing, the response is anecdotal – “builds feel slow lately” – rather than data-driven. Pipeline observability turns CI/CD from a cost center you tolerate into infrastructure you actively manage.
Core CI/CD Metrics#
Build Duration#
Total time from pipeline trigger to completion. Track this as a histogram, not an average, because averages hide bimodal distributions. A pipeline that takes 5 minutes for code-only changes and 25 minutes for dependency updates averages 15 minutes, which describes neither case accurately.
Break duration into segments: checkout time, dependency install, compilation, test execution, artifact upload, deployment. This reveals which phase to optimize. A pipeline spending 8 of 12 minutes downloading dependencies has a caching problem, not a test speed problem.
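A minimal sketch of that breakdown, assuming per-stage timings are stored in a Postgres events database with a hypothetical pipeline_stages table (the table and column names are illustrative):
-- Hypothetical pipeline_stages(pipeline_id, stage, started_at, finished_at) table.
-- Median and p95 duration per stage over the last 7 days.
SELECT
  stage,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY finished_at - started_at) AS median_duration,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY finished_at - started_at) AS p95_duration
FROM pipeline_stages
WHERE started_at > NOW() - INTERVAL '7 days'
GROUP BY stage
ORDER BY p95_duration DESC;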
Queue Time#
Time between a pipeline being triggered and a runner picking it up. High queue time means insufficient runner capacity. Track this separately from build duration because they have different solutions: build duration requires pipeline optimization, queue time requires more runners or better scheduling.
# Queue time by runner label (Prometheus)
histogram_quantile(0.95,
  sum(rate(ci_job_queue_duration_seconds_bucket{runner_type="self-hosted"}[1h]))
  by (le, runner_label)
)
Failure Rate#
Percentage of pipeline runs that fail. Break this down by failure type: test failure, infrastructure failure (runner crash, OOM, network timeout), configuration error, flaky test. Infrastructure failures are your responsibility; test failures are the developer’s responsibility. Conflating them produces a metric nobody owns.
# Failure rate by branch type
sum(rate(ci_pipeline_completed_total{status="failed"}[24h])) by (branch_type)
/
sum(rate(ci_pipeline_completed_total[24h])) by (branch_type)
Flaky Test Rate#
Percentage of test failures that pass on retry. Flaky tests erode developer trust in CI. When developers stop trusting test results, they start ignoring failures, and real bugs slip through. Track flakiness per test, not per pipeline, so you can identify and fix the worst offenders.
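A sketch of a per-test flakiness query, assuming test results (including retry attempts) land in a hypothetical test_results table in Postgres:
-- Hypothetical test_results(test_name, pipeline_id, attempt, status, run_at) table.
-- A test flaked in a pipeline if it failed on the first attempt but passed on a retry.
SELECT
  test_name,
  COUNT(*) FILTER (WHERE flaked) AS flaky_runs,
  COUNT(*) AS total_runs,
  ROUND(100.0 * COUNT(*) FILTER (WHERE flaked) / COUNT(*), 2) AS flake_pct
FROM (
  SELECT
    test_name,
    pipeline_id,
    BOOL_OR(attempt = 1 AND status = 'failed')
      AND BOOL_OR(attempt > 1 AND status = 'passed') AS flaked
  FROM test_results
  WHERE run_at > NOW() - INTERVAL '7 days'
  GROUP BY test_name, pipeline_id
) per_pipeline
GROUP BY test_name
ORDER BY flake_pct DESC
LIMIT 20;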
Mean Time to Recovery (MTTR)#
Time from a pipeline failure to the next successful run on the same branch. High MTTR indicates that failures are hard to diagnose or fix. This metric incentivizes investing in clear error messages, good test output formatting, and fast feedback loops.
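A sketch of the MTTR calculation, assuming pipeline runs are stored in a hypothetical pipeline_runs table (Postgres) with branch, status, and finished_at columns:
-- For each failed run on main, pair it with the next successful run on the same branch.
SELECT AVG(next_success.finished_at - f.finished_at) AS avg_mttr
FROM pipeline_runs f
CROSS JOIN LATERAL (
  SELECT finished_at
  FROM pipeline_runs s
  WHERE s.branch = f.branch
    AND s.status = 'success'
    AND s.finished_at > f.finished_at
  ORDER BY s.finished_at
  LIMIT 1
) AS next_success
WHERE f.branch = 'main'
  AND f.status = 'failed'
  AND f.finished_at > NOW() - INTERVAL '30 days';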
DORA Metrics#
The four DORA (DevOps Research and Assessment) metrics measure software delivery performance:
- Deployment Frequency – How often you deploy to production
- Lead Time for Changes – Time from commit to production deployment
- Change Failure Rate – Percentage of deployments that cause a production incident
- Mean Time to Restore – Time from incident detection to resolution
Collecting DORA Metrics#
DORA metrics span multiple systems (source control, CI/CD, incident management), so collection requires stitching data together.
Deployment Frequency: Count production deployment events. In GitHub Actions, track successful completions of your deploy-to-production workflow:
# In your deploy workflow, emit a metric on success
- name: Record deployment
  if: success()
  run: |
    curl -X POST https://metrics.internal/api/v1/write \
      -H "Content-Type: application/json" \
      -d '{
        "metric": "deployments_total",
        "labels": {"env": "production", "service": "${{ github.repository }}"},
        "value": 1,
        "timestamp": "'$(date +%s)'"
      }'
Lead Time for Changes: Calculate the time between the first commit in a PR and the production deployment that includes it. This requires correlating git commit timestamps with deployment timestamps:
-- Query for lead time (assuming you store events in a database)
SELECT
  AVG(deployed_at - first_commit_at) AS avg_lead_time,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY deployed_at - first_commit_at) AS median_lead_time
FROM deployments d
JOIN merge_requests mr ON d.merge_request_id = mr.id
WHERE d.environment = 'production'
  AND d.deployed_at > NOW() - INTERVAL '30 days';
Change Failure Rate: Requires linking deployments to incidents. If your incident management tool (PagerDuty, Opsgenie, custom) tags incidents with the deployment that caused them, you can compute this directly. Otherwise, approximate it by counting deployments followed by a rollback within N hours.
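A sketch of that approximation, reusing the deployments table from the lead-time query and assuming a hypothetical deployment_type column that distinguishes deploys from rollbacks:
-- A deployment counts as "failed" if a production rollback followed it within 6 hours.
SELECT
  100.0 * COUNT(*) FILTER (WHERE failed) / COUNT(*) AS change_failure_rate_pct
FROM (
  SELECT
    d.deployed_at,
    EXISTS (
      SELECT 1
      FROM deployments r
      WHERE r.environment = 'production'
        AND r.deployment_type = 'rollback'
        AND r.deployed_at BETWEEN d.deployed_at AND d.deployed_at + INTERVAL '6 hours'
    ) AS failed
  FROM deployments d
  WHERE d.environment = 'production'
    AND d.deployment_type = 'deploy'
    AND d.deployed_at > NOW() - INTERVAL '30 days'
) per_deployment;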
OpenTelemetry for CI Pipelines#
OpenTelemetry (OTel) traces model CI pipelines naturally: a pipeline is a trace, each job is a span, and each step within a job is a child span. This gives you distributed-tracing-style visibility into your build process.
GitHub Actions with OTel#
Use the otel-cicd-action or emit spans directly:
jobs:
  build:
    runs-on: ubuntu-latest
    env:
      OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-collector.internal:4318
      OTEL_SERVICE_NAME: ci-pipeline
    steps:
      - name: Start trace
        id: trace
        run: |
          TRACE_ID=$(openssl rand -hex 16)
          SPAN_ID=$(openssl rand -hex 8)
          START_TIME=$(date +%s%N)
          echo "trace_id=$TRACE_ID" >> "$GITHUB_OUTPUT"
          echo "span_id=$SPAN_ID" >> "$GITHUB_OUTPUT"
          echo "start_time=$START_TIME" >> "$GITHUB_OUTPUT"
      - uses: actions/checkout@v4
      - name: Build
        run: go build -o app ./cmd/server
      - name: Test
        run: go test ./... -v
      - name: Emit trace
        if: always()
        run: |
          END_TIME=$(date +%s%N)
          curl -X POST "$OTEL_EXPORTER_OTLP_ENDPOINT/v1/traces" \
            -H "Content-Type: application/json" \
            -d '{
              "resourceSpans": [{
                "resource": {"attributes": [
                  {"key": "service.name", "value": {"stringValue": "ci-pipeline"}},
                  {"key": "ci.pipeline.id", "value": {"stringValue": "${{ github.run_id }}"}},
                  {"key": "ci.repository", "value": {"stringValue": "${{ github.repository }}"}}
                ]},
                "scopeSpans": [{
                  "spans": [{
                    "traceId": "${{ steps.trace.outputs.trace_id }}",
                    "spanId": "${{ steps.trace.outputs.span_id }}",
                    "name": "build-and-test",
                    "kind": 1,
                    "startTimeUnixNano": "${{ steps.trace.outputs.start_time }}",
                    "endTimeUnixNano": "'"$END_TIME"'",
                    "status": {"code": "${{ job.status == 'success' && '1' || '2' }}"}
                  }]
                }]
              }]
            }'
For a more robust approach, use dedicated CI observability tools like Honeycomb’s buildevents or Grafana’s CI/CD integration that automatically instrument pipelines without manual span management.
GitLab Native OTel#
GitLab has built-in OpenTelemetry trace export. Enable it in project settings under CI/CD > General pipelines > OpenTelemetry tracing, and provide your OTLP endpoint. GitLab automatically creates traces for pipelines, spans for jobs, and child spans for sections within jobs.
Grafana Dashboards#
A CI/CD dashboard should answer three questions at a glance: Is CI healthy right now? Is it getting better or worse over time? Where should we invest optimization effort?
Dashboard Layout#
Row 1 – Current Health:
- Pipeline success rate (last 24h) – single stat, green/yellow/red thresholds
- Median build duration (last 24h) – single stat with a trend sparkline
- Current queue depth – gauge showing jobs waiting for runners
- Active runners – count of runners currently executing jobs
Row 2 – Trends:
- Build duration p50/p95/p99 over time – time series graph, 7-day window
- Failure rate over time – time series with annotation markers for deployment events
- Queue time over time – time series, correlated with runner count
Row 3 – Breakdown:
- Duration by pipeline stage – stacked bar chart showing where time is spent
- Failure reasons – pie chart (test failure, infra failure, timeout, OOM)
- Slowest pipelines – table showing the 10 longest-running pipelines this week
Example Grafana Panel (JSON Model)#
{
  "title": "Pipeline Duration p95",
  "type": "timeseries",
  "datasource": "Prometheus",
  "targets": [{
    "expr": "histogram_quantile(0.95, sum(rate(ci_pipeline_duration_seconds_bucket[1h])) by (le))",
    "legendFormat": "p95"
  }, {
    "expr": "histogram_quantile(0.50, sum(rate(ci_pipeline_duration_seconds_bucket[1h])) by (le))",
    "legendFormat": "p50"
  }],
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 600, "color": "yellow"},
          {"value": 1200, "color": "red"}
        ]
      }
    }
  }
}
Alerting on Build Regressions#
Alert on symptoms, not causes. A 10% increase in build duration on a single pipeline is noise. A 30% increase in p95 build duration across all pipelines sustained for 2 hours is a signal.
Effective Alert Rules#
# Prometheus alerting rules
groups:
  - name: ci-pipeline-alerts
    rules:
      - alert: HighPipelineFailureRate
        expr: |
          sum(rate(ci_pipeline_completed_total{status="failed", branch_type="main"}[1h]))
          /
          sum(rate(ci_pipeline_completed_total{branch_type="main"}[1h]))
          > 0.3
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Main branch pipeline failure rate above 30% for 30 minutes"
      - alert: PipelineDurationRegression
        expr: |
          histogram_quantile(0.95, sum(rate(ci_pipeline_duration_seconds_bucket[2h])) by (le))
          >
          1.5 * histogram_quantile(0.95, sum(rate(ci_pipeline_duration_seconds_bucket[2h] offset 7d)) by (le))
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Pipeline p95 duration 50% higher than same time last week"
      - alert: HighQueueTime
        expr: |
          histogram_quantile(0.95, sum(rate(ci_job_queue_duration_seconds_bucket[30m])) by (le))
          > 300
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Jobs waiting more than 5 minutes for runners"
The duration regression alert compares current performance against the same period last week, which accounts for weekly patterns (e.g., more builds on weekdays). The for clause prevents alerting on transient spikes.
Getting Started#
If you have no pipeline observability today, start with three things:
- Export pipeline events to a time-series database. Most CI systems have webhooks or APIs that emit pipeline completion events. Write a small service that receives these events and writes them to Prometheus (via Pushgateway) or a metrics endpoint; a minimal relational events table, sketched after this list, works just as well.
- Build one dashboard with four panels: success rate, median duration, queue time, and failure count by type. This takes an hour and immediately reveals patterns you did not know existed.
- Set one alert: main branch failure rate above 30% for 30 minutes. This catches systemic problems (broken dependency, infrastructure outage) without alerting on individual developer test failures.
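If you prefer a relational store, a single pipeline-level events table is enough to power the SQL examples earlier in this section. A minimal sketch, assuming Postgres; the table and column names are illustrative, not a fixed schema:
-- Hypothetical pipeline_runs table that the webhook-receiver service from step 1 could populate.
CREATE TABLE pipeline_runs (
    id            BIGSERIAL PRIMARY KEY,
    pipeline_id   TEXT NOT NULL,      -- the CI system's run identifier
    repository    TEXT NOT NULL,
    branch        TEXT NOT NULL,
    status        TEXT NOT NULL,      -- 'success', 'failed', 'canceled'
    failure_type  TEXT,               -- 'test', 'infrastructure', 'timeout', 'oom'
    queued_at     TIMESTAMPTZ,
    started_at    TIMESTAMPTZ,
    finished_at   TIMESTAMPTZ NOT NULL
);
With queued_at, started_at, and finished_at recorded per run, queue time, build duration, failure rate by type, and MTTR all reduce to straightforward aggregations.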
Iterate from there. Add DORA metrics once you have reliable deployment and incident tracking. Add OTel traces when you need to diagnose why specific pipelines are slow. The goal is not comprehensive observability on day one – it is a feedback loop that improves over time.