Testing Strategies in CI Pipelines#

Every CI pipeline faces the same tension: run enough tests to catch bugs before they merge, but not so many that developers wait twenty minutes for feedback on a one-line change. The answer is not running everything everywhere. It is running the right tests at the right time.

The Three Test Tiers#

Tests divide into three tiers based on execution speed, failure signal quality, and infrastructure requirements.

Unit tests execute in milliseconds per test. They test individual functions or classes in isolation, mock external dependencies, and require no running services. A failing unit test points directly at the broken code.
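
A minimal example in Go, with ApplyDiscount standing in as a hypothetical pure function under test:

// A unit test: no I/O, no running services, and a failure points at
// exactly one function. ApplyDiscount is a hypothetical function under test.
func TestApplyDiscount(t *testing.T) {
    got := ApplyDiscount(10000, 20) // 20% off a 10000-cent total
    if got != 8000 {
        t.Errorf("ApplyDiscount(10000, 20) = %d, want 8000", got)
    }
}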

Integration tests execute in seconds per test. They test interactions between components – your API handler talking to a real database, your service calling another service through HTTP. They require running infrastructure and fail for more reasons than unit tests.

End-to-end (e2e) tests execute in seconds to minutes per test. They exercise the entire system from the user’s perspective. They catch issues that no other test catches: broken deployments, misconfigured services, UI rendering problems. They are also the slowest, the most fragile, and the hardest to debug.

When to Run Which Tests#

The decision of which tests to run depends on the pipeline trigger. Each trigger has different time budgets and risk profiles.

Pull Request Pipelines#

Time budget: 5-10 minutes. This is the feedback loop developers experience on every push. Slow PR pipelines kill velocity.

Run all unit tests. Always. They are fast enough to run every time and they catch the majority of regressions. Run integration tests for the changed components only. If the PR modifies the payment service, run payment service integration tests, not the entire integration suite. Skip full e2e tests on PRs unless the change touches critical user flows.

# GitHub Actions example: PR pipeline
on:
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-unit

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main exists for the diff below
      - name: Detect changed packages
        id: changes
        run: |
          # Top-level directories touched by this PR, space-separated so the
          # value fits on one $GITHUB_OUTPUT line
          CHANGED=$(git diff --name-only origin/main...HEAD | cut -d/ -f1 | sort -u | tr '\n' ' ')
          echo "packages=$CHANGED" >> "$GITHUB_OUTPUT"
      - name: Run affected integration tests
        run: make test-integration PACKAGES="${{ steps.changes.outputs.packages }}"
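
How make test-integration narrows the run to the affected packages depends on your build tooling. One common Go arrangement, sketched here as an assumption rather than a prescription (the package and test names are illustrative), guards integration tests behind a build tag so a per-package run stays cheap:

//go:build integration

package payment_test

import "testing"

// Compiled only when the integration tag is set, so the Makefile target can
// run something like: go test -tags=integration ./payment/...
func TestChargeCard_WritesLedgerRow(t *testing.T) {
    // ...exercises the real database the CI job started
}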

Merge-to-Main Pipelines#

Time budget: 15-20 minutes. Code has already passed PR checks. This pipeline confirms the merged result is healthy.

Run all unit tests. Run all integration tests. Run a smoke suite of critical e2e tests – the five to ten tests that cover your most important user workflows (login, checkout, data export). This catches integration conflicts between PRs that were each individually green but break when combined.

Nightly Pipelines#

Time budget: 1-2 hours. No developer is waiting. This is your safety net.

Run everything. Full unit suite, full integration suite, full e2e suite. Run performance benchmarks. Run the tests you skip during the day because they take too long. This is where you detect slow-burn regressions and flaky tests that only fail intermittently.

on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC daily

jobs:
  full-suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-all
      - run: make test-e2e-full
      - run: make test-performance

Test Parallelization#

Serial test execution is the most common cause of slow CI. Parallelization strategies differ by test tier.

Unit tests parallelize trivially. Most test frameworks support parallel execution natively. Go runs test packages in parallel by default. Jest uses worker processes. pytest uses pytest-xdist:

# Go: parallel by default, control with -p flag
go test ./... -p 8

# Jest: workers based on CPU count
npx jest --maxWorkers=4

# pytest: distribute across workers
pytest -n auto
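
One Go-specific caveat: packages run in parallel with each other, but tests within a single package run serially unless they opt in with t.Parallel():

func TestCheckoutTotals(t *testing.T) {
    t.Parallel() // may run concurrently with other tests that also opt in
    // ...
}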

Integration tests need isolated infrastructure per parallel worker. Two integration tests hitting the same database will interfere with each other. Solutions include database-per-worker (create a fresh database for each parallel shard), schema-per-worker (cheaper than separate databases), or container-per-worker using testcontainers:

# pytest with testcontainers: under pytest-xdist, each worker process runs
# its own session, so this session-scoped fixture gives each worker its own
# throwaway Postgres container
import pytest
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def db():
    with PostgresContainer("postgres:16") as pg:
        yield pg.get_connection_url()
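
Schema-per-worker is the lighter-weight alternative when containers are too slow to start. A sketch in Go, assuming a shared Postgres instance and a TEST_WORKER_ID variable set by the CI shard (both names are placeholders):

import (
    "database/sql"
    "fmt"
    "os"
    "testing"
)

// setupSchema gives each parallel worker its own schema in a shared database.
func setupSchema(t *testing.T, db *sql.DB) {
    t.Helper()
    schema := fmt.Sprintf("ci_worker_%s", os.Getenv("TEST_WORKER_ID"))
    if _, err := db.Exec("CREATE SCHEMA IF NOT EXISTS " + schema); err != nil {
        t.Fatalf("create schema: %v", err)
    }
    if _, err := db.Exec("SET search_path TO " + schema); err != nil {
        t.Fatalf("set search_path: %v", err)
    }
    t.Cleanup(func() {
        db.Exec("DROP SCHEMA IF EXISTS " + schema + " CASCADE")
    })
}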

E2e tests parallelize by running independent test suites in separate CI jobs. The constraint is the test environment – you need enough capacity to handle parallel traffic without interference.

CI-Level Parallelization#

Split tests across multiple CI jobs using sharding:

jobs:
  unit-tests:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - run: make test-unit SHARD=${{ matrix.shard }} TOTAL_SHARDS=4
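
The SHARD and TOTAL_SHARDS values have to be honored inside the test run itself. One simple scheme, assuming the Makefile exports both variables to the test process, hashes each test name to a shard:

import (
    "hash/fnv"
    "os"
    "strconv"
    "testing"
)

// skipUnlessInShard assigns tests to shards by hashing the test name.
// Call it at the top of each test.
func skipUnlessInShard(t *testing.T) {
    t.Helper()
    total, _ := strconv.Atoi(os.Getenv("TOTAL_SHARDS"))
    if total < 2 {
        return // sharding disabled
    }
    shard, _ := strconv.Atoi(os.Getenv("SHARD")) // 1-based, matching the matrix
    h := fnv.New32a()
    h.Write([]byte(t.Name()))
    if int(h.Sum32())%total != shard-1 {
        t.Skip("assigned to another shard")
    }
}

Hashing balances test count, not duration, which is where timing-based splitting does better.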

Many CI platforms support automatic test splitting based on historical timing data. CircleCI and BuildKite do this natively. For GitHub Actions, tools like split-tests distribute tests based on recorded durations.

Flaky Test Management#

A flaky test is one that fails intermittently without any code change. Flaky tests are worse than missing tests because they erode trust in the pipeline. When developers see a red build and assume “that’s just the flaky test again,” they stop looking at failures. Real bugs slip through.

Detection#

Track test results over time. A test that passes 95% of the time and fails the other 5% with no code change is flaky. Most test frameworks can emit JUnit XML reports that you can aggregate across runs:

- name: Run tests with reporting
  run: |
    set -o pipefail  # without this, a failing go test is hidden by the pipe
    go test ./... -v -count=1 2>&1 | go-junit-report > report.xml
- uses: dorny/test-reporter@v1
  if: success() || failure()  # publish results even when the test step fails
  with:
    name: Test Results
    path: report.xml
    reporter: java-junit
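
Aggregation can be a small script over the archived reports. A sketch in Go, assuming one JUnit file per pipeline run collected under a reports/ directory with a testsuites root element as go-junit-report emits (the layout is an assumption):

package main

import (
    "encoding/xml"
    "fmt"
    "os"
    "path/filepath"
)

// Minimal JUnit shape: <testsuites> wrapping <testsuite> wrapping <testcase>.
type junitReport struct {
    Suites []struct {
        Cases []struct {
            Name    string    `xml:"name,attr"`
            Failure *struct{} `xml:"failure"`
        } `xml:"testcase"`
    } `xml:"testsuite"`
}

func main() {
    runs, fails := map[string]int{}, map[string]int{}
    files, _ := filepath.Glob("reports/*.xml") // one report per CI run
    for _, f := range files {
        data, err := os.ReadFile(f)
        if err != nil {
            continue
        }
        var r junitReport
        if xml.Unmarshal(data, &r) != nil {
            continue
        }
        for _, s := range r.Suites {
            for _, c := range s.Cases {
                runs[c.Name]++
                if c.Failure != nil {
                    fails[c.Name]++
                }
            }
        }
    }
    for name, n := range runs {
        // Fails sometimes but not always: the signature of flakiness.
        if f := fails[name]; f > 0 && f < n {
            fmt.Printf("%s: failed %d of %d runs\n", name, f, n)
        }
    }
}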

Quarantine#

Once identified, quarantine flaky tests immediately. Do not leave them in the main pipeline where they block merges. Move them to a separate CI job that runs but does not gate:

jobs:
  stable-tests:
    # This gates merges
    steps:
      - run: make test-unit SKIP_QUARANTINED=true

  quarantined-tests:
    # This runs but does not block
    continue-on-error: true
    steps:
      - run: make test-quarantined

Tag quarantined tests in your test framework so they are easy to filter:

func TestPaymentProcessing(t *testing.T) {
    if os.Getenv("SKIP_QUARANTINED") == "true" {
        t.Skip("quarantined: flaky due to race condition in mock server")
    }
    // ...
}

Resolution#

Quarantine is not the solution. It is triage. Every quarantined test needs a fix or a deletion. Common causes of flakiness include timing dependencies (use explicit waits instead of sleeps), shared state between tests (isolate test data), non-deterministic ordering (ensure tests do not depend on execution order), and external service availability (mock external calls in CI).
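
For the timing case specifically, a polling helper is usually enough to replace sleeps; a minimal sketch:

// waitFor polls cond until it returns true or the timeout elapses. Unlike a
// fixed time.Sleep, it waits only as long as needed and fails with a clear
// message instead of racing the system under test.
func waitFor(t *testing.T, timeout time.Duration, cond func() bool) {
    t.Helper()
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if cond() {
            return
        }
        time.Sleep(10 * time.Millisecond)
    }
    t.Fatalf("condition not met within %v", timeout)
}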

Set a policy: quarantined tests that are not fixed within two weeks get deleted. A test nobody fixes is a test nobody values.

Coverage Thresholds#

Code coverage measures what percentage of your code is executed by tests. It is a useful floor, not a ceiling. High coverage does not mean good tests, but low coverage guarantees untested code.

Setting Thresholds#

Set coverage gates in CI to prevent coverage from dropping on new code:

- name: Run tests with coverage
  run: go test ./... -coverprofile=coverage.out

- name: Check coverage threshold
  run: |
    COVERAGE=$(go tool cover -func=coverage.out | grep total | awk '{print $3}' | tr -d '%')
    if (( $(echo "$COVERAGE < 70" | bc -l) )); then
      echo "Coverage $COVERAGE% is below 70% threshold"
      exit 1
    fi

Recommended thresholds: 70-80% is a reasonable target for application code. Demanding 100% leads to brittle tests that exercise implementation details rather than behavior. Infrastructure code (Terraform modules, Helm charts) benefits more from integration tests than from unit coverage metrics.

Diff Coverage#

More useful than total coverage is diff coverage – the coverage of lines changed in a PR. Tools like diff-cover and Codecov show coverage on changed lines, catching new features with 0% coverage.

Test Reporting#

Test results are useless if nobody looks at them. JUnit XML is the universal report format – every major test framework can produce it, and CI platforms parse it to show results inline on PRs. Always generate JUnit reports. Use tools like dorny/test-reporter to post test summaries as PR comments. For nightly runs, track pass rate, execution time, and flaky test count over time to catch slow-burn regressions.

Decision Summary#

| Trigger       | Unit Tests | Integration Tests     | E2E Tests                | Time Budget |
|---------------|------------|-----------------------|--------------------------|-------------|
| PR push       | All        | Changed packages only | Skip or smoke only       | 5-10 min    |
| Merge to main | All        | All                   | Critical smoke suite     | 15-20 min   |
| Nightly       | All        | All                   | Full suite + performance | 1-2 hours   |

The goal is not to run every test on every trigger. It is to build confidence proportional to the risk: fast feedback for small changes, thorough validation before and after merging, exhaustive verification overnight.