Why Load Test#

Performance problems discovered in production are expensive. A service that handles 100 requests per second in dev might collapse at 500 in production because connection pools exhaust, garbage collection pauses compound, or a downstream service starts throttling. Load testing reveals these limits before users do.

Load testing answers specific questions: What is the maximum throughput before errors start? At what concurrency does latency degrade beyond acceptable limits? Can the system sustain expected traffic for hours without resource leaks? Will a traffic spike cause cascading failures?

Test Types#

Each test type answers a different question. Running only one type leaves blind spots.

Smoke test: Minimal load (1-5 virtual users) for a short duration (1-2 minutes). Verifies the test script works and the system is functional under trivial load. Run this first to catch script errors before committing to a longer test.

Load test: Expected production traffic for a sustained period (15-60 minutes). Validates the system handles normal demand. Use this as your baseline. If you can only run one test type, run this one.

Stress test: Traffic ramped beyond expected levels to find the breaking point. Increase load in steps (100%, 150%, 200% of expected traffic) and observe where errors begin and latency degrades. The goal is not to survive the stress – it is to find where things break and how they break.

Soak test: Expected load for an extended period (4-24 hours). Reveals memory leaks, connection pool exhaustion, log disk filling, certificate rotation failures, and other time-dependent problems that short tests miss.

Spike test: Sudden burst of traffic followed by a drop. Tests autoscaling responsiveness and queue handling. Simulates events like a marketing email send, a viral social media post, or a flash sale.
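
As an illustration of how these shapes are expressed in practice, here is how the stepped stress test described above might look as a k6 stages block (k6 is covered in the next section). Treat 100 virtual users as a stand-in for 100% of your expected traffic; the durations and targets are placeholders.

// stress-test stages (illustrative sketch)
export const options = {
  stages: [
    { duration: '2m', target: 100 },  // ramp to 100% of expected load
    { duration: '5m', target: 100 },  // hold
    { duration: '2m', target: 150 },  // step to 150%
    { duration: '5m', target: 150 },  // hold
    { duration: '2m', target: 200 },  // step to 200%
    { duration: '5m', target: 200 },  // hold and note where errors begin and latency degrades
    { duration: '2m', target: 0 },    // ramp down
  ],
};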

Tool Comparison#

k6#

Best for: Teams that want tests as code in JavaScript/TypeScript, CI integration, and modern developer experience.

k6 scripts are JavaScript. Tests are single files that are easy to version, review, and run in CI. Performance is excellent – k6 is written in Go and generates load from a single binary without requiring a JVM or Python runtime.

// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const latency = new Trend('request_latency');

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // ramp up
    { duration: '5m', target: 50 },   // hold steady
    { duration: '2m', target: 100 },  // increase
    { duration: '5m', target: 100 },  // hold
    { duration: '2m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<500'],
    errors: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/products');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 300ms': (r) => r.timings.duration < 300,
  });
  errorRate.add(res.status !== 200);
  latency.add(res.timings.duration);
  sleep(1);
}

Run with: k6 run k6-load-test.js

k6 thresholds fail the test if performance criteria are not met, making it natural for CI gates. Output integrates with Prometheus, Datadog, Grafana Cloud, and InfluxDB.
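
Several of these integrations are configured with k6's --out flag at run time. For example, assuming a local InfluxDB v1 instance (the URL is a placeholder):

k6 run --out influxdb=http://localhost:8086/k6 k6-load-test.js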

Locust#

Best for: Teams comfortable with Python who need complex user behavior simulation and a real-time web UI for monitoring test runs.

Locust models users as Python classes. Each user executes tasks with realistic think times. The web UI shows real-time charts during test execution, which is useful for exploratory testing.

# locustfile.py
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        self.client.post("/login", json={
            "username": "testuser",
            "password": "testpass"
        })

    @task(3)
    def browse_products(self):
        self.client.get("/api/products")

    @task(1)
    def view_product(self):
        self.client.get("/api/products/42")

    @task(1)
    def search(self):
        self.client.get("/api/search?q=widget")

Run with: locust -f locustfile.py --host https://api.example.com

Locust is built on gevent, a coroutine-based concurrency library for Python, so a single process can simulate thousands of users. For very high load, distribute the test across multiple processes or machines with the --master and --worker flags.
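
For example, a minimal distributed run starts one master process and any number of workers (the master address below is a placeholder):

Run the master with: locust -f locustfile.py --master
Run each worker with: locust -f locustfile.py --worker --master-host 10.0.0.5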

Gatling#

Best for: Teams using the JVM ecosystem (Scala/Java/Kotlin) who need detailed HTML reports and precise request timing.

Gatling generates rich HTML reports with detailed percentile breakdowns automatically. Its DSL is expressive; the example below uses the original Scala DSL, and Java and Kotlin DSLs are also available. The recorder tool generates scripts from browser sessions.

// PaymentSimulation.scala
import scala.concurrent.duration._

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class PaymentSimulation extends Simulation {
  val httpProtocol = http
    .baseUrl("https://api.example.com")
    .acceptHeader("application/json")

  val scn = scenario("Payment Flow")
    .exec(http("Get Products").get("/api/products"))
    .pause(2)
    .exec(http("Add to Cart").post("/api/cart")
      .body(StringBody("""{"product_id": 42, "quantity": 1}"""))
      .check(status.is(200)))
    .pause(1)
    .exec(http("Checkout").post("/api/checkout")
      .check(status.is(201)))

  setUp(
    scn.inject(
      rampUsers(100).during(60.seconds),           // ramp 100 users in over 1 minute
      constantUsersPerSec(20).during(300.seconds)  // then inject 20 new users/sec for 5 minutes
    )
  ).protocols(httpProtocol)
    .assertions(
      // percentile3 is the 3rd percentile configured in gatling.conf (95th by default)
      global.responseTime.percentile3.lt(500),
      global.successfulRequests.percent.gt(99)
    )
}
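
How you run the simulation depends on your build setup. Assuming the Gatling Maven plugin (one common choice; the pom.xml is not shown here), a run would look like: mvn gatling:test -Dgatling.simulationClass=PaymentSimulation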

JMeter#

Best for: Teams needing a GUI-based tool, protocol support beyond HTTP (JDBC, LDAP, FTP, JMS), or compatibility with existing enterprise test plans.

JMeter has the broadest protocol support and the largest community. The GUI is useful for building test plans but should not be used for running them – use command-line mode (jmeter -n -t testplan.jmx) for actual load generation. JMeter consumes significantly more memory per virtual user than k6 or Locust.
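
A typical non-GUI invocation that also records results and generates the HTML report dashboard looks like this (the file and directory names are placeholders):

jmeter -n -t testplan.jmx -l results.jtl -e -o report/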

Decision Matrix#

| Factor | k6 | Locust | Gatling | JMeter |
| --- | --- | --- | --- | --- |
| Language | JavaScript | Python | Scala | XML/GUI |
| Resource efficiency | Excellent | Good | Good | Fair |
| CI integration | Native | Script | Plugin | Plugin |
| Protocol support | HTTP, gRPC, WebSocket | HTTP, custom | HTTP, WebSocket | HTTP, JDBC, LDAP, FTP, JMS |
| Real-time monitoring | Grafana integration | Built-in web UI | Console | Listeners (heavy) |
| Reporting | Integrations | CSV/web | HTML (excellent) | JTL/HTML |
| Learning curve | Low | Low | Medium | Medium |
| Distributed testing | k6 Cloud or k6-operator | Built-in master/worker | Gatling Enterprise | Built-in remote |

Choose k6 when you want tests-as-code, CI gates on performance, and minimal infrastructure overhead. It is the best default choice for most teams.

Choose Locust when your team is Python-native and you need flexible user behavior modeling with a monitoring UI during development.

Choose Gatling when you need polished automated reports for stakeholders and your team is comfortable with Scala.

Choose JMeter when you need to test non-HTTP protocols or have existing JMeter test plans that work.

Realistic Traffic Modeling#

Synthetic benchmarks that hit a single endpoint in a loop do not reflect real traffic. Real users browse, pause, search, add items, and occasionally do unexpected things.

Model realistic traffic with these techniques:

Weighted endpoints: Distribute requests across endpoints proportional to real traffic. If 60% of production traffic hits /api/products, 25% hits /api/search, and 15% hits /api/checkout, your test should match.

Think time: Users pause between requests. Add sleep(1-3s) between actions. Without think time, your test models an automated bot, not a human user.

Session flow: Real users follow paths – browse, select, add to cart, checkout. Model these as sequences, not independent requests.

Data variation: Use parameterized data for search queries, product IDs, and user credentials. Hitting the same product ID every time means you are testing cache performance, not application performance.

Ramp patterns: Traffic does not jump from zero to peak. Ramp up over 2-5 minutes to let connection pools warm up, caches fill, and JIT compilers optimize.
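
The sketch below combines weighted endpoints, data variation, and think time in a single k6 scenario. The 60/25/15 split and the endpoint paths follow the example above; the product IDs, search terms, and think-time range are illustrative placeholders.

// traffic-model.js (illustrative sketch, not a drop-in script)
import http from 'k6/http';
import { sleep } from 'k6';

const BASE = 'https://api.example.com';
const productIds = [11, 42, 57, 103];               // parameterized data, not real IDs
const searchTerms = ['widget', 'gadget', 'gizmo'];

function pick(items) {
  return items[Math.floor(Math.random() * items.length)];
}

export default function () {
  const roll = Math.random();
  if (roll < 0.6) {
    // ~60% of traffic browses the catalog
    http.get(`${BASE}/api/products`);
  } else if (roll < 0.85) {
    // ~25% searches, with varied terms so the test is not cache-only
    http.get(`${BASE}/api/search?q=${pick(searchTerms)}`);
  } else {
    // ~15% checks out, with varied product IDs
    http.post(
      `${BASE}/api/checkout`,
      JSON.stringify({ product_id: pick(productIds), quantity: 1 }),
      { headers: { 'Content-Type': 'application/json' } }
    );
  }
  sleep(1 + Math.random() * 2);                      // think time: 1-3 seconds
}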

Performance Baselines#

A baseline is a known-good performance profile for your system. Without a baseline, you cannot tell if a change made things worse.

Establish baselines by running the same load test weekly against a stable version:

Baseline for payment-api (2026-02-15, v2.3.1):
  50 concurrent users, 5 minutes steady state
  Throughput: 245 req/s
  p50 latency: 42ms
  p95 latency: 128ms
  p99 latency: 340ms
  Error rate: 0.02%
  CPU utilization: 35%
  Memory utilization: 420MB

Store baselines alongside test scripts in version control. Compare each test run against the baseline to detect regressions. A 20% increase in p99 latency between releases is a signal worth investigating even if it is still within SLO bounds.
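
One way to automate the comparison is to encode the baseline, plus an agreed regression margin, directly as k6 thresholds so a regression fails the run. The 20% margin below is an assumption; pick whatever tolerance your team agrees on.

// baseline-gate.js (illustrative: thresholds derived from the baseline above)
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<154', 'p(99)<408'],  // baseline p95 128ms and p99 340ms, plus 20%
    http_req_failed: ['rate<0.001'],                // baseline error rate 0.02%, gated at 0.1%
  },
};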

CI Integration#

Run load tests in CI to catch performance regressions before they reach production.

GitHub Actions example with k6:

# .github/workflows/load-test.yml
name: Load Test
on:
  pull_request:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start application
        run: docker compose up -d
        working-directory: ./deploy

      - name: Wait for healthy
        run: |
          for i in $(seq 1 30); do
            curl -sf http://localhost:8080/health && exit 0
            sleep 2
          done
          echo "Application did not become healthy in time" >&2
          exit 1

      - name: Install k6
        run: |
          sudo gpg -k
          sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
            --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D68
          echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
            | sudo tee /etc/apt/sources.list.d/k6.list
          sudo apt-get update && sudo apt-get install -y k6

      - name: Run load test
        run: k6 run --out json=results.json tests/load-test.js

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: k6-results
          path: results.json

The k6 thresholds in the test script act as the CI gate. If p95 latency exceeds the threshold or error rate is too high, k6 exits with a non-zero status code and the CI job fails.

What to test in CI: Run smoke tests on every PR. Run full load tests nightly or before releases. Soak tests are too long for CI – run them as a scheduled job in a dedicated environment.
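
For the PR-level smoke test, a separate and much smaller options block is enough. The values below follow the smoke-test definition from earlier (a few virtual users for a minute or two); the threshold is an illustrative placeholder.

// smoke-test options (illustrative)
export const options = {
  vus: 3,              // minimal virtual users
  duration: '90s',     // short run: verify the script and basic health, nothing more
  thresholds: {
    http_req_failed: ['rate<0.01'],
  },
};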

What not to test in CI: Do not run load tests against shared staging environments from CI – other teams’ tests will skew results. Use a dedicated ephemeral environment or a local Docker Compose setup.

Interpreting Results#

Focus on percentiles, not averages. An average response time of 100ms might hide the fact that 1% of users wait 5 seconds. Report p50 (median experience), p95 (most users’ worst case), and p99 (tail latency).

Watch for these patterns in results:

  • Latency climbs with concurrency: Connection pool or thread pool exhaustion. Check pool sizes and waiting queue behavior.
  • Errors spike at a specific throughput: A dependency is rate-limiting or a resource ceiling has been hit. Identify the bottleneck – is it CPU, memory, database connections, or an external API?
  • Memory grows over time during soak tests: Memory leak. Profile the application and check for unclosed connections, growing caches, or accumulating goroutines/threads.
  • Latency is bimodal (cluster of fast and cluster of slow): Cold cache misses, garbage collection pauses, or requests hitting different backend instances with different performance characteristics.