L4 vs L7 Load Balancing#
The distinction between Layer 4 and Layer 7 load balancing determines what the load balancer can see and what routing decisions it can make.
Layer 4 (Transport) load balancers work at the TCP/UDP level. They see source/destination IPs and ports but not the content of the traffic. They forward raw TCP connections to backends. This makes them fast (no protocol parsing), protocol-agnostic (works for HTTP, gRPC, database connections, custom protocols), and transparent (the backend sees the original packets, mostly). Use L4 for database connections, raw TCP services, and when you need maximum throughput with minimum latency.
Layer 7 (Application) load balancers understand HTTP. They can inspect headers, URL paths, cookies, and query parameters. This enables sophisticated routing: send /api/* to the API backend, /static/* to the CDN origin, and route based on the Host header for multi-tenant setups. They can also modify requests (add headers, rewrite URLs) and terminate TLS. Use L7 for HTTP services, path-based routing, host-based routing, and when you need TLS termination or WAF integration.
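A minimal Python sketch of what an L7 routing decision looks like; the hostnames, paths, and pool names are illustrative, and an L4 balancer sees none of these fields:

# Illustrative L7 routing decision: inspect the Host header and URL
# path, then pick a backend pool.
def route(host: str, path: str) -> str:
    if host == "app.example.com" and path.startswith("/api/"):
        return "api-service"
    if path.startswith("/static/"):
        return "cdn-origin"
    return "default-pool"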
L4 decision: "TCP connection to port 443 → forward to backend pool"
L7 decision: "GET /api/users with Host: app.example.com → forward to api-service"

Health Check Configuration#
Health checks determine whether a backend is capable of serving traffic. A poorly configured health check causes either flapping (a backend repeatedly marked up and down) or traffic continuing to flow to dead backends.
Types of health checks:
- TCP: connects to the port, considers healthy if the connection succeeds. Fast but only proves the port is open, not that the application is working.
- HTTP: sends a GET request to a path, checks for a 200 response. Proves the application can serve requests.
- gRPC: uses the gRPC health checking protocol. Required for gRPC services.
An effective HTTP health check endpoint tests meaningful dependencies:
# Good: checks database connectivity and returns quickly
# (assumes an existing Flask `app` and a database handle `db`)
from flask import jsonify

@app.route('/healthz')
def health():
    try:
        db.execute("SELECT 1")
        return jsonify({"status": "healthy"}), 200
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 503

# Bad: returns 200 even if the database is down
@app.route('/healthz')
def health():
    return "OK", 200

Tuning parameters:
- Interval: 10 seconds (how often to check)
- Timeout: 5 seconds (max wait for a response)
- Healthy threshold: 2 (consecutive successes to mark healthy)
- Unhealthy threshold: 3 (consecutive failures to mark unhealthy)

Setting the interval too low (1-2 seconds) on a health check that queries a database creates unnecessary load. Setting the unhealthy threshold too low (1) causes flapping on transient network blips. A reasonable starting point is checking every 10 seconds with an unhealthy threshold of 3, meaning a backend is removed after 30 seconds of failures.
Connection Draining#
When a backend is removed from the pool (scaling down, deploying, failing health checks), in-flight requests should be allowed to complete rather than being abruptly terminated. Connection draining (also called deregistration delay) handles this.
# AWS ALB target group - Terraform
resource "aws_lb_target_group" "app" {
name = "app-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
deregistration_delay = 30 # seconds to drain connections
health_check {
path = "/healthz"
interval = 10
timeout = 5
healthy_threshold = 2
unhealthy_threshold = 3
matcher = "200"
}
}Set the drain timeout to slightly longer than your longest expected request. For a typical web API with a 30-second request timeout, 30 seconds of draining is appropriate. For batch processing endpoints, increase it. For WebSocket connections, increase it substantially or handle reconnection in the client.
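Draining works best when the application cooperates. A minimal sketch, assuming Flask as in the health check example earlier (the flag name and behavior are illustrative): on SIGTERM, start failing health checks so the LB deregisters the backend while in-flight requests run to completion.

# Cooperative draining: once SIGTERM arrives, /healthz returns 503,
# the LB marks this backend unhealthy and stops sending new requests,
# and requests already in flight finish normally.
import signal

from flask import Flask

app = Flask(__name__)
draining = False

def start_draining(signum, frame):
    global draining
    draining = True  # flips the health check below to 503

signal.signal(signal.SIGTERM, start_draining)

@app.route("/healthz")
def health():
    if draining:
        return "draining", 503
    return "OK", 200

The LB still needs unhealthy_threshold × interval to notice (30 seconds with the values above), so the process should keep running at least that long, plus the deregistration delay, before exiting.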
Session Affinity#
Session affinity (sticky sessions) routes requests from the same client to the same backend. It is needed when backend servers maintain client state in memory – shopping carts, WebSocket connections, or server-side sessions.
Cookie-based (L7): the load balancer sets a cookie on the first response, and subsequent requests with that cookie go to the same backend. This is the most reliable approach since it survives IP changes (mobile networks).
IP-based (L4 or L7): requests from the same source IP go to the same backend. Breaks when many users share the same IP (corporate NAT, mobile carrier NAT).
# Kubernetes Service with session affinity
apiVersion: v1
kind: Service
metadata:
  name: stateful-app
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: stateful-app

The tradeoff: session affinity prevents even load distribution. If one backend gets stuck with all the “heavy” sessions, it becomes overloaded while others sit idle. Prefer stateless backends with externalized session storage (Redis, database) whenever possible, and reserve session affinity for legacy applications that require it.
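A hedged sketch of the externalized alternative using redis-py; the Redis host and key layout are illustrative, and the TTL mirrors the 3600-second affinity timeout above:

# Session state in Redis instead of process memory: any backend can
# serve any request, so no affinity is required.
import json

import redis

r = redis.Redis(host="redis.internal", port=6379)

def save_cart(session_id: str, cart: dict) -> None:
    r.setex(f"session:{session_id}:cart", 3600, json.dumps(cart))

def load_cart(session_id: str) -> dict:
    raw = r.get(f"session:{session_id}:cart")
    return json.loads(raw) if raw else {}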
Load Balancing Algorithms#
Round-robin: requests go to backends in order, cycling through the list. Simple and works well when all requests take roughly the same time.
Least connections: new requests go to the backend with the fewest active connections. Better than round-robin when request processing times vary – slow requests pile up on one backend under round-robin, but least connections naturally distributes load.
IP hash: the client IP determines which backend handles the request, providing a form of session affinity without cookies. Consistent as long as the backend pool does not change.
Weighted: backends receive traffic in proportion to assigned weights. Useful for gradual rollouts (new version: weight 10, old version: weight 90) or heterogeneous hardware.
For most HTTP services, least connections is the best default. For services behind a CDN where most responses are similar, round-robin is sufficient.
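For intuition, a toy Python version of least-connections selection (backend names and counts are illustrative):

# Route to the backend with the fewest in-flight requests. A real LB
# maintains these counters itself as connections open and close.
active_connections = {"backend-a": 12, "backend-b": 3, "backend-c": 7}

def pick_backend(active: dict) -> str:
    return min(active, key=active.get)

print(pick_backend(active_connections))  # -> backend-b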
TLS Termination Patterns#
Where TLS is decrypted determines the security and operational characteristics of the architecture.
Termination at the load balancer (most common): the LB decrypts TLS and forwards plain HTTP to backends. Simpler certificate management (one place), LB can inspect HTTP for routing, and backends avoid the CPU overhead of TLS. The tradeoff is that traffic between the LB and backends is unencrypted, which is acceptable within a VPC but may not meet compliance requirements.
End-to-end (passthrough): the LB forwards encrypted traffic to backends without decrypting it. The backend handles TLS directly. Used when the LB must not see the traffic content (regulatory requirement) or for non-HTTP protocols. The LB cannot do L7 routing since it cannot read the traffic.
Re-encryption: the LB terminates the client TLS connection, inspects the traffic, then opens a new TLS connection to the backend. Provides L7 routing capability while encrypting traffic to backends. Double the TLS overhead, but satisfies compliance requirements for encryption in transit.
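To make re-encryption concrete, a hedged sketch of the second hop only, using the Python standard library (backend hostname and port are illustrative): the LB has already terminated and inspected the client's session, and now opens an independent TLS session toward the backend. The diagram below summarizes all three patterns.

# The LB-to-backend leg of re-encryption: a fresh TLS session,
# separate from the client's, carrying the already-inspected request.
import http.client
import ssl

ctx = ssl.create_default_context()  # verifies the backend's certificate
conn = http.client.HTTPSConnection("backend.internal", 8443, context=ctx)
conn.request("GET", "/api/users", headers={"Host": "app.example.com"})
response = conn.getresponse()
print(response.status)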
Termination at LB: Client ──TLS──> LB ──HTTP──> Backend
Passthrough:       Client ──TLS────────────────> Backend
Re-encryption:     Client ──TLS──> LB ──TLS──> Backend

Cloud Load Balancer Selection#
AWS#
- ALB (Application Load Balancer): L7. HTTP/HTTPS, WebSocket, gRPC. Path-based and host-based routing, WAF integration, authentication. $0.0225/hour + $0.008/LCU.
- NLB (Network Load Balancer): L4. TCP/UDP/TLS. Static IPs, extreme performance (millions of requests/second), preserves source IP. $0.0225/hour + $0.006/NLCU. Cheaper per unit than ALB.
- CLB (Classic Load Balancer): Legacy. Do not use for new deployments.
Azure#
- Application Gateway: L7. HTTP/HTTPS, path-based routing, WAF, SSL termination. Equivalent to AWS ALB.
- Azure Load Balancer: L4. TCP/UDP, high performance. Equivalent to AWS NLB.
GCP#
- HTTP(S) Load Balancer: L7. Global (anycast), HTTP/2, gRPC, path-based routing. Unique in that it is global by default.
- TCP/UDP Load Balancer: L4. Regional or global. Network-level load balancing.
Cost Considerations#
Cloud load balancers charge per hour of operation plus per unit of data or connections processed. For a typical web application:
- AWS ALB: ~$20/month base + usage (~$50-200/month total for moderate traffic)
- AWS NLB: ~$20/month base + usage (cheaper per connection than ALB)

Cost reduction strategies: consolidate multiple services behind a single ALB using host-based routing instead of deploying one ALB per service. Use Kubernetes Ingress controllers (NGINX, Traefik) behind a single NLB to handle L7 routing in-cluster.
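Back-of-envelope arithmetic for the consolidation strategy, using the approximate base charge above (the service count is illustrative):

# Sharing one ALB across N services saves (N - 1) base charges;
# usage (LCU) charges are driven by traffic and do not change.
services = 8
alb_base_monthly = 20  # USD, approximate base charge per ALB
savings = (services - 1) * alb_base_monthly
print(f"~${savings}/month saved")  # -> ~$140/month saved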
WebSocket Support#
WebSocket connections start as HTTP and upgrade to a persistent bidirectional connection. Not all load balancers handle this well.
ALB supports WebSocket natively but has a default idle timeout of 60 seconds. Long-lived WebSocket connections need a longer idle timeout or an application-level keepalive; both are shown below, starting with the timeout in Terraform:
resource "aws_lb_target_group" "websocket" {
name = "ws-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
stickiness {
type = "lb_cookie"
cookie_duration = 86400
}
}
resource "aws_lb" "websocket" {
idle_timeout = 3600 # 1 hour for WebSocket connections
}Common Gotchas#
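The keepalive alternative, as a hedged sketch using the third-party websockets package (the URL is illustrative): pinging every 30 seconds keeps traffic flowing well inside the 60-second default idle timeout.

# Client-side keepalive: the library sends a WebSocket ping every
# ping_interval seconds, so the ALB never observes an idle connection.
import asyncio

import websockets

async def main():
    async with websockets.connect(
        "wss://app.example.com/ws", ping_interval=30
    ) as ws:
        async for message in ws:
            print(message)

asyncio.run(main())

Common Gotchas#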
Health checks too aggressive: an interval of 2 seconds with an unhealthy threshold of 1 means a single missed health check (maybe a garbage collection pause) removes the backend from the pool. The backend passes the next check 2 seconds later and is re-added. During the removal, in-flight requests to that backend may fail. Set the unhealthy threshold to at least 3.
LB timeout shorter than backend timeout: if the ALB idle timeout is 60 seconds but your application takes 90 seconds to process a request, the ALB closes the connection and returns a 504 Gateway Timeout to the client while the backend continues processing. The backend eventually completes the work but has no one to send the response to. Always set the LB timeout longer than your application’s maximum expected response time. For ALB, the idle_timeout attribute controls this. For API calls that may take long, consider asynchronous processing with polling instead of long-lived HTTP connections.
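A minimal sketch of that asynchronous pattern, assuming Flask (the endpoints and in-memory job store are illustrative): the initial request returns 202 immediately, and the client polls for the result instead of holding a connection open past the LB timeout.

# Accept the job fast, process it in the background, let the client
# poll. No request ever approaches the LB's idle timeout.
import uuid

from flask import Flask, jsonify

app = Flask(__name__)
jobs = {}  # job_id -> result; None while a worker is still processing

@app.route("/reports", methods=["POST"])
def start_report():
    job_id = str(uuid.uuid4())
    jobs[job_id] = None  # a background worker would fill this in
    return jsonify({"job_id": job_id}), 202

@app.route("/reports/<job_id>")
def poll_report(job_id):
    if job_id not in jobs:
        return jsonify({"status": "unknown"}), 404
    result = jobs[job_id]
    if result is None:
        return jsonify({"status": "processing"}), 200
    return jsonify({"status": "done", "result": result}), 200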