Circuit Breaker and Resilience Patterns#
In a microservice architecture, any downstream dependency can fail. Without resilience patterns, a single slow or failing service cascades into total system failure. Resilience patterns prevent this by failing fast, isolating failures, and recovering gracefully.
Circuit Breaker#
The circuit breaker pattern monitors calls to a downstream service and stops making calls when failures reach a threshold. It has three states.
States#
Closed (normal operation): All requests pass through. The circuit breaker counts failures. When failures exceed the threshold within a time window, the breaker trips to Open.
Open (failing fast): All requests are immediately rejected without calling the downstream service. This prevents piling requests onto an already failing service. After a configurable wait duration, the breaker moves to Half-Open.
Half-Open (probing): A limited number of requests are allowed through. If they succeed, the breaker resets to Closed. If any fail, the breaker returns to Open for another wait period.
```
              failure threshold
  [CLOSED] ─────────────────────> [OPEN]
     ^                              │
     │              timeout expires │
     │                              v
     └─────── success ────── [HALF-OPEN]
                                    │
                                 failure
                                    │
                                    v
                                 [OPEN]
```
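To make the transitions concrete, here is a minimal, illustrative sketch of the state machine in Java. It is not production code: it counts consecutive failures rather than a failure rate over a sliding window, and it allows a single probe call at a time in Half-Open. Libraries such as Resilience4j (shown later) handle those details for you.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker sketch: fail fast while OPEN, probe in HALF_OPEN,
// reset to CLOSED on success.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    synchronized <T> T call(Supplier<T> action) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;                  // wait expired: allow a probe
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;                         // success resets the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;                       // trip (or re-trip) the breaker
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```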
Istio/Envoy Circuit Breaking#
Istio configures circuit breaking through DestinationRules. Envoy implements this as outlier detection (ejecting unhealthy hosts) and connection pool limits.
```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-api
  namespace: payments
spec:
  host: payments-api
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

connectionPool limits prevent overwhelming a service:
- maxConnections: Maximum TCP connections to the host. Requests beyond this are queued or rejected.
- http1MaxPendingRequests: Maximum pending HTTP/1.1 requests. Once exceeded, new requests get 503.
- http2MaxRequests: Maximum concurrent HTTP/2 requests.
- maxRetries: Maximum concurrent retries across all requests. Prevents retry storms.
outlierDetection is Envoy’s circuit breaker for individual hosts behind a service:
- consecutive5xxErrors: Number of consecutive 5xx errors before ejecting the host from the load balancing pool.
- interval: How often the ejection check runs.
- baseEjectionTime: How long a host stays ejected. Actual time is baseEjectionTime * number_of_ejections, so a host that keeps failing stays out longer each time.
- maxEjectionPercent: Maximum percentage of hosts that can be ejected simultaneously. Setting 50 means at least half of your backends always receive traffic, even if they are failing. This prevents total blackout.
Application-Level: Resilience4j (Java/Kotlin)#
Resilience4j provides decorators you wrap around function calls. The circuit breaker is configured with failure rate thresholds and timing windows.
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                          // trip when 50% of calls fail
    .slowCallRateThreshold(80)                         // trip when 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10)                             // evaluate last 10 calls
    .minimumNumberOfCalls(5)                           // need at least 5 calls before tripping
    .build();

CircuitBreaker breaker = CircuitBreaker.of("payments", config);

Supplier<PaymentResult> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentService.process(order));

Try<PaymentResult> result = Try.ofSupplier(decorated)
    .recover(CallNotPermittedException.class, e -> PaymentResult.fallback());
```

Key configuration parameters:
- slidingWindowSize and slidingWindowType: COUNT_BASED evaluates the last N calls. TIME_BASED evaluates calls within the last N seconds. COUNT_BASED is simpler and more predictable.
- minimumNumberOfCalls: Do not trip the breaker until you have enough data. Without this, two failures in a row after startup would trip a 50% threshold.
- slowCallDurationThreshold: Treat slow calls as failures. A service returning 200 after 10 seconds is effectively failing.
Application-Level: Polly (.NET)#
```csharp
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => r.StatusCode == HttpStatusCode.ServiceUnavailable)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (result, duration) => logger.LogWarning("Circuit opened for {Duration}s", duration.TotalSeconds),
        onReset: () => logger.LogInformation("Circuit closed"),
        onHalfOpen: () => logger.LogInformation("Circuit half-open, probing")
    );
```

Retry with Exponential Backoff#
Retrying failed requests is straightforward. Retrying without backoff creates a thundering herd that makes things worse. Exponential backoff with jitter is the standard approach.
The Formula#
```
delay = min(base_delay * 2^attempt + random_jitter, max_delay)
```

Without jitter, all clients retry at the same intervals, creating synchronized spikes. Add random jitter to spread retries across the time window.
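As an illustration, here is a minimal Java sketch of the formula above: exponential growth, additive random jitter, capped at a maximum delay. The helper name and parameter values are placeholders, not taken from any library.

```java
import java.util.concurrent.ThreadLocalRandom;

// delay = min(base_delay * 2^attempt + random_jitter, max_delay)
final class Backoff {
    static long delayMillis(int attempt, long baseDelayMs, long maxDelayMs) {
        long exponential = baseDelayMs * (1L << attempt);                  // base_delay * 2^attempt
        long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs);   // random_jitter in [0, base)
        return Math.min(exponential + jitter, maxDelayMs);
    }

    public static void main(String[] args) {
        // With base 500 ms and a 10 s cap: roughly 0.5s, 1s, 2s, 4s, 8s, then capped at 10s.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("attempt " + attempt + ": " + delayMillis(attempt, 500, 10_000) + " ms");
        }
    }
}
```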
Istio Retry Configuration#
```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
  namespace: payments
spec:
  hosts:
  - payments-api
  http:
  - route:
    - destination:
        host: payments-api
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: "5xx,reset,connect-failure,retriable-4xx"
    timeout: 10s
```

- attempts: Maximum retry count. The total number of tries is attempts + 1 (the original plus retries).
- perTryTimeout: Timeout for each individual attempt, including retries. Without this, a slow call consumes the entire timeout budget on the first try, leaving nothing for retries.
- retryOn: Which conditions trigger a retry. 5xx covers server errors, reset covers connection resets, connect-failure covers TCP connection failures, and retriable-4xx covers 409 Conflict.
- timeout: Total timeout for the request including all retries. Set this to at least perTryTimeout * (attempts + 1) if every try should get its full budget, or lower if you want a hard cap that may cut retries short. Here, 2s * 4 tries = 8s, which the 10s value covers.
Resilience4j Retry#
```java
RetryConfig retryConfig = RetryConfig.<Response>custom()
    .maxAttempts(3)
    // exponential backoff: 500 ms initial wait, doubling on each attempt
    .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2.0))
    .retryOnResult(response -> response.getStatusCode() >= 500)
    .retryExceptions(IOException.class, TimeoutException.class)
    .ignoreExceptions(BusinessException.class)
    .build();
```

Only retry on transient failures. A 400 Bad Request will not succeed on the second try. A 503 Service Unavailable might. Never retry non-idempotent operations (like payment processing) unless you have idempotency keys.
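Applying the configuration follows the same decorator pattern as the circuit breaker. A brief sketch; the paymentClient call and the idempotency-key argument are illustrative placeholders, not part of the configuration above:

```java
Retry retry = Retry.of("payments", retryConfig);

// Illustrative idempotency key: lets the server deduplicate a retried charge,
// making an otherwise non-idempotent operation safe to retry.
String idempotencyKey = UUID.randomUUID().toString();

Supplier<Response> withRetry = Retry.decorateSupplier(
    retry, () -> paymentClient.charge(order, idempotencyKey));

Response response = withRetry.get();
```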
Bulkhead Pattern#
The bulkhead pattern isolates failure domains so one slow dependency does not exhaust resources needed for other operations. The name comes from ship bulkheads that prevent a single hull breach from flooding the entire vessel.
Thread Pool Isolation#
Assign separate thread pools to different downstream calls. If the payments service is slow and its thread pool fills up, the inventory service call still has its own pool.
```java
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(25)
    .maxWaitDuration(Duration.ofMillis(500))
    .build();

Bulkhead paymentsBulkhead = Bulkhead.of("payments", bulkheadConfig);
Bulkhead inventoryBulkhead = Bulkhead.of("inventory", bulkheadConfig);
```

- maxConcurrentCalls: Maximum simultaneous calls allowed. Requests beyond this are queued.
- maxWaitDuration: How long queued requests wait before being rejected. Set this low. Waiting 30 seconds to call a service that is probably down anyway wastes resources.
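To see the isolation in action, here is a brief usage sketch; the service calls are illustrative. Note that the Bulkhead shown above is semaphore-based; Resilience4j also offers ThreadPoolBulkhead if you want a dedicated thread pool per dependency.

```java
Supplier<PaymentResult> guardedPayments = Bulkhead.decorateSupplier(
    paymentsBulkhead, () -> paymentService.process(order));

Supplier<Inventory> guardedInventory = Bulkhead.decorateSupplier(
    inventoryBulkhead, () -> inventoryService.lookup(order));

// If payments stalls and its 25 permits are exhausted, further payment calls fail
// with BulkheadFullException after maxWaitDuration -- inventory calls still have
// their own 25 permits and keep flowing.
```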
Istio Connection Pool as Bulkhead#
Istio’s connectionPool settings in DestinationRules act as a bulkhead per destination:
```yaml
trafficPolicy:
  connectionPool:
    tcp:
      maxConnections: 50
    http:
      http1MaxPendingRequests: 25
      http2MaxRequests: 50
```

This limits the blast radius of a slow payments service to 50 concurrent connections. Other services are unaffected.
Timeout Management#
Every network call needs a timeout. Without one, a hanging connection holds resources indefinitely. The question is how to set them.
Timeout Budget#
Work backwards from the user-facing SLA. If the API gateway has a 5-second timeout for the user, and the request passes through three services in a chain, each downstream hop must get a timeout that fits inside its caller's remaining budget:

```
User -> Gateway (5s) -> Service A (2s) -> Service B (1.5s) -> Service C (1s)
```

Account for network latency between hops. If each hop adds 50ms, your budget is tighter than you think.
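One way to make the budget explicit is to carry a deadline through the call chain and give each outgoing request only the time that remains. A minimal sketch using java.net.http; the class names, URL, and budget handling are illustrative assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Derive each call's timeout from the remaining overall budget instead of
// hard-coding it, so nested calls cannot exceed the caller's SLA.
class DeadlineBudget {
    private final Instant deadline;

    DeadlineBudget(Duration totalBudget) {
        this.deadline = Instant.now().plus(totalBudget);
    }

    Duration remaining() {
        Duration left = Duration.between(Instant.now(), deadline);
        if (left.isZero() || left.isNegative()) {
            throw new IllegalStateException("timeout budget exhausted");
        }
        return left;
    }
}

class GatewayHandler {
    private final HttpClient client = HttpClient.newHttpClient();

    HttpResponse<String> callServiceA(DeadlineBudget budget) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://service-a.internal/orders"))  // illustrative URL
            .timeout(budget.remaining())                          // only what is left of the budget
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

A gateway would construct the budget from its own 5-second SLA and pass it down, so each successive hop sees a shrinking timeout.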
Per-Service Timeout in Istio#
```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts:
  - payments-api
  http:
  - route:
    - destination:
        host: payments-api
    timeout: 3s
```

Common Mistakes#
Timeout too long: Setting 30-second timeouts means threads are blocked for 30 seconds when things fail. Under load, this exhausts your thread pool.
Timeout too short: Setting an 80ms timeout when the P99 latency is 100ms means more than 1% of normal requests fail. Monitor actual latency distributions before setting timeouts.
No timeout at all: The default in most HTTP clients is infinite or very long. Always set explicit timeouts on every outgoing call.
Combining Patterns#
These patterns work best together. A typical resilience stack for a service call:
```
Timeout -> Retry (with backoff) -> Circuit Breaker -> Bulkhead -> Actual Call
```

The timeout prevents individual calls from hanging. The retry handles transient failures. The circuit breaker stops calling a service that is consistently failing. The bulkhead limits resource consumption per dependency.
In Resilience4j, compose decorators:
```java
Supplier<Response> resilientCall = Decorators.ofSupplier(() -> callPaymentService())
    .withBulkhead(paymentsBulkhead)
    .withCircuitBreaker(paymentsCircuitBreaker)
    .withRetry(paymentsRetry)
    .withFallback(asList(CallNotPermittedException.class),
        e -> Response.fallback("Payment service unavailable"))
    .decorate();
```

The order matters. Each decorator call wraps the supplier built so far, so the first one applied (the bulkhead) sits closest to the actual call and the last one (the fallback) ends up outermost. At execution time the fallback runs first, wrapping everything else, and the bulkhead guards the call itself.
Tuning Guidance#
Do not guess at configuration values. Start with conservative defaults and adjust based on production metrics:
- Measure P50, P95, and P99 latency of the downstream service under normal load.
- Set timeouts to slightly above P99 (give 20-30% headroom).
- Set circuit breaker failure threshold based on acceptable error rates (5-10% is typical).
- Set bulkhead limits based on measured peak concurrent calls plus 50% headroom.
- Monitor circuit breaker state changes. If the breaker trips frequently, the downstream service has a reliability problem that needs fixing at the source.