Building Resilient Microservices with Circuit Breakers and Retries
At 3 AM on a Tuesday, our payment service started timing out. Within minutes, the timeout cascaded: the order service hung waiting for payment responses, the API gateway filled its thread pool with pending order requests, and the entire platform went down. All because one database in one service was running a slow vacuum operation.
This is the cascade failure problem, and it's the single biggest operational risk in microservice architectures. A monolith fails all at once — dramatic but simple. Microservices fail in creative, unpredictable ways that can turn a minor issue in one service into a total system outage.
The solution is resilience engineering: circuit breakers, retries with backoff, bulkheads, timeouts, and fallbacks. These patterns don't prevent failures — they contain them. This guide covers the theory, the implementation, and the production lessons I've learned running microservices at scale.
Part 1: Why Microservices Fail Differently
In a monolith, if the database is slow, everything is slow — but nothing crashes. The slow query eventually completes, the response eventually returns, and the system recovers. In microservices, a slow dependency is often worse than a dead one.
The Three Failure Modes
- Hard failure: The service is down. Connection refused. This is actually the easiest to handle — you get an immediate error and can respond appropriately.
- Slow failure: The service is up but responding slowly. This is the killer. Your calling service's threads are blocked waiting, new requests queue up, and you run out of resources. According to Microsoft's Azure Architecture documentation, slow responses are the most common trigger for cascade failures.
- Partial failure: The service works for some requests but fails for others. A database connection pool exhaustion, for example, might cause 90% of requests to succeed and 10% to fail randomly.
The Cascade Effect
// What happens without resilience patterns:
User Request
→ API Gateway (thread pool: 200 threads)
→ Order Service (thread pool: 50 threads)
→ Payment Service (SLOW - 30s response time)
→ Database (vacuum running, queries slow)
// Timeline:
// T+0s: Payment service starts responding slowly (30s instead of 200ms)
// T+30s: Order service has 50 threads waiting on payment. New requests queue.
// T+60s: API gateway has 150 threads waiting on order service.
// T+90s: API gateway thread pool exhausted. ALL endpoints return 503.
// T+90s: Health checks start failing. Load balancer marks all instances unhealthy.
// T+120s: Complete outage. Even endpoints that don't need payment service are down.
The payment database was slow. The entire platform went down. This is unacceptable, and it's entirely preventable.
Part 2: Circuit Breakers
The circuit breaker pattern, popularized by Michael Nygard in Release It! and formalized by Martin Fowler, is modeled after electrical circuit breakers. When too many failures occur, the circuit "opens" and stops sending requests to the failing service, giving it time to recover.
The Three States
| State | Behavior | Transitions To |
|---|---|---|
| Closed (normal) | Requests flow through. Failures are counted. | Open (when failure threshold exceeded) |
| Open (tripped) | Requests fail immediately without calling the service. Returns fallback. | Half-Open (after timeout period) |
| Half-Open (testing) | A limited number of test requests are sent through. | Closed (if test requests succeed) or Open (if they fail) |
Implementation in Node.js
// circuit-breaker.ts
interface CircuitBreakerOptions {
failureThreshold: number; // Number of failures before opening
successThreshold: number; // Number of successes in half-open to close
timeout: number; // Time in ms before trying half-open
fallback?: () => any; // Fallback response when open
}
class CircuitBreaker {
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private failureCount = 0;
private successCount = 0;
private lastFailureTime: number | null = null;
private options: CircuitBreakerOptions;
constructor(options: CircuitBreakerOptions) {
this.options = {
failureThreshold: 5,
successThreshold: 3,
timeout: 30000,
...options,
};
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
if (this.shouldAttemptReset()) {
this.state = 'HALF_OPEN';
console.log('[CircuitBreaker] Transitioning to HALF_OPEN');
} else {
console.log('[CircuitBreaker] OPEN - returning fallback');
if (this.options.fallback) {
return this.options.fallback() as T;
}
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.options.successThreshold) {
this.state = 'CLOSED';
this.failureCount = 0;
this.successCount = 0;
console.log('[CircuitBreaker] Transitioning to CLOSED');
}
} else {
this.failureCount = 0;
}
}
private onFailure(): void {
this.lastFailureTime = Date.now();
// A single failure during HALF_OPEN testing re-opens the circuit immediately
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
this.successCount = 0;
console.log('[CircuitBreaker] Transitioning to OPEN');
return;
}
this.failureCount++;
if (this.failureCount >= this.options.failureThreshold) {
this.state = 'OPEN';
console.log('[CircuitBreaker] Transitioning to OPEN');
}
}
private shouldAttemptReset(): boolean {
return (
this.lastFailureTime !== null &&
Date.now() - this.lastFailureTime >= this.options.timeout
);
}
getState() {
return this.state;
}
}
// Usage:
const paymentCircuit = new CircuitBreaker({
failureThreshold: 5,
successThreshold: 3,
timeout: 30000,
fallback: () => ({ status: 'pending', message: 'Payment service temporarily unavailable' }),
});
async function processPayment(orderId: string) {
return paymentCircuit.execute(async () => {
const response = await fetch('https://payment-service/charge', {
method: 'POST',
body: JSON.stringify({ orderId }),
signal: AbortSignal.timeout(5000), // 5s timeout
});
if (!response.ok) throw new Error(`Payment failed: ${response.status}`);
return response.json();
});
}
Using Established Libraries
For production, use battle-tested libraries rather than rolling your own:
// Node.js: opossum
import CircuitBreaker from 'opossum';
const breaker = new CircuitBreaker(callPaymentService, {
timeout: 5000, // 5s timeout per request
errorThresholdPercentage: 50, // Open at 50% error rate
resetTimeout: 30000, // Try half-open after 30s
volumeThreshold: 10, // Minimum requests before tripping
});
breaker.on('open', () => metrics.increment('circuit.payment.open'));
breaker.on('close', () => metrics.increment('circuit.payment.close'));
breaker.on('halfOpen', () => metrics.increment('circuit.payment.halfOpen'));
breaker.on('fallback', () => metrics.increment('circuit.payment.fallback'));
breaker.fallback(() => ({ status: 'pending' }));
const result = await breaker.fire(orderId);
// Java/Spring: Resilience4j
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResponse processPayment(String orderId) {
return paymentClient.charge(orderId);
}
public PaymentResponse paymentFallback(String orderId, Exception e) {
return new PaymentResponse("pending", "Payment service unavailable");
}
// application.yml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
Part 3: Retry Strategies
Not every failure is permanent. A network blip, a momentary database overload, or a brief deployment — these are transient failures that resolve on their own. Retries handle transient failures, but naive retries make things worse.
The Retry Amplification Problem
Imagine a service handling 1,000 requests per second, and it goes down for 5 seconds. Without retries, the system sees 5,000 failed requests. With 3 retries per request, the system sees 5,000 + 15,000 = 20,000 requests when it comes back up. This surge often causes a second failure. This is documented extensively in Amazon's Builders' Library on retries.
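The arithmetic above can be sketched as a quick helper (hypothetical, purely for illustration):

```typescript
// Hypothetical helper illustrating retry amplification arithmetic:
// every request that fails during the outage triggers maxRetries extra calls.
function retryAmplification(rps: number, outageSeconds: number, maxRetries: number) {
  const failed = rps * outageSeconds;   // requests that fail during the outage
  const retries = failed * maxRetries;  // extra calls generated by retries
  return { failed, retries, total: failed + retries };
}

// 1,000 rps, 5s outage, 3 retries each:
// → { failed: 5000, retries: 15000, total: 20000 }
```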
Exponential Backoff with Jitter
The solution is exponential backoff (wait longer between each retry) combined with jitter (add randomness to prevent thundering herd):
// retry.ts
interface RetryOptions {
maxRetries: number;
baseDelay: number; // Initial delay in ms
maxDelay: number; // Cap the delay
jitterFactor: number; // 0 to 1
retryableErrors?: string[];
}
async function withRetry<T>(
fn: () => Promise<T>,
options: RetryOptions
): Promise<T> {
const { maxRetries, baseDelay, maxDelay, jitterFactor } = options;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
const isLastAttempt = attempt === maxRetries;
const isRetryable = isRetryableError(error, options.retryableErrors);
if (isLastAttempt || !isRetryable) {
throw error;
}
// Exponential backoff: baseDelay * 2^attempt
const exponentialDelay = baseDelay * Math.pow(2, attempt);
// Cap at maxDelay
const cappedDelay = Math.min(exponentialDelay, maxDelay);
// Add jitter: random value between 0 and cappedDelay * jitterFactor
const jitter = Math.random() * cappedDelay * jitterFactor;
const finalDelay = cappedDelay + jitter;
console.log(
`[Retry] Attempt ${attempt + 1}/${maxRetries} failed. ` +
`Retrying in ${Math.round(finalDelay)}ms`
);
await sleep(finalDelay);
}
}
throw new Error('Unreachable');
}
function isRetryableError(error: any, retryableErrors?: string[]): boolean {
// Network errors are always retryable
if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true;
// HTTP 429 (rate limit) and 5xx (server error) are retryable
if (error.status === 429 || (error.status >= 500 && error.status < 600)) return true;
// HTTP 4xx (client error) are NOT retryable (except 429)
if (error.status >= 400 && error.status < 500) return false;
return retryableErrors?.includes(error.code) ?? false;
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Usage:
const user = await withRetry(
() => userService.getUser(userId),
{
maxRetries: 3,
baseDelay: 200, // 200ms, 400ms, 800ms
maxDelay: 5000,
jitterFactor: 0.5,
}
);
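Retries are only safe when the operation is idempotent. For writes, the usual trick is an idempotency key generated once per logical request, so a retried POST can be deduplicated server-side. A sketch, with backoff omitted for brevity — the `Idempotency-Key` header follows a common convention (e.g. Stripe), and `postWithIdempotentRetry` is a hypothetical helper name:

```typescript
import { randomUUID } from 'crypto';

// Sketch: generate the idempotency key ONCE, outside the retry loop, so every
// attempt carries the same key and the server can deduplicate safely.
async function postWithIdempotentRetry(
  send: (headers: Record<string, string>) => Promise<Response>,
  maxRetries = 3
): Promise<Response> {
  const key = randomUUID(); // stable across all attempts
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send({ 'Idempotency-Key': key });
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

In practice you would compose this with the backoff from `withRetry`; the key point is that the key is created before the first attempt, not inside the retry callback.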
Retry Budget
A more sophisticated approach is a retry budget: limit the total number of retries across all requests, not per request. Google's SRE book chapter on handling overload recommends allowing retries only when the retry rate is below 10% of total requests.
// retry-budget.ts
class RetryBudget {
private windowMs: number;
private maxRetryRatio: number;
private window: { time: number; type: 'request' | 'retry' }[] = [];
constructor(windowMs = 60000, maxRetryRatio = 0.1) {
this.windowMs = windowMs;
this.maxRetryRatio = maxRetryRatio;
}
recordRequest() {
this.cleanup();
this.window.push({ time: Date.now(), type: 'request' });
}
canRetry(): boolean {
this.cleanup();
const requests = this.window.filter(w => w.type === 'request').length;
const retries = this.window.filter(w => w.type === 'retry').length;
if (requests === 0) return true;
return retries / requests < this.maxRetryRatio;
}
recordRetry() {
this.window.push({ time: Date.now(), type: 'retry' });
}
private cleanup() {
const cutoff = Date.now() - this.windowMs;
this.window = this.window.filter(w => w.time > cutoff);
}
}
Part 4: Timeouts — The Most Important Setting
If I could only configure one resilience mechanism, it would be timeouts. A missing or too-generous timeout is the root cause of almost every cascade failure I've investigated.
Timeout Strategy
// Every external call must have a timeout
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);
try {
const response = await fetch('https://api.example.com/data', {
signal: controller.signal,
});
return response.json();
} finally {
clearTimeout(timeoutId);
}
// Layered timeouts: each layer shorter than the one above
// API Gateway: 30s (total request budget)
// Service A → B: 10s
// Service B → DB: 3s
// This ensures that inner timeouts fire before outer ones,
// giving each layer a chance to handle the failure gracefully.
Timeout Guidelines
| Call Type | Recommended Timeout | Rationale |
|---|---|---|
| Database query | 1-5s | Queries should be fast; slow queries indicate a problem |
| Cache (Redis) | 100-500ms | Cache misses should be fast; if cache is slow, skip it |
| Internal service call | 3-10s | Depends on the operation; read vs write |
| External API | 5-30s | You don't control external services |
| File upload | 30-120s | Large payloads need more time |
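One way to keep these numbers out of scattered call sites is a lookup table plus a thin wrapper. A sketch, assuming `AbortSignal.timeout` is available (Node 17.3+ or modern browsers); the values mirror the table above and are starting points to tune, not gospel:

```typescript
type CallType = 'db' | 'cache' | 'internal' | 'external' | 'upload';

// Defaults taken from the guideline table above
const DEFAULT_TIMEOUT_MS: Record<CallType, number> = {
  db: 3000,
  cache: 300,
  internal: 5000,
  external: 15000,
  upload: 60000,
};

// Hypothetical helper: every fetch gets a timeout appropriate to its call type
function fetchWithTimeout(url: string, type: CallType, init: RequestInit = {}) {
  return fetch(url, {
    ...init,
    signal: AbortSignal.timeout(DEFAULT_TIMEOUT_MS[type]),
  });
}
```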
Part 5: Bulkheads — Isolating Failures
The bulkhead pattern, named after the compartments in a ship's hull, isolates different parts of your system so that a failure in one doesn't sink the whole ship.
Thread Pool Bulkheads
// Separate thread pools (connection pools in Node.js) for different dependencies
import { Agent } from 'undici';
// Payment service gets its own connection pool
const paymentAgent = new Agent({
connections: 20, // Max 20 concurrent connections
pipelining: 1,
connectTimeout: 5000,
});
// User service gets its own connection pool
const userAgent = new Agent({
connections: 50, // More connections (higher traffic)
pipelining: 1,
connectTimeout: 5000,
});
// If payment service exhausts its pool, user service is unaffected
async function getPayment(id: string) {
return fetch(`https://payment-service/payments/${id}`, {
dispatcher: paymentAgent,
});
}
async function getUser(id: string) {
return fetch(`https://user-service/users/${id}`, {
dispatcher: userAgent,
});
}
Semaphore Bulkheads
// Limit concurrent calls to a dependency
class Semaphore {
private permits: number;
private waiting: (() => void)[] = [];
constructor(permits: number) {
this.permits = permits;
}
async acquire(): Promise<void> {
if (this.permits > 0) {
this.permits--;
return;
}
return new Promise<void>(resolve => {
this.waiting.push(resolve);
});
}
release(): void {
if (this.waiting.length > 0) {
const next = this.waiting.shift()!;
next();
} else {
this.permits++;
}
}
}
// Only allow 10 concurrent payment processing calls
const paymentSemaphore = new Semaphore(10);
async function processPayment(orderId: string) {
await paymentSemaphore.acquire();
try {
return await paymentService.charge(orderId);
} finally {
paymentSemaphore.release();
}
}
Part 6: Fallback Strategies
When a service is unavailable (circuit open, retries exhausted), what do you return? The fallback strategy depends on the business context:
Fallback Hierarchy
- Cache fallback: Return stale data from cache. For read operations, this is often acceptable. "Here are the job listings as of 5 minutes ago" is better than "Service unavailable."
- Default value: Return a sensible default. A recommendation engine might return popular items instead of personalized ones.
- Degraded mode: Disable the feature entirely. If payment processing is down, allow users to browse but show "Checkout temporarily unavailable."
- Queue for later: Accept the request, put it in a queue, and process it when the dependency recovers. Best for write operations that aren't time-sensitive.
// Fallback hierarchy example
async function getJobRecommendations(userId: string) {
try {
// Primary: personalized recommendations from ML service
return await recommendationCircuit.execute(
() => mlService.getRecommendations(userId)
);
} catch {
try {
// Fallback 1: cached recommendations
const cached = await redis.get(`recommendations:${userId}`);
if (cached) return JSON.parse(cached);
} catch { /* redis also down */ }
try {
// Fallback 2: popular jobs (pre-computed, stored locally)
return await getPopularJobs();
} catch {
// Fallback 3: empty with message
return {
jobs: [],
message: 'Recommendations temporarily unavailable',
};
}
}
}
Part 7: Putting It All Together
Here's how all these patterns combine in a real service:
// resilient-client.ts
import CircuitBreaker from 'opossum';
import { Agent } from 'undici';
// withRetry and RetryBudget are the helpers defined in Part 3
function createResilientClient(serviceName: string, baseUrl: string) {
// Bulkhead: dedicated connection pool
const agent = new Agent({ connections: 20, connectTimeout: 5000 });
// Retry budget: limit retry storms
const retryBudget = new RetryBudget(60000, 0.1);
// Core request function with timeout
async function makeRequest(path: string, options: RequestInit = {}) {
const response = await fetch(`${baseUrl}${path}`, {
...options,
dispatcher: agent,
signal: AbortSignal.timeout(5000), // Timeout
});
if (!response.ok) {
const error = new Error(`${serviceName} error: ${response.status}`);
(error as any).status = response.status;
throw error;
}
return response.json();
}
// Wrap with retries
async function makeRequestWithRetry(path: string, options?: RequestInit) {
retryBudget.recordRequest();
let firstAttempt = true;
return withRetry(
() => {
// Record retries against the budget so canRetry() stays accurate
if (!firstAttempt) retryBudget.recordRetry();
firstAttempt = false;
return makeRequest(path, options);
},
{
maxRetries: retryBudget.canRetry() ? 2 : 0,
baseDelay: 200,
maxDelay: 2000,
jitterFactor: 0.5,
}
);
}
// Wrap with circuit breaker
const breaker = new CircuitBreaker(makeRequestWithRetry, {
timeout: 10000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 10,
});
// Metrics
breaker.on('success', () => metrics.increment(`${serviceName}.success`));
breaker.on('failure', () => metrics.increment(`${serviceName}.failure`));
breaker.on('open', () => metrics.increment(`${serviceName}.circuit_open`));
return {
get: (path: string) => breaker.fire(path),
post: (path: string, body: any) =>
breaker.fire(path, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(body),
}),
};
}
// Usage:
const paymentClient = createResilientClient('payment', 'https://payment-service');
const userClient = createResilientClient('user', 'https://user-service');
const payment = await paymentClient.post('/charge', { orderId: '123' });
Part 8: My Opinionated Take
After years of building and operating microservices, here are my strongest opinions:
1. Timeouts are more important than circuit breakers. A circuit breaker without a timeout is useless — the circuit never opens because calls hang forever instead of failing. Set timeouts first, add circuit breakers second.
2. Retries should be opt-in, not default. Most teams add retries everywhere and wonder why their systems are less reliable. Retries amplify load during failures. Use them only for idempotent operations and always with backoff + jitter.
3. You need fewer microservices than you think. Every network boundary is a reliability risk. If two services always deploy together and always call each other, they should probably be one service. The microservices.io patterns catalog has good guidance on when to split.
4. Test failure modes, not just success paths. Chaos engineering isn't just for Netflix. Kill a dependency in staging and verify that your fallbacks work. The first time you discover your circuit breaker is misconfigured should not be at 3 AM.
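A minimal fault injector makes this kind of testing cheap. A sketch, not a library API — `withFaults` and its options are hypothetical names:

```typescript
// Hypothetical fault injector for staging chaos tests: wraps any async call
// and injects artificial latency plus a configurable probability of failure.
interface FaultOptions {
  errorRate: number;      // 0..1 probability of throwing an injected error
  addedLatencyMs: number; // artificial delay before every call
}

function withFaults<T>(fn: () => Promise<T>, opts: FaultOptions): () => Promise<T> {
  return async () => {
    await new Promise(resolve => setTimeout(resolve, opts.addedLatencyMs));
    if (Math.random() < opts.errorRate) {
      throw new Error('Injected fault');
    }
    return fn();
  };
}
```

Wrap a staging client with `withFaults(call, { errorRate: 0.5, addedLatencyMs: 2000 })` and watch whether your circuit opens, your timeouts fire, and your fallbacks return sensible data.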
Part 9: Action Plan
Week 1: Audit
- List every external dependency for each service
- Check: does every call have a timeout? (Bet: half of them don't)
- Identify: which failures would cascade to other services?
- Review: are retries configured? With backoff? With jitter?
Week 2: Implement
- Add timeouts to every external call
- Add circuit breakers to the most critical dependencies
- Implement fallbacks for read paths (cache, defaults)
- Add retry budgets to prevent retry storms
Week 3: Test
- Kill each dependency in staging and observe behavior
- Add latency to each dependency and observe circuit breaker behavior
- Verify that fallbacks return sensible data
- Set up alerts for circuit state changes
Ongoing
- Monitor circuit breaker state and failure rates in production
- Run game days: intentionally inject failures and practice recovery
- Review and tune thresholds quarterly
Sources
- Martin Fowler: Circuit Breaker
- Microsoft Azure: Circuit Breaker Pattern
- Amazon Builders' Library: Timeouts, Retries, and Backoff with Jitter
- Google SRE Book: Handling Overload
- Microservices.io: Decomposition Patterns
- Resilience4j Documentation
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
