Building Resilient Microservices with Circuit Breakers and Retries
At 3 AM on a Tuesday, our payment service started timing out. Within minutes, the timeout cascaded: the order service hung waiting for payment responses, the API gateway filled its thread pool with pending order requests, and the entire platform went down. All because one database in one service was running a slow vacuum operation.
This is the cascade failure problem, and it's the single biggest operational risk in microservice architectures. A monolith fails all at once — dramatic but simple. Microservices fail in creative, unpredictable ways that can turn a minor issue in one service into a total system outage.
The solution is resilience engineering: circuit breakers, retries with backoff, bulkheads, timeouts, and fallbacks. These patterns don't prevent failures — they contain them. This guide covers the theory, the implementation, and the production lessons I've learned running microservices at scale.
Part 1: Why Microservices Fail Differently
In a monolith, if the database is slow, everything is slow — but nothing crashes. The slow query eventually completes, the response eventually returns, and the system recovers. In microservices, a slow dependency is often worse than a dead one.
The Three Failure Modes
- Hard failure: The service is down. Connection refused. This is actually the easiest to handle — you get an immediate error and can respond appropriately.
- Slow failure: The service is up but responding slowly. This is the killer. Your calling service's threads are blocked waiting, new requests queue up, and you run out of resources. According to Microsoft's Azure Architecture documentation, slow responses are the most common trigger for cascade failures.
- Partial failure: The service works for some requests but fails for others. A database connection pool exhaustion, for example, might cause 90% of requests to succeed and 10% to fail randomly.
The Cascade Effect
// What happens without resilience patterns:
User Request
→ API Gateway (thread pool: 200 threads)
→ Order Service (thread pool: 50 threads)
→ Payment Service (SLOW - 30s response time)
→ Database (vacuum running, queries slow)
// Timeline:
// T+0s: Payment service starts responding slowly (30s instead of 200ms)
// T+30s: Order service has 50 threads waiting on payment. New requests queue.
// T+60s: API gateway has 150 threads waiting on order service.
// T+90s: API gateway thread pool exhausted. ALL endpoints return 503.
// T+90s: Health checks start failing. Load balancer marks all instances unhealthy.
// T+120s: Complete outage. Even endpoints that don't need payment service are down.
The payment database was slow. The entire platform went down. This is unacceptable, and it's entirely preventable.
Part 2: Circuit Breakers
The circuit breaker pattern, popularized by Michael Nygard in Release It! and formalized by Martin Fowler, is modeled after electrical circuit breakers. When too many failures occur, the circuit "opens" and stops sending requests to the failing service, giving it time to recover.
The Three States
| State | Behavior | Transitions To |
|---|---|---|
| Closed (normal) | Requests flow through. Failures are counted. | Open (when failure threshold exceeded) |
| Open (tripped) | Requests fail immediately without calling the service. Returns fallback. | Half-Open (after timeout period) |
| Half-Open (testing) | A limited number of test requests are sent through. | Closed (if test requests succeed) or Open (if they fail) |
Implementation in Node.js
// circuit-breaker.ts
interface CircuitBreakerOptions {
failureThreshold: number; // Number of failures before opening
successThreshold: number; // Number of successes in half-open to close
timeout: number; // Time in ms before trying half-open
fallback?: () => any; // Fallback response when open
}
class CircuitBreaker {
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private failureCount = 0;
private successCount = 0;
private lastFailureTime: number | null = null;
private options: CircuitBreakerOptions;
constructor(options: CircuitBreakerOptions) {
this.options = {
failureThreshold: 5,
successThreshold: 3,
timeout: 30000,
...options,
};
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
if (this.shouldAttemptReset()) {
this.state = 'HALF_OPEN';
console.log('[CircuitBreaker] Transitioning to HALF_OPEN');
} else {
console.log('[CircuitBreaker] OPEN - returning fallback');
if (this.options.fallback) {
return this.options.fallback() as T;
}
throw new Error('Circuit breaker is OPEN');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess(): void {
if (this.state === 'HALF_OPEN') {
this.successCount++;
if (this.successCount >= this.options.successThreshold) {
this.state = 'CLOSED';
this.failureCount = 0;
this.successCount = 0;
console.log('[CircuitBreaker] Transitioning to CLOSED');
}
} else {
this.failureCount = 0;
}
}
private onFailure(): void {
this.lastFailureTime = Date.now();
// A single failure during HALF_OPEN testing re-opens the circuit immediately
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
this.successCount = 0;
console.log('[CircuitBreaker] Transitioning to OPEN');
return;
}
this.failureCount++;
if (this.failureCount >= this.options.failureThreshold) {
this.state = 'OPEN';
console.log('[CircuitBreaker] Transitioning to OPEN');
}
}
private shouldAttemptReset(): boolean {
return (
this.lastFailureTime !== null &&
Date.now() - this.lastFailureTime >= this.options.timeout
);
}
getState() {
return this.state;
}
}
// Usage:
const paymentCircuit = new CircuitBreaker({
failureThreshold: 5,
successThreshold: 3,
timeout: 30000,
fallback: () => ({ status: 'pending', message: 'Payment service temporarily unavailable' }),
});
async function processPayment(orderId: string) {
return paymentCircuit.execute(async () => {
const response = await fetch('https://payment-service/charge', {
method: 'POST',
body: JSON.stringify({ orderId }),
signal: AbortSignal.timeout(5000), // 5s timeout
});
if (!response.ok) throw new Error(`Payment failed: ${response.status}`);
return response.json();
});
}
Using Established Libraries
For production, use battle-tested libraries rather than rolling your own:
// Node.js: opossum
import CircuitBreaker from 'opossum';
const breaker = new CircuitBreaker(callPaymentService, {
timeout: 5000, // 5s timeout per request
errorThresholdPercentage: 50, // Open at 50% error rate
resetTimeout: 30000, // Try half-open after 30s
volumeThreshold: 10, // Minimum requests before tripping
});
breaker.on('open', () => metrics.increment('circuit.payment.open'));
breaker.on('close', () => metrics.increment('circuit.payment.close'));
breaker.on('halfOpen', () => metrics.increment('circuit.payment.halfOpen'));
breaker.on('fallback', () => metrics.increment('circuit.payment.fallback'));
breaker.fallback(() => ({ status: 'pending' }));
const result = await breaker.fire(orderId);
// Java/Spring: Resilience4j
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResponse processPayment(String orderId) {
return paymentClient.charge(orderId);
}
public PaymentResponse paymentFallback(String orderId, Exception e) {
return new PaymentResponse("pending", "Payment service unavailable");
}
// application.yml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowSize: 10
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
Part 3: Retry Strategies
Not every failure is permanent. A network blip, a momentary database overload, or a brief deployment — these are transient failures that resolve on their own. Retries handle transient failures, but naive retries make things worse.
The Retry Amplification Problem
Imagine a service handling 1,000 requests per second, and it goes down for 5 seconds. Without retries, the system sees 5,000 failed requests. With 3 retries per request, the system sees 5,000 + 15,000 = 20,000 requests when it comes back up. This surge often causes a second failure. This is documented extensively in Amazon's Builders' Library on retries.
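The arithmetic above can be sketched as a quick helper (hypothetical, purely for illustration):

```typescript
// Hypothetical helper illustrating retry amplification arithmetic:
// every request that fails during the outage triggers maxRetries extra calls.
function retryAmplification(rps: number, outageSeconds: number, maxRetries: number) {
  const failed = rps * outageSeconds;   // requests that fail during the outage
  const retries = failed * maxRetries;  // extra calls generated by retries
  return { failed, retries, total: failed + retries };
}

// 1,000 rps, 5s outage, 3 retries each:
// → { failed: 5000, retries: 15000, total: 20000 }
```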
Exponential Backoff with Jitter
The solution is exponential backoff (wait longer between each retry) combined with jitter (add randomness to prevent thundering herd):
// retry.ts
interface RetryOptions {
maxRetries: number;
baseDelay: number; // Initial delay in ms
maxDelay: number; // Cap the delay
jitterFactor: number; // 0 to 1
retryableErrors?: string[];
}
async function withRetry<T>(
fn: () => Promise<T>,
options: RetryOptions
): Promise<T> {
const { maxRetries, baseDelay, maxDelay, jitterFactor } = options;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
const isLastAttempt = attempt === maxRetries;
const isRetryable = isRetryableError(error, options.retryableErrors);
if (isLastAttempt || !isRetryable) {
throw error;
}
// Exponential backoff: baseDelay * 2^attempt
const exponentialDelay = baseDelay * Math.pow(2, attempt);
// Cap at maxDelay
const cappedDelay = Math.min(exponentialDelay, maxDelay);
// Add jitter: random value between 0 and cappedDelay * jitterFactor
const jitter = Math.random() * cappedDelay * jitterFactor;
const finalDelay = cappedDelay + jitter;
console.log(
`[Retry] Attempt ${attempt + 1}/${maxRetries} failed. ` +
`Retrying in ${Math.round(finalDelay)}ms`
);
await sleep(finalDelay);
}
}
throw new Error('Unreachable');
}
function isRetryableError(error: any, retryableErrors?: string[]): boolean {
// Network errors are always retryable
if (error.code === 'ECONNRESET' || error.code === 'ETIMEDOUT') return true;
// HTTP 429 (rate limit) and 5xx (server error) are retryable
if (error.status === 429 || (error.status >= 500 && error.status < 600)) return true;
// HTTP 4xx (client error) are NOT retryable (except 429)
if (error.status >= 400 && error.status < 500) return false;
return retryableErrors?.includes(error.code) ?? false;
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
// Usage:
const user = await withRetry(
() => userService.getUser(userId),
{
maxRetries: 3,
baseDelay: 200, // 200ms, 400ms, 800ms
maxDelay: 5000,
jitterFactor: 0.5,
}
);
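Retries are only safe when the operation is idempotent. For writes, the usual trick is an idempotency key generated once per logical request, so a retried POST can be deduplicated server-side. A sketch, with backoff omitted for brevity — the `Idempotency-Key` header follows a common convention (e.g. Stripe), and `postWithIdempotentRetry` is a hypothetical helper name:

```typescript
import { randomUUID } from 'crypto';

// Sketch: generate the idempotency key ONCE, outside the retry loop, so every
// attempt carries the same key and the server can deduplicate safely.
async function postWithIdempotentRetry(
  send: (headers: Record<string, string>) => Promise<Response>,
  maxRetries = 3
): Promise<Response> {
  const key = randomUUID(); // stable across all attempts
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send({ 'Idempotency-Key': key });
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

In practice you would compose this with the backoff from `withRetry`; the key point is that the key is created before the first attempt, not inside the retry callback.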
Retry Budget
A more sophisticated approach is a retry budget: limit the total number of retries across all requests, not per request. Google's SRE book chapter on handling overload recommends allowing retries only when the retry rate is below 10% of total requests.
// retry-budget.ts
class RetryBudget {
private windowMs: number;
private maxRetryRatio: number;
private window: { time: number; type: 'request' | 'retry' }[] = [];
constructor(windowMs = 60000, maxRetryRatio = 0.1) {
this.windowMs = windowMs;
this.maxRetryRatio = maxRetryRatio;
}
recordRequest() {
this.cleanup();
this.window.push({ time: Date.now(), type: 'request' });
}
canRetry(): boolean {
this.cleanup();
const requests = this.window.filter(w => w.type === 'request').length;
const retries = this.window.filter(w => w.type === 'retry').length;
if (requests === 0) return true;
return retries / requests < this.maxRetryRatio;
}
recordRetry() {
this.window.push({ time: Date.now(), type: 'retry' });
}
private cleanup() {
const cutoff = Date.now() - this.windowMs;
this.window = this.window.filter(w => w.time > cutoff);
}
}
Part 4: Timeouts — The Most Important Setting
If I could only configure one resilience mechanism, it would be timeouts. A missing or too-generous timeout is the root cause of almost every cascade failure I've investigated.
Timeout Strategy
// Every external call must have a timeout
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 5000);
try {
const response = await fetch('https://api.example.com/data', {
signal: controller.signal,
});
return response.json();
} finally {
clearTimeout(timeoutId);
}
// Layered timeouts: each layer shorter than the one above
// API Gateway: 30s (total request budget)
// Service A → B: 10s
// Service B → DB: 3s
// This ensures that inner timeouts fire before outer ones,
// giving each layer a chance to handle the failure gracefully.
Timeout Guidelines
| Call Type | Recommended Timeout | Rationale |
|---|---|---|
| Database query | 1-5s | Queries should be fast; slow queries indicate a problem |
| Cache (Redis) | 100-500ms | Cache misses should be fast; if cache is slow, skip it |
| Internal service call | 3-10s | Depends on the operation; read vs write |
| External API | 5-30s | You don't control external services |
| File upload | 30-120s | Large payloads need more time |
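One way to keep these numbers out of scattered call sites is a lookup table plus a thin wrapper. A sketch, assuming `AbortSignal.timeout` is available (Node 17.3+ or modern browsers); the values mirror the table above and are starting points to tune, not gospel:

```typescript
type CallType = 'db' | 'cache' | 'internal' | 'external' | 'upload';

// Defaults taken from the guideline table above
const DEFAULT_TIMEOUT_MS: Record<CallType, number> = {
  db: 3000,
  cache: 300,
  internal: 5000,
  external: 15000,
  upload: 60000,
};

// Hypothetical helper: every fetch gets a timeout appropriate to its call type
function fetchWithTimeout(url: string, type: CallType, init: RequestInit = {}) {
  return fetch(url, {
    ...init,
    signal: AbortSignal.timeout(DEFAULT_TIMEOUT_MS[type]),
  });
}
```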
Part 5: Bulkheads — Isolating Failures
The bulkhead pattern, named after the compartments in a ship's hull, isolates different parts of your system so that a failure in one doesn't sink the whole ship.
Thread Pool Bulkheads
// Separate thread pools (connection pools in Node.js) for different dependencies
import { Agent } from 'undici';
// Payment service gets its own connection pool
const paymentAgent = new Agent({
connections: 20, // Max 20 concurrent connections
pipelining: 1,
connectTimeout: 5000,
});
// User service gets its own connection pool
const userAgent = new Agent({
connections: 50, // More connections (higher traffic)
pipelining: 1,
connectTimeout: 5000,
});
// If payment service exhausts its pool, user service is unaffected
async function getPayment(id: string) {
return fetch(`https://payment-service/payments/${id}`, {
dispatcher: paymentAgent,
});
}
async function getUser(id: string) {
return fetch(`https://user-service/users/${id}`, {
dispatcher: userAgent,
});
}
Semaphore Bulkheads
// Limit concurrent calls to a dependency
class Semaphore {
private permits: number;
private waiting: (() => void)[] = [];
constructor(permits: number) {
this.permits = permits;
}
async acquire(): Promise<void> {
if (this.permits > 0) {
this.permits--;
return;
}
return new Promise<void>(resolve => {
this.waiting.push(resolve);
});
}
release(): void {
if (this.waiting.length > 0) {
const next = this.waiting.shift()!;
next();
} else {
this.permits++;
}
}
}
// Only allow 10 concurrent payment processing calls
const paymentSemaphore = new Semaphore(10);
async function processPayment(orderId: string) {
await paymentSemaphore.acquire();
try {
return await paymentService.charge(orderId);
} finally {
paymentSemaphore.release();
}
}
Part 6: Fallback Strategies
When a service is unavailable (circuit open, retries exhausted), what do you return? The fallback strategy depends on the business context:
Fallback Hierarchy
- Cache fallback: Return stale data from cache. For read operations, this is often acceptable. "Here are the job listings as of 5 minutes ago" is better than "Service unavailable."
- Default value: Return a sensible default. A recommendation engine might return popular items instead of personalized ones.
- Degraded mode: Disable the feature entirely. If payment processing is down, allow users to browse but show "Checkout temporarily unavailable."
- Queue for later: Accept the request, put it in a queue, and process it when the dependency recovers. Best for write operations that aren't time-sensitive.
// Fallback hierarchy example
async function getJobRecommendations(userId: string) {
try {
// Primary: personalized recommendations from ML service
return await recommendationCircuit.execute(
() => mlService.getRecommendations(userId)
);
} catch {
try {
// Fallback 1: cached recommendations
const cached = await redis.get(`recommendations:${userId}`);
if (cached) return JSON.parse(cached);
} catch { /* redis also down */ }
try {
// Fallback 2: popular jobs (pre-computed, stored locally)
return await getPopularJobs();
} catch {
// Fallback 3: empty with message
return {
jobs: [],
message: 'Recommendations temporarily unavailable',
};
}
}
}
Part 7: Putting It All Together
Here's how all these patterns combine in a real service:
// resilient-client.ts
import CircuitBreaker from 'opossum';
import { Agent } from 'undici';
// withRetry and RetryBudget are the helpers defined in Part 3
function createResilientClient(serviceName: string, baseUrl: string) {
// Bulkhead: dedicated connection pool
const agent = new Agent({ connections: 20, connectTimeout: 5000 });
// Retry budget: limit retry storms
const retryBudget = new RetryBudget(60000, 0.1);
// Core request function with timeout
async function makeRequest(path: string, options: RequestInit = {}) {
const response = await fetch(`${baseUrl}${path}`, {
...options,
dispatcher: agent,
signal: AbortSignal.timeout(5000), // Timeout
});
if (!response.ok) {
const error = new Error(`${serviceName} error: ${response.status}`);
(error as any).status = response.status;
throw error;
}
return response.json();
}
// Wrap with retries
async function makeRequestWithRetry(path: string, options?: RequestInit) {
retryBudget.recordRequest();
let firstAttempt = true;
return withRetry(
() => {
// Record retries against the budget so canRetry() stays accurate
if (!firstAttempt) retryBudget.recordRetry();
firstAttempt = false;
return makeRequest(path, options);
},
{
maxRetries: retryBudget.canRetry() ? 2 : 0,
baseDelay: 200,
maxDelay: 2000,
jitterFactor: 0.5,
}
);
}
// Wrap with circuit breaker
const breaker = new CircuitBreaker(makeRequestWithRetry, {
timeout: 10000,
errorThresholdPercentage: 50,
resetTimeout: 30000,
volumeThreshold: 10,
});
// Metrics
breaker.on('success', () => metrics.increment(`${serviceName}.success`));
breaker.on('failure', () => metrics.increment(`${serviceName}.failure`));
breaker.on('open', () => metrics.increment(`${serviceName}.circuit_open`));
return {
get: (path: string) => breaker.fire(path),
post: (path: string, body: any) =>
breaker.fire(path, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(body),
}),
};
}
// Usage:
const paymentClient = createResilientClient('payment', 'https://payment-service');
const userClient = createResilientClient('user', 'https://user-service');
const payment = await paymentClient.post('/charge', { orderId: '123' });
Part 8: My Opinionated Take
After years of building and operating microservices, here are my strongest opinions:
1. Timeouts are more important than circuit breakers. A circuit breaker without a timeout is useless — the circuit never opens because calls hang forever instead of failing. Set timeouts first, add circuit breakers second.
2. Retries should be opt-in, not default. Most teams add retries everywhere and wonder why their systems are less reliable. Retries amplify load during failures. Use them only for idempotent operations and always with backoff + jitter.
3. You need fewer microservices than you think. Every network boundary is a reliability risk. If two services always deploy together and always call each other, they should probably be one service. The microservices.io patterns catalog has good guidance on when to split.
4. Test failure modes, not just success paths. Chaos engineering isn't just for Netflix. Kill a dependency in staging and verify that your fallbacks work. The first time you discover your circuit breaker is misconfigured should not be at 3 AM.
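A minimal fault injector makes this kind of testing cheap. A sketch, not a library API — `withFaults` and its options are hypothetical names:

```typescript
// Hypothetical fault injector for staging chaos tests: wraps any async call
// and injects artificial latency plus a configurable probability of failure.
interface FaultOptions {
  errorRate: number;      // 0..1 probability of throwing an injected error
  addedLatencyMs: number; // artificial delay before every call
}

function withFaults<T>(fn: () => Promise<T>, opts: FaultOptions): () => Promise<T> {
  return async () => {
    await new Promise(resolve => setTimeout(resolve, opts.addedLatencyMs));
    if (Math.random() < opts.errorRate) {
      throw new Error('Injected fault');
    }
    return fn();
  };
}
```

Wrap a staging client with `withFaults(call, { errorRate: 0.5, addedLatencyMs: 2000 })` and watch whether your circuit opens, your timeouts fire, and your fallbacks return sensible data.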
Part 9: Action Plan
Week 1: Audit
- List every external dependency for each service
- Check: does every call have a timeout? (Bet: half of them don't)
- Identify: which failures would cascade to other services?
- Review: are retries configured? With backoff? With jitter?
Week 2: Implement
- Add timeouts to every external call
- Add circuit breakers to the most critical dependencies
- Implement fallbacks for read paths (cache, defaults)
- Add retry budgets to prevent retry storms
Week 3: Test
- Kill each dependency in staging and observe behavior
- Add latency to each dependency and observe circuit breaker behavior
- Verify that fallbacks return sensible data
- Set up alerts for circuit state changes
Ongoing
- Monitor circuit breaker state and failure rates in production
- Run game days: intentionally inject failures and practice recovery
- Review and tune thresholds quarterly
Sources
- Martin Fowler: Circuit Breaker
- Microsoft Azure: Circuit Breaker Pattern
- Amazon Builders' Library: Timeouts, Retries, and Backoff with Jitter
- Google SRE Book: Handling Overload
- Microservices.io: Decomposition Patterns
- Resilience4j Documentation
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
