The Complete Guide to API Rate Limiting and Quota Management
BirJob scrapes 80+ job listing websites daily. Every single one of those sites is an API of some kind — whether it's a formal REST API or an HTML page we parse. And every single one has rate limits, even if they don't document them. I learned this the hard way when our scraper IP got banned from three major job boards in one week because we were hitting them too aggressively.
But rate limiting isn't just about being a good API consumer. If you build APIs (and most developers do), you need to protect your own services from abuse, ensure fair usage across clients, and prevent a single misbehaving client from taking down your entire platform.
This guide covers both sides: implementing rate limiting in your APIs, and respecting rate limits when consuming others'. We'll go deep on algorithms, architectures, and the operational reality of managing API quotas at scale.
Part 1: Why Rate Limiting Matters
Without rate limiting, any publicly accessible API is vulnerable to:
- Denial of Service (DoS): A single client sending millions of requests can overwhelm your server
- Resource exhaustion: A buggy client in a retry loop can consume all your database connections
- Cost explosion: If you're on usage-based cloud pricing, runaway API calls translate directly to runaway costs
- Unfair usage: One heavy user degrades performance for everyone else
According to Cloudflare's bot traffic analysis, roughly 30% of internet traffic is automated bot traffic — and without rate limiting, nothing stops that traffic from crowding out your legitimate users.
Part 2: Rate Limiting Algorithms
1. Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window.
// Fixed Window implementation
class FixedWindowLimiter {
private counts: Map<string, { count: number; windowStart: number }> = new Map();
private windowSize: number; // in ms
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const windowStart = Math.floor(now / this.windowSize) * this.windowSize;
const entry = this.counts.get(clientId);
if (!entry || entry.windowStart !== windowStart) {
this.counts.set(clientId, { count: 1, windowStart });
return true;
}
if (entry.count >= this.maxRequests) {
return false;
}
entry.count++;
return true;
}
}
// 100 requests per minute
const limiter = new FixedWindowLimiter(60000, 100);
Pros: Simple to implement, low memory usage.
Cons: Boundary problem — a client can send 100 requests at 0:59 and 100 more at 1:00, effectively getting 200 requests in 2 seconds.
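To make the boundary problem concrete, here is the same fixed-window logic rewritten with an injectable timestamp (an illustrative test harness, not production code) so we can replay the worst case deterministically:

```typescript
// Fixed-window logic with an injectable clock, to demonstrate the boundary problem
function makeFixedWindow(windowMs: number, max: number) {
  let windowStart = -1;
  let count = 0;
  return (now: number): boolean => {
    const start = Math.floor(now / windowMs) * windowMs;
    if (start !== windowStart) {
      windowStart = start;
      count = 0; // new window: counter resets completely
    }
    if (count >= max) return false;
    count++;
    return true;
  };
}

const allow = makeFixedWindow(60_000, 100);
let passed = 0;
// 100 requests at t = 59s (end of window 1), 100 more at t = 60s (start of window 2)
for (let i = 0; i < 100; i++) if (allow(59_000)) passed++;
for (let i = 0; i < 100; i++) if (allow(60_000)) passed++;
// passed is 200 — double the "100 per minute" limit, in about one second
```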
2. Sliding Window Log
Keeps a log of all request timestamps and counts how many fall within the current window.
class SlidingWindowLogLimiter {
private logs: Map<string, number[]> = new Map();
private windowSize: number;
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const windowStart = now - this.windowSize;
let timestamps = this.logs.get(clientId) || [];
// Remove expired entries
timestamps = timestamps.filter(t => t > windowStart);
if (timestamps.length >= this.maxRequests) {
this.logs.set(clientId, timestamps);
return false;
}
timestamps.push(now);
this.logs.set(clientId, timestamps);
return true;
}
}
Pros: Perfectly accurate, no boundary problem.
Cons: High memory usage (stores every timestamp). Not practical for high-traffic APIs.
3. Sliding Window Counter
A hybrid that approximates the sliding window using the current and previous window counts. This is what most production systems use, according to Cloudflare's engineering blog.
class SlidingWindowCounterLimiter {
private windows: Map<string, { current: number; previous: number; currentStart: number }> = new Map();
private windowSize: number;
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const currentWindow = Math.floor(now / this.windowSize) * this.windowSize;
let entry = this.windows.get(clientId);
if (!entry || currentWindow - entry.currentStart >= this.windowSize * 2) {
entry = { current: 0, previous: 0, currentStart: currentWindow };
this.windows.set(clientId, entry);
} else if (currentWindow !== entry.currentStart) {
entry.previous = entry.current;
entry.current = 0;
entry.currentStart = currentWindow;
}
// Weighted count: previous window weight based on elapsed time
const elapsed = now - currentWindow;
const previousWeight = 1 - (elapsed / this.windowSize);
const estimatedCount = entry.previous * previousWeight + entry.current;
if (estimatedCount >= this.maxRequests) {
return false;
}
entry.current++;
return true;
}
}
Pros: Good accuracy, low memory, handles boundary problem.
Cons: Approximate (but close enough for production).
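A quick worked example of the weighted estimate, using the formula from `isAllowed` above (the numbers are illustrative):

```typescript
// 60s windows, limit 100. We are 15 seconds into the current window.
// The previous window saw 80 requests; the current one has 30 so far.
const previous = 80;
const current = 30;
const previousWeight = 1 - 15 / 60;                      // 0.75
const estimated = previous * previousWeight + current;   // 80 * 0.75 + 30 = 90
// 90 < 100, so the next request is allowed.
```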
4. Token Bucket
The most flexible algorithm. A bucket holds tokens; each request consumes a token. Tokens are added at a fixed rate. If the bucket is empty, requests are rejected. The bucket has a maximum capacity, allowing bursts up to that capacity.
class TokenBucketLimiter {
private buckets: Map<string, { tokens: number; lastRefill: number }> = new Map();
private maxTokens: number;
private refillRate: number; // tokens per millisecond
constructor(maxTokens: number, refillRatePerSecond: number) {
this.maxTokens = maxTokens;
this.refillRate = refillRatePerSecond / 1000;
}
isAllowed(clientId: string, tokensRequired: number = 1): boolean {
const now = Date.now();
let bucket = this.buckets.get(clientId);
if (!bucket) {
bucket = { tokens: this.maxTokens, lastRefill: now };
this.buckets.set(clientId, bucket);
}
// Refill tokens based on elapsed time
const elapsed = now - bucket.lastRefill;
bucket.tokens = Math.min(
this.maxTokens,
bucket.tokens + elapsed * this.refillRate
);
bucket.lastRefill = now;
if (bucket.tokens < tokensRequired) {
return false;
}
bucket.tokens -= tokensRequired;
return true;
}
}
// 100 tokens max, refills at 10 per second
// Allows bursts of 100, sustained rate of 10/s
const limiter = new TokenBucketLimiter(100, 10);
Pros: Allows bursts, configurable sustained rate, simple mental model.
Cons: Slightly more complex than fixed window.
5. Leaky Bucket
Similar to token bucket but with a queue. Requests are added to the bucket (queue) and processed at a fixed rate. If the bucket is full, new requests are rejected.
Pros: Smooth output rate (good for rate-limiting outgoing requests).
Cons: Adds latency (requests wait in queue). Less common for API rate limiting.
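The queueing variant needs a timer to drain requests, which makes a short example awkward. Here instead is a minimal sketch of the closely related "leaky bucket as a meter" variant, which enforces the same smooth rate without a queue (and therefore without the added latency). This is my illustration, not code from a specific library; the clock is injectable for testability:

```typescript
// Leaky bucket as a meter: the "water level" rises by 1 per request and
// leaks out at a fixed rate. Requests that would overflow are rejected.
class LeakyBucketLimiter {
  private level = 0;             // current water level
  private lastLeak: number;      // last time leakage was applied
  private capacity: number;      // bucket size (max accepted burst)
  private leakRatePerSec: number; // drain rate

  constructor(capacity: number, leakRatePerSec: number) {
    this.capacity = capacity;
    this.leakRatePerSec = leakRatePerSec;
    this.lastLeak = Date.now();
  }

  isAllowed(now: number = Date.now()): boolean {
    // Leak water proportional to elapsed time
    const elapsed = (now - this.lastLeak) / 1000;
    this.level = Math.max(0, this.level - elapsed * this.leakRatePerSec);
    this.lastLeak = now;
    if (this.level + 1 > this.capacity) return false; // would overflow
    this.level += 1;
    return true;
  }
}
```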
Algorithm Comparison
| Algorithm | Accuracy | Memory | Burst Handling | Complexity | Best For |
|---|---|---|---|---|---|
| Fixed Window | Low | Very Low | Allows 2x at boundary | Simple | Basic protection |
| Sliding Window Log | Perfect | High | No bursts | Medium | Low-traffic precise limits |
| Sliding Window Counter | Good | Low | Smooth | Medium | Production APIs |
| Token Bucket | Good | Low | Configurable bursts | Medium | APIs needing burst tolerance |
| Leaky Bucket | Good | Medium | No bursts (queued) | Medium | Smoothing outbound requests |
Part 3: Distributed Rate Limiting with Redis
In-memory rate limiting works for a single server. For multiple servers behind a load balancer, you need a shared store. Redis is the standard choice, thanks to its atomic operations and sub-millisecond latency.
Redis Token Bucket with Lua Script
-- rate_limit.lua (atomic operation)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
tokens = max_tokens
last_refill = now
end
-- Refill
local elapsed = (now - last_refill) / 1000
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)
-- Check
if new_tokens < requested then
-- Rejected: update tokens but don't consume
redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('pexpire', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
return {0, math.ceil((requested - new_tokens) / refill_rate * 1000)}
end
-- Allowed: consume tokens
new_tokens = new_tokens - requested
redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('pexpire', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
return {1, 0}
// Node.js usage
import Redis from 'ioredis';
import { readFileSync } from 'fs';
const redis = new Redis();
const luaScript = readFileSync('./rate_limit.lua', 'utf8');
async function checkRateLimit(
clientId: string,
maxTokens: number = 100,
refillRate: number = 10
): Promise<{ allowed: boolean; retryAfterMs: number }> {
const [allowed, retryAfter] = await redis.eval(
luaScript,
1, // number of keys
`rate_limit:${clientId}`, // KEYS[1]
maxTokens, // ARGV[1]
refillRate, // ARGV[2]
Date.now(), // ARGV[3]
1 // ARGV[4]
) as [number, number];
return {
allowed: allowed === 1,
retryAfterMs: retryAfter,
};
}
Part 4: HTTP Headers and Client Communication
Good rate limiting communicates clearly with clients. The IETF draft on rate limit headers standardizes these headers:
// Express middleware (assumes a limiter returning { allowed, remaining, resetTime, retryAfterMs })
async function rateLimitMiddleware(req, res, next) {
const clientId = req.headers['x-api-key'] || req.ip;
const result = await checkRateLimit(clientId);
// Set standard headers
res.setHeader('RateLimit-Limit', '100'); // Max requests per window
res.setHeader('RateLimit-Remaining', result.remaining); // Requests left
res.setHeader('RateLimit-Reset', result.resetTime); // Seconds until the window resets (delta-seconds per the IETF draft; many legacy APIs send a Unix timestamp instead)
// Legacy headers (still widely used)
res.setHeader('X-RateLimit-Limit', '100');
res.setHeader('X-RateLimit-Remaining', result.remaining);
res.setHeader('X-RateLimit-Reset', result.resetTime);
if (!result.allowed) {
res.setHeader('Retry-After', Math.ceil(result.retryAfterMs / 1000));
return res.status(429).json({
error: 'Too Many Requests',
message: `Rate limit exceeded. Retry after ${result.retryAfterMs}ms`,
retryAfter: result.retryAfterMs,
});
}
next();
}
Response Format for 429 Errors
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30
Retry-After: 30
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "You have exceeded the rate limit of 100 requests per minute.",
"retryAfter": 30,
"documentation": "https://api.birjob.com/docs/rate-limiting"
}
}
Part 5: Quota Management — Beyond Simple Rate Limits
Rate limiting (requests per second/minute) is the first layer. Quota management adds longer-term limits tied to subscription tiers, billing, and usage policies.
Multi-Tier Quota System
// Quota tiers
const TIERS = {
free: {
requestsPerMinute: 30,
requestsPerDay: 1000,
requestsPerMonth: 10000,
maxResponseSize: '1MB',
endpoints: ['GET /jobs', 'GET /jobs/:id'],
},
starter: {
requestsPerMinute: 100,
requestsPerDay: 10000,
requestsPerMonth: 100000,
maxResponseSize: '10MB',
endpoints: ['*'],
},
professional: {
requestsPerMinute: 500,
requestsPerDay: 50000,
requestsPerMonth: 500000,
maxResponseSize: '50MB',
endpoints: ['*'],
},
enterprise: {
requestsPerMinute: 2000,
requestsPerDay: -1, // unlimited
requestsPerMonth: -1,
maxResponseSize: '100MB',
endpoints: ['*'],
},
};
// Quota check middleware. Assumes a checkRateLimit(key, limit, windowSeconds)
// helper returning { allowed, type } — note this signature differs from the
// token-bucket checkRateLimit in Part 3.
async function quotaMiddleware(req, res, next) {
const apiKey = req.headers['x-api-key'];
const client = await getClientByApiKey(apiKey);
const tier = TIERS[client.tier];
// Check every quota level with a finite limit (-1 means unlimited, so skip it)
const levels = [
{ key: `${apiKey}:minute`, limit: tier.requestsPerMinute, window: 60 },
{ key: `${apiKey}:day`, limit: tier.requestsPerDay, window: 86400 },
{ key: `${apiKey}:month`, limit: tier.requestsPerMonth, window: 2592000 },
].filter(l => l.limit !== -1);
const checks = await Promise.all(
levels.map(l => checkRateLimit(l.key, l.limit, l.window))
);
const failed = checks.find(c => !c.allowed);
if (failed) {
return res.status(429).json({
error: 'Quota exceeded',
quotaType: failed.type,
upgrade: `https://api.birjob.com/pricing`,
});
}
next();
}
Part 6: Being a Good API Consumer
When consuming external APIs (as BirJob does with 80+ job sites), respecting rate limits is essential. Get banned, and you lose access entirely.
Exponential Backoff on 429
async function fetchWithRateLimit(url: string, maxRetries = 5): Promise<Response> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const response = await fetch(url);
if (response.status !== 429) {
return response;
}
// Respect Retry-After header
const retryAfter = response.headers.get('Retry-After');
let waitMs: number;
if (retryAfter) {
// Could be seconds or an HTTP date; clamp to avoid a negative wait
waitMs = isNaN(Number(retryAfter))
? Math.max(0, new Date(retryAfter).getTime() - Date.now())
: Number(retryAfter) * 1000;
} else {
// Exponential backoff with jitter
waitMs = Math.min(1000 * Math.pow(2, attempt) + Math.random() * 1000, 60000);
}
console.log(`Rate limited. Waiting ${waitMs}ms before retry ${attempt + 1}/${maxRetries}`);
await new Promise(resolve => setTimeout(resolve, waitMs));
}
throw new Error(`Rate limit exceeded after ${maxRetries} retries`);
}
Request Scheduling
// Rate-limited request queue
class RequestScheduler {
private queue: Array<{ fn: () => Promise<any>; resolve: Function; reject: Function }> = [];
private processing = false;
private requestsThisWindow = 0;
private windowStart = Date.now();
private maxPerWindow: number;
private windowMs: number;
constructor(maxPerWindow: number, windowMs: number) {
this.maxPerWindow = maxPerWindow;
this.windowMs = windowMs;
}
async schedule<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push({ fn, resolve, reject });
this.process();
});
}
private async process() {
if (this.processing) return;
this.processing = true;
while (this.queue.length > 0) {
const now = Date.now();
if (now - this.windowStart >= this.windowMs) {
this.windowStart = now;
this.requestsThisWindow = 0;
}
if (this.requestsThisWindow >= this.maxPerWindow) {
const waitTime = this.windowMs - (now - this.windowStart);
await new Promise(resolve => setTimeout(resolve, waitTime));
continue;
}
const item = this.queue.shift()!;
this.requestsThisWindow++;
try {
const result = await item.fn();
item.resolve(result);
} catch (error) {
item.reject(error);
}
}
this.processing = false;
}
}
// Use in BirJob scraper: max 5 requests per second to any single site
const scheduler = new RequestScheduler(5, 1000);
const result = await scheduler.schedule(() => fetch('https://jobs.example.com/api/listings'));
Part 7: My Opinionated Take
After building rate limiters and being rate-limited by dozens of APIs, here's what I've learned:
1. Start with the token bucket algorithm. It handles bursts gracefully and is intuitive. Unless you have a specific reason to choose something else, token bucket is the right default.
2. Always use Redis for distributed rate limiting. In-process rate limiting breaks the moment you have two servers. Even if you're on a single server today, use Redis from the start. The Redis documentation on rate limiting patterns provides production-ready solutions.
3. Rate limits should be per-endpoint, not just per-client. A GET /jobs endpoint might handle 1000 requests/minute, but a POST /jobs endpoint that triggers database writes might only handle 10/minute. Different endpoints have different costs.
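One way to sketch this: key the limiter by client and endpoint, with a per-route limits table. The route names and numbers below are hypothetical, not BirJob's actual configuration:

```typescript
// Per-endpoint limits (illustrative values): cheap reads get a big budget,
// expensive writes get a small one. Unknown routes fall back to a default.
const ENDPOINT_LIMITS: Record<string, { maxTokens: number; refillPerSec: number }> = {
  'GET /jobs': { maxTokens: 1000, refillPerSec: 16 },
  'POST /jobs': { maxTokens: 10, refillPerSec: 0.2 },
};
const DEFAULT_LIMIT = { maxTokens: 100, refillPerSec: 2 };

function limitKeyFor(clientId: string, method: string, route: string) {
  const endpoint = `${method} ${route}`;
  const limit = ENDPOINT_LIMITS[endpoint] ?? DEFAULT_LIMIT;
  // One bucket per client per endpoint, e.g. "rate_limit:abc123:POST /jobs"
  return { key: `rate_limit:${clientId}:${endpoint}`, ...limit };
}
```

Feed the returned key and limits into whichever bucket implementation you use, and each client gets an independent budget per route.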
4. Communicate rate limits clearly. Return remaining quota in headers, provide a useful error message, include a Retry-After header, and link to documentation. The difference between a frustrating API and a pleasant one is often just the error messages.
5. As a consumer: always respect Retry-After. Never retry immediately on a 429. Always implement exponential backoff. And always read the API documentation before starting — many providers will ban you permanently for repeated violations.
Action Plan
For API Providers
- Implement token bucket rate limiting with Redis
- Return standard rate limit headers on every response
- Return clear 429 responses with Retry-After
- Set up different limits per tier and per endpoint
- Monitor rate limit hit rates — if legitimate users are frequently limited, your limits are too strict
For API Consumers
- Read the API documentation for rate limit policies
- Implement exponential backoff with jitter for 429 responses
- Use a request scheduler to stay under limits proactively
- Cache responses to reduce unnecessary API calls
- Monitor your usage against quotas to avoid surprises
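The caching point above can be as simple as a tiny in-memory TTL cache in front of your fetch calls. A minimal sketch (the TTL and key names are assumptions; the clock is injectable for testing):

```typescript
// In-memory TTL cache: serve repeat lookups locally instead of
// spending rate-limit budget on identical API calls.
class TtlCache<T> {
  private store = new Map<string, { value: T; expires: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): T | undefined {
    const hit = this.store.get(key);
    if (!hit || hit.expires <= now) {
      this.store.delete(key); // expired or missing
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Check the cache before calling the external API, and populate it on every successful response; a 5-to-15-minute TTL is often enough for slowly changing data like job listings.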
Sources
- Cloudflare: What Is Rate Limiting?
- Cloudflare Engineering: Counting Things
- IETF: Rate Limit Headers Draft
- Redis: Rate Limiting Patterns
- Stripe: Rate Limiting Best Practices
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
