The Complete Guide to API Rate Limiting and Quota Management
BirJob scrapes 80+ job listing websites daily. Every single one of those sites is an API of some kind — whether it's a formal REST API or an HTML page we parse. And every single one has rate limits, even if they don't document them. I learned this the hard way when our scraper IP got banned from three major job boards in one week because we were hitting them too aggressively.
But rate limiting isn't just about being a good API consumer. If you build APIs (and most developers do), you need to protect your own services from abuse, ensure fair usage across clients, and prevent a single misbehaving client from taking down your entire platform.
This guide covers both sides: implementing rate limiting in your APIs, and respecting rate limits when consuming others'. We'll go deep on algorithms, architectures, and the operational reality of managing API quotas at scale.
Part 1: Why Rate Limiting Matters
Without rate limiting, any publicly accessible API is vulnerable to:
- Denial of Service (DoS): A single client sending millions of requests can overwhelm your server
- Resource exhaustion: A buggy client in a retry loop can consume all your database connections
- Cost explosion: If you're on usage-based cloud pricing, runaway API calls translate directly to runaway costs
- Unfair usage: One heavy user degrades performance for everyone else
According to Cloudflare's bot traffic analysis, roughly 30% of internet traffic is automated bot traffic — and without rate limiting, nothing stops that traffic from crowding out your legitimate users.
Part 2: Rate Limiting Algorithms
1. Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window.
// Fixed Window implementation
class FixedWindowLimiter {
private counts: Map<string, { count: number; windowStart: number }> = new Map();
private windowSize: number; // in ms
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const windowStart = Math.floor(now / this.windowSize) * this.windowSize;
const entry = this.counts.get(clientId);
if (!entry || entry.windowStart !== windowStart) {
this.counts.set(clientId, { count: 1, windowStart });
return true;
}
if (entry.count >= this.maxRequests) {
return false;
}
entry.count++;
return true;
}
}
// 100 requests per minute
const limiter = new FixedWindowLimiter(60000, 100);
Pros: Simple to implement, low memory usage.
Cons: Boundary problem — a client can send 100 requests at 0:59 and 100 more at 1:00, effectively getting 200 requests in 2 seconds.
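To make the boundary problem concrete, here is the same fixed-window logic rewritten with an injectable timestamp (an illustrative test harness, not production code) so we can replay the worst case deterministically:

```typescript
// Fixed-window logic with an injectable clock, to demonstrate the boundary problem
function makeFixedWindow(windowMs: number, max: number) {
  let windowStart = -1;
  let count = 0;
  return (now: number): boolean => {
    const start = Math.floor(now / windowMs) * windowMs;
    if (start !== windowStart) {
      windowStart = start;
      count = 0; // new window: counter resets completely
    }
    if (count >= max) return false;
    count++;
    return true;
  };
}

const allow = makeFixedWindow(60_000, 100);
let passed = 0;
// 100 requests at t = 59s (end of window 1), 100 more at t = 60s (start of window 2)
for (let i = 0; i < 100; i++) if (allow(59_000)) passed++;
for (let i = 0; i < 100; i++) if (allow(60_000)) passed++;
// passed is 200 — double the "100 per minute" limit, in about one second
```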
2. Sliding Window Log
Keeps a log of all request timestamps and counts how many fall within the current window.
class SlidingWindowLogLimiter {
private logs: Map<string, number[]> = new Map();
private windowSize: number;
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const windowStart = now - this.windowSize;
let timestamps = this.logs.get(clientId) || [];
// Remove expired entries
timestamps = timestamps.filter(t => t > windowStart);
if (timestamps.length >= this.maxRequests) {
this.logs.set(clientId, timestamps);
return false;
}
timestamps.push(now);
this.logs.set(clientId, timestamps);
return true;
}
}
Pros: Perfectly accurate, no boundary problem.
Cons: High memory usage (stores every timestamp). Not practical for high-traffic APIs.
3. Sliding Window Counter
A hybrid that approximates the sliding window using the current and previous window counts. This is what most production systems use, according to Cloudflare's engineering blog.
class SlidingWindowCounterLimiter {
private windows: Map<string, { current: number; previous: number; currentStart: number }> = new Map();
private windowSize: number;
private maxRequests: number;
constructor(windowSizeMs: number, maxRequests: number) {
this.windowSize = windowSizeMs;
this.maxRequests = maxRequests;
}
isAllowed(clientId: string): boolean {
const now = Date.now();
const currentWindow = Math.floor(now / this.windowSize) * this.windowSize;
let entry = this.windows.get(clientId);
if (!entry || currentWindow - entry.currentStart >= this.windowSize * 2) {
entry = { current: 0, previous: 0, currentStart: currentWindow };
this.windows.set(clientId, entry);
} else if (currentWindow !== entry.currentStart) {
entry.previous = entry.current;
entry.current = 0;
entry.currentStart = currentWindow;
}
// Weighted count: previous window weight based on elapsed time
const elapsed = now - currentWindow;
const previousWeight = 1 - (elapsed / this.windowSize);
const estimatedCount = entry.previous * previousWeight + entry.current;
if (estimatedCount >= this.maxRequests) {
return false;
}
entry.current++;
return true;
}
}
Pros: Good accuracy, low memory, handles boundary problem.
Cons: Approximate (but close enough for production).
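A quick worked example of the weighted estimate, using the formula from `isAllowed` above (the numbers are illustrative):

```typescript
// 60s windows, limit 100. We are 15 seconds into the current window.
// The previous window saw 80 requests; the current one has 30 so far.
const previous = 80;
const current = 30;
const previousWeight = 1 - 15 / 60;                      // 0.75
const estimated = previous * previousWeight + current;   // 80 * 0.75 + 30 = 90
// 90 < 100, so the next request is allowed.
```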
4. Token Bucket
The most flexible algorithm. A bucket holds tokens; each request consumes a token. Tokens are added at a fixed rate. If the bucket is empty, requests are rejected. The bucket has a maximum capacity, allowing bursts up to that capacity.
class TokenBucketLimiter {
private buckets: Map<string, { tokens: number; lastRefill: number }> = new Map();
private maxTokens: number;
private refillRate: number; // tokens per millisecond
constructor(maxTokens: number, refillRatePerSecond: number) {
this.maxTokens = maxTokens;
this.refillRate = refillRatePerSecond / 1000;
}
isAllowed(clientId: string, tokensRequired: number = 1): boolean {
const now = Date.now();
let bucket = this.buckets.get(clientId);
if (!bucket) {
bucket = { tokens: this.maxTokens, lastRefill: now };
this.buckets.set(clientId, bucket);
}
// Refill tokens based on elapsed time
const elapsed = now - bucket.lastRefill;
bucket.tokens = Math.min(
this.maxTokens,
bucket.tokens + elapsed * this.refillRate
);
bucket.lastRefill = now;
if (bucket.tokens < tokensRequired) {
return false;
}
bucket.tokens -= tokensRequired;
return true;
}
}
// 100 tokens max, refills at 10 per second
// Allows bursts of 100, sustained rate of 10/s
const limiter = new TokenBucketLimiter(100, 10);
Pros: Allows bursts, configurable sustained rate, simple mental model.
Cons: Slightly more complex than fixed window.
5. Leaky Bucket
Similar to token bucket but with a queue. Requests are added to the bucket (queue) and processed at a fixed rate. If the bucket is full, new requests are rejected.
Pros: Smooth output rate (good for rate-limiting outgoing requests).
Cons: Adds latency (requests wait in queue). Less common for API rate limiting.
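The queueing variant needs a timer to drain requests, which makes a short example awkward. Here instead is a minimal sketch of the closely related "leaky bucket as a meter" variant, which enforces the same smooth rate without a queue (and therefore without the added latency). This is my illustration, not code from a specific library; the clock is injectable for testability:

```typescript
// Leaky bucket as a meter: the "water level" rises by 1 per request and
// leaks out at a fixed rate. Requests that would overflow are rejected.
class LeakyBucketLimiter {
  private level = 0;             // current water level
  private lastLeak: number;      // last time leakage was applied
  private capacity: number;      // bucket size (max accepted burst)
  private leakRatePerSec: number; // drain rate

  constructor(capacity: number, leakRatePerSec: number) {
    this.capacity = capacity;
    this.leakRatePerSec = leakRatePerSec;
    this.lastLeak = Date.now();
  }

  isAllowed(now: number = Date.now()): boolean {
    // Leak water proportional to elapsed time
    const elapsed = (now - this.lastLeak) / 1000;
    this.level = Math.max(0, this.level - elapsed * this.leakRatePerSec);
    this.lastLeak = now;
    if (this.level + 1 > this.capacity) return false; // would overflow
    this.level += 1;
    return true;
  }
}
```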
Algorithm Comparison
| Algorithm | Accuracy | Memory | Burst Handling | Complexity | Best For |
|---|---|---|---|---|---|
| Fixed Window | Low | Very Low | Allows 2x at boundary | Simple | Basic protection |
| Sliding Window Log | Perfect | High | No bursts | Medium | Low-traffic precise limits |
| Sliding Window Counter | Good | Low | Smooth | Medium | Production APIs |
| Token Bucket | Good | Low | Configurable bursts | Medium | APIs needing burst tolerance |
| Leaky Bucket | Good | Medium | No bursts (queued) | Medium | Smoothing outbound requests |
Part 3: Distributed Rate Limiting with Redis
In-memory rate limiting works for a single server. For multiple servers behind a load balancer, you need a shared store. Redis is the standard choice, thanks to its atomic operations and sub-millisecond latency.
Redis Token Bucket with Lua Script
-- rate_limit.lua (atomic operation)
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens == nil then
tokens = max_tokens
last_refill = now
end
-- Refill
local elapsed = (now - last_refill) / 1000
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)
-- Check
if new_tokens < requested then
-- Rejected: update tokens but don't consume
redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('pexpire', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
return {0, math.ceil((requested - new_tokens) / refill_rate * 1000)}
end
-- Allowed: consume tokens
new_tokens = new_tokens - requested
redis.call('hmset', key, 'tokens', new_tokens, 'last_refill', now)
redis.call('pexpire', key, math.ceil(max_tokens / refill_rate * 1000) + 1000)
return {1, 0}
// Node.js usage
import Redis from 'ioredis';
import { readFileSync } from 'fs';
const redis = new Redis();
const luaScript = readFileSync('./rate_limit.lua', 'utf8');
async function checkRateLimit(
clientId: string,
maxTokens: number = 100,
refillRate: number = 10
): Promise<{ allowed: boolean; retryAfterMs: number }> {
const [allowed, retryAfter] = await redis.eval(
luaScript,
1, // number of keys
`rate_limit:${clientId}`, // KEYS[1]
maxTokens, // ARGV[1]
refillRate, // ARGV[2]
Date.now(), // ARGV[3]
1 // ARGV[4]
) as [number, number];
return {
allowed: allowed === 1,
retryAfterMs: retryAfter,
};
}
Part 4: HTTP Headers and Client Communication
Good rate limiting communicates clearly with clients. The IETF draft on rate limit headers standardizes these headers:
// Express middleware (assumes a limiter returning { allowed, remaining, resetTime, retryAfterMs })
async function rateLimitMiddleware(req, res, next) {
const clientId = req.headers['x-api-key'] || req.ip;
const result = await checkRateLimit(clientId);
// Set standard headers
res.setHeader('RateLimit-Limit', '100'); // Max requests per window
res.setHeader('RateLimit-Remaining', result.remaining); // Requests left
res.setHeader('RateLimit-Reset', result.resetTime); // Seconds until the window resets (delta-seconds per the IETF draft; many legacy APIs send a Unix timestamp instead)
// Legacy headers (still widely used)
res.setHeader('X-RateLimit-Limit', '100');
res.setHeader('X-RateLimit-Remaining', result.remaining);
res.setHeader('X-RateLimit-Reset', result.resetTime);
if (!result.allowed) {
res.setHeader('Retry-After', Math.ceil(result.retryAfterMs / 1000));
return res.status(429).json({
error: 'Too Many Requests',
message: `Rate limit exceeded. Retry after ${result.retryAfterMs}ms`,
retryAfter: result.retryAfterMs,
});
}
next();
}
Response Format for 429 Errors
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 30
Retry-After: 30
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "You have exceeded the rate limit of 100 requests per minute.",
"retryAfter": 30,
"documentation": "https://api.birjob.com/docs/rate-limiting"
}
}
Part 5: Quota Management — Beyond Simple Rate Limits
Rate limiting (requests per second/minute) is the first layer. Quota management adds longer-term limits tied to subscription tiers, billing, and usage policies.
Multi-Tier Quota System
// Quota tiers
const TIERS = {
free: {
requestsPerMinute: 30,
requestsPerDay: 1000,
requestsPerMonth: 10000,
maxResponseSize: '1MB',
endpoints: ['GET /jobs', 'GET /jobs/:id'],
},
starter: {
requestsPerMinute: 100,
requestsPerDay: 10000,
requestsPerMonth: 100000,
maxResponseSize: '10MB',
endpoints: ['*'],
},
professional: {
requestsPerMinute: 500,
requestsPerDay: 50000,
requestsPerMonth: 500000,
maxResponseSize: '50MB',
endpoints: ['*'],
},
enterprise: {
requestsPerMinute: 2000,
requestsPerDay: -1, // unlimited
requestsPerMonth: -1,
maxResponseSize: '100MB',
endpoints: ['*'],
},
};
// Quota check middleware. Assumes a checkRateLimit(key, limit, windowSeconds)
// helper returning { allowed, type } — note this signature differs from the
// token-bucket checkRateLimit in Part 3.
async function quotaMiddleware(req, res, next) {
const apiKey = req.headers['x-api-key'];
const client = await getClientByApiKey(apiKey);
const tier = TIERS[client.tier];
// Check every quota level with a finite limit (-1 means unlimited, so skip it)
const levels = [
{ key: `${apiKey}:minute`, limit: tier.requestsPerMinute, window: 60 },
{ key: `${apiKey}:day`, limit: tier.requestsPerDay, window: 86400 },
{ key: `${apiKey}:month`, limit: tier.requestsPerMonth, window: 2592000 },
].filter(l => l.limit !== -1);
const checks = await Promise.all(
levels.map(l => checkRateLimit(l.key, l.limit, l.window))
);
const failed = checks.find(c => !c.allowed);
if (failed) {
return res.status(429).json({
error: 'Quota exceeded',
quotaType: failed.type,
upgrade: `https://api.birjob.com/pricing`,
});
}
next();
}
Part 6: Being a Good API Consumer
When consuming external APIs (as BirJob does with 80+ job sites), respecting rate limits is essential. Get banned, and you lose access entirely.
Exponential Backoff on 429
async function fetchWithRateLimit(url: string, maxRetries = 5): Promise<Response> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const response = await fetch(url);
if (response.status !== 429) {
return response;
}
// Respect Retry-After header
const retryAfter = response.headers.get('Retry-After');
let waitMs: number;
if (retryAfter) {
// Could be seconds or an HTTP date; clamp to avoid a negative wait
waitMs = isNaN(Number(retryAfter))
? Math.max(0, new Date(retryAfter).getTime() - Date.now())
: Number(retryAfter) * 1000;
} else {
// Exponential backoff with jitter
waitMs = Math.min(1000 * Math.pow(2, attempt) + Math.random() * 1000, 60000);
}
console.log(`Rate limited. Waiting ${waitMs}ms before retry ${attempt + 1}/${maxRetries}`);
await new Promise(resolve => setTimeout(resolve, waitMs));
}
throw new Error(`Rate limit exceeded after ${maxRetries} retries`);
}
Request Scheduling
// Rate-limited request queue
class RequestScheduler {
private queue: Array<{ fn: () => Promise<any>; resolve: Function; reject: Function }> = [];
private processing = false;
private requestsThisWindow = 0;
private windowStart = Date.now();
private maxPerWindow: number;
private windowMs: number;
constructor(maxPerWindow: number, windowMs: number) {
this.maxPerWindow = maxPerWindow;
this.windowMs = windowMs;
}
async schedule<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push({ fn, resolve, reject });
this.process();
});
}
private async process() {
if (this.processing) return;
this.processing = true;
while (this.queue.length > 0) {
const now = Date.now();
if (now - this.windowStart >= this.windowMs) {
this.windowStart = now;
this.requestsThisWindow = 0;
}
if (this.requestsThisWindow >= this.maxPerWindow) {
const waitTime = this.windowMs - (now - this.windowStart);
await new Promise(resolve => setTimeout(resolve, waitTime));
continue;
}
const item = this.queue.shift()!;
this.requestsThisWindow++;
try {
const result = await item.fn();
item.resolve(result);
} catch (error) {
item.reject(error);
}
}
this.processing = false;
}
}
// Use in BirJob scraper: max 5 requests per second to any single site
const scheduler = new RequestScheduler(5, 1000);
const result = await scheduler.schedule(() => fetch('https://jobs.example.com/api/listings'));
Part 7: My Opinionated Take
After building rate limiters and being rate-limited by dozens of APIs, here's what I've learned:
1. Start with the token bucket algorithm. It handles bursts gracefully and is intuitive. Unless you have a specific reason to choose something else, token bucket is the right default.
2. Always use Redis for distributed rate limiting. In-process rate limiting breaks the moment you have two servers. Even if you're on a single server today, use Redis from the start. The Redis documentation on rate limiting patterns provides production-ready solutions.
3. Rate limits should be per-endpoint, not just per-client. A GET /jobs endpoint might handle 1000 requests/minute, but a POST /jobs endpoint that triggers database writes might only handle 10/minute. Different endpoints have different costs.
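One way to sketch this: key the limiter by client and endpoint, with a per-route limits table. The route names and numbers below are hypothetical, not BirJob's actual configuration:

```typescript
// Per-endpoint limits (illustrative values): cheap reads get a big budget,
// expensive writes get a small one. Unknown routes fall back to a default.
const ENDPOINT_LIMITS: Record<string, { maxTokens: number; refillPerSec: number }> = {
  'GET /jobs': { maxTokens: 1000, refillPerSec: 16 },
  'POST /jobs': { maxTokens: 10, refillPerSec: 0.2 },
};
const DEFAULT_LIMIT = { maxTokens: 100, refillPerSec: 2 };

function limitKeyFor(clientId: string, method: string, route: string) {
  const endpoint = `${method} ${route}`;
  const limit = ENDPOINT_LIMITS[endpoint] ?? DEFAULT_LIMIT;
  // One bucket per client per endpoint, e.g. "rate_limit:abc123:POST /jobs"
  return { key: `rate_limit:${clientId}:${endpoint}`, ...limit };
}
```

Feed the returned key and limits into whichever bucket implementation you use, and each client gets an independent budget per route.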
4. Communicate rate limits clearly. Return remaining quota in headers, provide a useful error message, include a Retry-After header, and link to documentation. The difference between a frustrating API and a pleasant one is often just the error messages.
5. As a consumer: always respect Retry-After. Never retry immediately on a 429. Always implement exponential backoff. And always read the API documentation before starting — many providers will ban you permanently for repeated violations.
Action Plan
For API Providers
- Implement token bucket rate limiting with Redis
- Return standard rate limit headers on every response
- Return clear 429 responses with Retry-After
- Set up different limits per tier and per endpoint
- Monitor rate limit hit rates — if legitimate users are frequently limited, your limits are too strict
For API Consumers
- Read the API documentation for rate limit policies
- Implement exponential backoff with jitter for 429 responses
- Use a request scheduler to stay under limits proactively
- Cache responses to reduce unnecessary API calls
- Monitor your usage against quotas to avoid surprises
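The caching point above can be as simple as a tiny in-memory TTL cache in front of your fetch calls. A minimal sketch (the TTL and key names are assumptions; the clock is injectable for testing):

```typescript
// In-memory TTL cache: serve repeat lookups locally instead of
// spending rate-limit budget on identical API calls.
class TtlCache<T> {
  private store = new Map<string, { value: T; expires: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): T | undefined {
    const hit = this.store.get(key);
    if (!hit || hit.expires <= now) {
      this.store.delete(key); // expired or missing
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }
}
```

Check the cache before calling the external API, and populate it on every successful response; a 5-to-15-minute TTL is often enough for slowly changing data like job listings.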
Sources
- Cloudflare: What Is Rate Limiting?
- Cloudflare Engineering: Counting Things
- IETF: Rate Limit Headers Draft
- Redis: Rate Limiting Patterns
- Stripe: Rate Limiting Best Practices
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
