Rate Limiting, Throttling, and Backpressure: A Developer's Guide
I once watched a partner integration take down our entire API. Not a DDoS attack. Not a bug. Just one enthusiastic client running a data sync without any delay between requests. They sent 12,000 API calls in 45 seconds. Our database connection pool was exhausted in under a minute. Every other customer got 503 errors for the next ten minutes while we scrambled to recover.
That day, I learned a lesson that every developer learns eventually: if you don't control how traffic enters your system, someone else will control it for you, usually at the worst possible time.
Rate limiting, throttling, and backpressure are three related but distinct mechanisms for managing traffic flow. They're often conflated, sometimes misunderstood, and frequently implemented incorrectly. This guide breaks down each one, shows you when to use which, and gives you practical implementations you can deploy today.
Definitions: Getting the Terminology Right
Before we go deeper, let's be precise about what these terms mean:
| Mechanism | What It Does | Who Enforces It | Response to Excess |
|---|---|---|---|
| Rate Limiting | Caps the number of requests in a time window | Server / API Gateway | Reject (HTTP 429) |
| Throttling | Slows down request processing | Server / Client | Queue or delay |
| Backpressure | Signals upstream to slow down | Consumer / Downstream | Propagate signal upstream |
Rate limiting says "no more than 100 requests per minute." Throttling says "I'll process your request, but you might have to wait." Backpressure says "I'm overwhelmed, please slow down." They work at different layers and solve different problems.
Rate Limiting Algorithms: The Core Four
1. Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. Reject when the count exceeds the limit.
Problem: Burst at window boundaries. A client can send 100 requests at 11:59:59 and another 100 at 12:00:00, effectively sending 200 requests in 2 seconds while staying within the "100 per minute" limit for both windows.
Use when: You need something simple and the boundary burst problem is acceptable.
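The fixed window counter fits in a few lines. Below is a minimal in-memory sketch (the `FixedWindowLimiter` name and injectable clock are my own illustration, not a library API); a production version would keep the counters in a shared store like Redis.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Counts requests per fixed time window; rejects once the limit is hit."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self.counters = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key):
        window_index = int(self.clock()) // self.window  # which window are we in?
        bucket = (key, window_index)
        if self.counters[bucket] >= self.limit:
            return False  # over the limit for this window -> respond with 429
        self.counters[bucket] += 1
        return True
```

Note that nothing in this code prevents the boundary burst described above: a fresh window index means a fresh counter, regardless of what happened one second earlier.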
2. Sliding Window Log
Store the timestamp of every request. When a new request arrives, remove timestamps older than the window size, then count remaining entries. Reject if the count exceeds the limit.
Problem: Memory-intensive. Storing timestamps for every request for every user adds up quickly. If you have 10,000 users each making 100 requests per minute, that's 1 million timestamps in memory.
Use when: You need precise rate limiting and memory isn't a constraint.
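A sliding window log can be sketched with a deque of timestamps (again an illustrative in-memory version; names are mine). The memory cost is visible directly: one float per request still inside the window.

```python
import time
from collections import deque

class SlidingWindowLog:
    """Keeps a timestamp per request; precise but memory-hungry."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.log = deque()  # timestamps of recent requests, oldest first

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.log and now - self.log[0] > self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```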
3. Sliding Window Counter
A hybrid approach. Keep counters for the current and the previous fixed window, and estimate the rate with a weighted count based on how far into the current window you are. If you are 30% into the current window, the estimate is: (current window count) + (previous window count * 70%).
Advantage: Good accuracy with low memory overhead. This is what most production systems use.
Use when: You want a good balance of accuracy and performance. This should be your default choice.
4. Token Bucket
Imagine a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows short bursts.
Advantage: Naturally handles bursts up to the bucket capacity while enforcing a long-term average rate. This is what Stripe, GitHub, and most major APIs use.
Use when: You want to allow burst traffic while maintaining a steady average rate.
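A token bucket is easy to express with lazy refills: instead of a background timer adding tokens, compute how many tokens accrued since the last call. This is a minimal in-memory sketch with illustrative names, not any particular library's API.

```python
import time

class TokenBucket:
    """Refills tokens at a steady rate; capacity bounds the burst size."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (long-term average rate)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `cost` parameter is a common extension: expensive endpoints can consume more than one token per request.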
| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Very Low | Low | Poor | Very Low |
| Sliding Window Log | High | Very High | Good | Medium |
| Sliding Window Counter | Low | High | Good | Medium |
| Token Bucket | Very Low | High | Excellent | Low |
Implementing Rate Limiting with Redis
Redis is the de facto standard for distributed rate limiting. Its atomic operations and built-in key expiration make it ideal for implementing a sliding window counter.
The key insight is using Redis's MULTI/EXEC for atomicity. The pattern is:
- Increment the counter for the current window
- Set expiration on the key (so old windows are automatically cleaned up)
- Get the counter for the previous window
- Calculate the weighted count
- Decide: allow or reject
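The steps above can be sketched in memory as follows. The dict stands in for Redis keys (in production, each counter would be an `INCR` plus `EXPIRE` inside `MULTI`/`EXEC`), and for simplicity this sketch computes the weighted count before incrementing. Class and method names are illustrative assumptions.

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Weighted blend of the current and previous fixed-window counters."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = defaultdict(int)  # (key, window index) -> count

    def allow(self, key):
        now = self.clock()
        idx = int(now // self.window)                # current window index
        elapsed = (now % self.window) / self.window  # fraction into current window
        current = self.counters[(key, idx)]
        previous = self.counters[(key, idx - 1)]
        # Weighted count: e.g. 30% into the window -> current + previous * 70%.
        weighted = current + previous * (1 - elapsed)
        if weighted >= self.limit:
            return False  # reject -> respond with 429
        self.counters[(key, idx)] += 1  # count this request in the current window
        return True
```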
For the token bucket algorithm, Redis's EVALSHA with a Lua script is the way to go. The Lua script runs atomically on the Redis server, preventing race conditions between checking the bucket and consuming tokens.
Redis's own benchmarks show a single instance handling well over 100,000 operations per second on commodity hardware, which is more than enough headroom for most applications' rate limit checks. For higher scale, Redis Cluster distributes the load across multiple nodes.
Rate Limiting at Different Layers
API Gateway Level
This is your first line of defense. Tools like Kong, Nginx, and cloud providers' API gateways (AWS API Gateway, GCP Cloud Endpoints) offer built-in rate limiting. Configure it here and you protect your entire backend without any code changes.
AWS API Gateway, for example, allows you to set a rate of 10,000 requests per second with a burst capacity of 5,000. According to AWS documentation, these limits can be configured per API key, per stage, and per method.
Application Level
For more nuanced rate limiting (per user, per endpoint, per pricing tier), implement it in your application code. Libraries like rate-limiter-flexible for Node.js, django-ratelimit for Python, and Rack::Attack for Ruby make this straightforward.
Database Level
Often overlooked but critical. Connection pools are a form of rate limiting. PostgreSQL's max_connections setting (default 100) is a hard rate limit on concurrent database access. PgBouncer sits in front of PostgreSQL and provides connection pooling with configurable limits.
Throttling: Slowing Down Instead of Rejecting
Sometimes rejecting requests is too harsh. Throttling offers a gentler alternative: slow down the processing instead of refusing it entirely.
Server-Side Throttling
Request queuing: Instead of returning 429, queue excess requests and process them at a controlled rate. This works well for background jobs and batch processing but is problematic for real-time user requests where latency matters.
Priority-based throttling: Assign priority to requests. When the system is under load, process high-priority requests normally while delaying or queuing low-priority ones. For example, a SaaS application might prioritize paying customers over free-tier users during peak load.
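Priority-based throttling is essentially a priority queue in front of the worker pool. A minimal sketch (the `PriorityThrottle` name and priority numbering are assumptions of mine):

```python
import heapq
import itertools

class PriorityThrottle:
    """Serve high-priority requests first when the system is saturated."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, priority, request):
        # Lower number = higher priority (e.g. 0 = paying customer, 1 = free tier).
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def next_request(self):
        """Pop the highest-priority (then oldest) pending request, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Under light load the queue stays near-empty and priorities are irrelevant; they only start to matter once requests arrive faster than workers drain them.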
Client-Side Throttling
Good API clients throttle themselves. This is both a courtesy and a practical necessity, because if you don't, the server will rate limit you anyway, and 429 errors are more disruptive than a slight delay.
Exponential backoff with jitter is the standard approach for retries after throttling. The formula: delay = min(cap, base * 2^attempt) + random(0, jitter). The jitter prevents the "thundering herd" problem where many clients retry at the same time.
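The formula translates directly to code. Here is a small sketch (function names and the default constants are my choices, not a standard):

```python
import random
import time

def backoff_delay(attempt, base=0.5, cap=30.0, jitter=1.0):
    """delay = min(cap, base * 2^attempt) + random(0, jitter)."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)

def call_with_retries(do_request, max_attempts=5):
    """Retry a callable that raises on failure (e.g. on an HTTP 429)."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(backoff_delay(attempt))
```

The cap matters: without it, attempt 10 at a 0.5-second base would mean waiting over eight minutes.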
Google's SRE book recommends a more sophisticated approach called "adaptive throttling" where the client tracks its own error rate and progressively reduces its sending rate as errors increase.
Backpressure: The Often-Ignored Third Pillar
Backpressure is fundamentally different from rate limiting and throttling. Instead of the server telling the client "you're sending too much," it's a downstream component telling an upstream component "I can't keep up." The signal flows backward through the system, hence "back" pressure.
Where Backpressure Matters
- Message queues: When a Kafka consumer can't process messages fast enough, the consumer lag grows. This is a backpressure signal.
- Stream processing: In systems like Apache Flink or Kafka Streams, backpressure is a first-class concept. When a processing stage is slow, it signals upstream stages to slow down.
- HTTP/2: The protocol has built-in flow control. A receiver can signal to the sender to stop sending data by adjusting the flow control window.
- TCP: TCP's sliding window protocol is the original backpressure mechanism. When the receiver's buffer is full, it tells the sender to stop.
Backpressure Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Drop oldest | Discard the oldest items in the buffer | Keeps most recent data | Data loss |
| Drop newest | Reject new items when buffer is full | Simple, preserves order | Data loss |
| Block producer | Stop the producer until space is available | No data loss | Can cascade upstream |
| Buffer to disk | Spill excess to disk storage | No data loss, no blocking | Slower, disk management |
| Sample | Only process a percentage of items | Controlled degradation | Incomplete data |
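As one example from the table, the drop-oldest strategy falls out of a bounded deque almost for free. A sketch (the class and the eviction counter are my own illustration):

```python
from collections import deque

class DropOldestBuffer:
    """Bounded buffer that discards the oldest item when full (drop-oldest)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # deque evicts the oldest on overflow
        self.dropped = 0

    def push(self, item):
        if len(self.items) == self.items.maxlen:
            self.dropped += 1  # track evictions as a backpressure signal
        self.items.append(item)
```

Exposing the `dropped` count as a metric turns silent data loss into an observable backpressure signal you can alert on.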
Reactive programming frameworks like Reactive Streams (Java), RxJS (JavaScript), and Project Reactor (Spring) have backpressure built into their APIs. If you're using these frameworks, you get backpressure handling "for free" by using their operators correctly.
Real-World Rate Limiting: How the Big Players Do It
| API Provider | Rate Limit | Algorithm | Notable Feature |
|---|---|---|---|
| GitHub | 5,000 req/hr (authenticated) | Token Bucket | X-RateLimit-* headers |
| Stripe | 100 req/sec (live), 25 req/sec (test) | Token Bucket | Idempotency keys for retries |
| Twitter/X | Varies by endpoint (15-900 per 15 min) | Fixed Window | Per-endpoint limits |
| OpenAI | Varies by tier and model | Token Bucket | Both RPM and TPM limits |
| Shopify | 40 req/sec (Plus), 2 req/sec (basic) | Leaky Bucket | Tier-based limits |
Notice a pattern: every major API uses clear response headers to communicate rate limit status. At minimum, you should return:
- X-RateLimit-Limit: Maximum requests allowed
- X-RateLimit-Remaining: Requests remaining in the current window
- X-RateLimit-Reset: Unix timestamp when the window resets
- Retry-After: Seconds to wait before retrying (on 429 responses)
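Building these headers from a limiter's state might look like this. The function and its parameters are illustrative assumptions; the header names follow the de facto X-RateLimit-* convention.

```python
import time

def rate_limit_headers(limit, remaining, reset_ts, allowed, now=None):
    """Build the conventional rate limit response headers."""
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),  # Unix timestamp of the window reset
    }
    if not allowed:  # pair with an HTTP 429 response
        headers["Retry-After"] = str(max(0, int(reset_ts - now)))
    return headers
```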
My Opinionated Take
After years of building and consuming rate-limited APIs, here are my strong opinions:
1. Rate limiting is not optional. Every API that will be consumed by external clients needs rate limiting from day one. Not "when we scale." Not "when we have abuse." Day one. I've seen startups without rate limiting get accidentally DDoS'd by a single partner integration more times than I can count.
2. The token bucket is the right default. Unless you have a specific reason to use something else, start with a token bucket. It handles bursts gracefully, it's intuitive to explain to API consumers, and it's easy to implement with Redis.
3. Client-side throttling should be mandatory. If you're building an API client library (SDK), build in automatic retry with exponential backoff. Don't make developers implement it themselves. They won't, or they'll do it wrong.
4. Backpressure is the most underused mechanism. Most web applications have no backpressure signals. A slow database query doesn't cause the API to stop accepting requests. A full message queue doesn't cause producers to slow down. Adding backpressure signals, even simple ones like monitoring queue depth, can prevent cascading failures.
5. Rate limits should be tier-based from the start. Even if you only have one tier today, structure your rate limiting code to support multiple tiers. When you add a paid plan (and you will), you don't want to refactor your rate limiting infrastructure.
Action Plan: Implementing Rate Limiting in Your API
Phase 1: Basic Protection (Day 1)
- Add rate limiting at your API gateway or reverse proxy (Nginx, Cloudflare, AWS API Gateway)
- Set a generous global limit (e.g., 1,000 requests per minute per IP)
- Return proper 429 responses with Retry-After headers
- Add monitoring: alert when any client hits the rate limit
Phase 2: Per-User Limits (Week 1-2)
- Implement token bucket rate limiting in your application code with Redis
- Set per-user limits based on authentication level
- Add X-RateLimit-* response headers to all API responses
- Document your rate limits in your API documentation
Phase 3: Granular Controls (Month 1)
- Add per-endpoint rate limits (write endpoints get lower limits than read endpoints)
- Implement tier-based limits (free vs. paid)
- Add client-side throttling to your SDK/client libraries
- Build a rate limit dashboard showing usage patterns
Phase 4: Backpressure (Month 2-3)
- Add health checks that monitor downstream dependency load
- Implement circuit breakers for external service calls
- Add queue depth monitoring with alerts
- Consider adaptive throttling based on system load
Key Takeaways
- Rate limiting rejects excess traffic; throttling slows it down; backpressure signals upstream to reduce flow.
- Token bucket is the best default algorithm for API rate limiting.
- Redis is the standard backend for distributed rate limiting.
- Always return rate limit headers so clients can self-regulate.
- Implement client-side throttling (exponential backoff with jitter) in your SDK.
- Backpressure is the least implemented but most valuable mechanism for preventing cascading failures.
- Start with gateway-level rate limiting on day one, then add application-level granularity.
Sources
- Redis Rate Limiter Pattern
- Stripe Rate Limits Documentation
- GitHub REST API Rate Limits
- AWS Architecture Blog - Exponential Backoff and Jitter
- Google SRE Book - Handling Overload
- AWS API Gateway Throttling
- Reactive Streams Specification
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
