The Complete Guide to Webhook Architecture and Reliability
Early in my career, I built a Stripe integration that processed payment webhooks. It worked perfectly — until it didn't. One Friday evening, Stripe sent us a batch of 200 webhooks in 30 seconds (a customer with a complex subscription). Our webhook endpoint took ~500ms per request (database writes, email sending, inventory updates — all synchronous). The server's connection pool filled up, subsequent webhooks timed out, Stripe retried them, which created more load, and we entered a death spiral. By the time I noticed, we had processed some events twice, missed others entirely, and our database had inconsistent order states.
That incident cost us a weekend of data reconciliation and taught me the hard way that webhooks are deceptively simple to implement and deceptively hard to get right. They look like regular HTTP endpoints, but they have unique reliability requirements that most web application patterns don't prepare you for.
According to Svix's 2024 webhook reliability research, the average webhook consumer experiences a 3.5% failure rate, and 15% of webhook implementations have no retry mechanism at all. This guide covers everything from basic implementation to production-grade reliability patterns used by companies processing millions of webhooks daily.
What Webhooks Actually Are (And Aren't)
A webhook is an HTTP callback — a POST request sent from one system to another when an event occurs. Instead of your application polling an API ("any new orders?"), the API calls you when something happens ("here's a new order").
| Approach | How It Works | Latency | Cost | Reliability |
|---|---|---|---|---|
| Polling | GET /api/events every N seconds | N seconds (polling interval) | High (wasted API calls) | High (you control the loop) |
| Webhooks | POST to your endpoint on event | Near real-time (~1-5s) | Low (only on events) | Medium (delivery not guaranteed) |
| WebSockets | Persistent bidirectional connection | Real-time (~ms) | Medium (connection overhead) | Low (connection drops) |
Webhooks are the standard integration pattern for SaaS APIs. Postman's 2024 State of API Report found that 82% of API providers offer webhooks, making them the most common real-time integration mechanism.
Receiving Webhooks: The Right Way
Rule #1: Return 200 Immediately, Process Later
The most important rule of webhook handling: acknowledge receipt immediately (return HTTP 200), then process the event asynchronously. Never do heavy processing inside the webhook handler — the sender has a timeout (typically 5-30 seconds), and if you don't respond in time, they'll retry.
// BAD: Synchronous processing in the webhook handler
app.post('/webhooks/stripe', async (req, res) => {
  const event = req.body;
  await updateDatabase(event); // 50ms
  await sendConfirmationEmail(event); // 2000ms
  await updateInventory(event); // 300ms
  await notifyAnalytics(event); // 100ms
  res.status(200).send('OK'); // Total: ~2.5s, too slow
});
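// GOOD: Acknowledge immediately, process async
app.post('/webhooks/stripe', async (req, res) => {
  // Verify the signature first, before trusting anything in the body
  if (!verifyStripeSignature(req)) {
    return res.status(401).send('Invalid signature');
  }
  const event = req.body;
  // Store the raw event. ON CONFLICT (assumes id is the primary key) turns a
  // provider retry of the same event into a no-op instead of a failed insert.
  await db.query(
    `INSERT INTO webhook_events (id, source, payload, status)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (id) DO NOTHING`,
    [event.id, 'stripe', JSON.stringify(event), 'pending']
  );
  // Acknowledge immediately
  res.status(200).send('OK');
  // Hand off to the queue; no business logic runs in the handler.
  // If publishing fails, the row stays 'pending' and can be re-enqueued later.
  queue.publish('webhook-processing', { eventId: event.id })
    .catch((err) => console.error('Failed to enqueue webhook event:', err.message));
});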
Rule #2: Always Verify Webhook Signatures
Without signature verification, anyone who knows your webhook URL can send fake events. Every reputable webhook provider includes a signature in the request headers.
// Stripe signature verification
const stripe = require('stripe')(process.env.STRIPE_SECRET_KEY);
function verifyStripeSignature(req) {
  const sig = req.headers['stripe-signature'];
  try {
    stripe.webhooks.constructEvent(
      // Must be the raw body, not parsed JSON. Express does not set req.rawBody
      // by default; this assumes your body parser has been configured to keep it.
      req.rawBody,
      sig,
      process.env.STRIPE_WEBHOOK_SECRET
    );
    return true;
  } catch (err) {
    console.error('Webhook signature verification failed:', err.message);
    return false;
  }
}
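// GitHub signature verification (HMAC-SHA256)
const crypto = require('crypto');
function verifyGitHubSignature(req) {
  const signature = req.headers['x-hub-signature-256'];
  if (typeof signature !== 'string') return false; // header missing
  const hmac = crypto.createHmac('sha256', process.env.GITHUB_WEBHOOK_SECRET);
  const digest = 'sha256=' + hmac.update(req.rawBody).digest('hex');
  const received = Buffer.from(signature);
  const expected = Buffer.from(digest);
  // timingSafeEqual throws if the lengths differ, so check the length first
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}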
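Both verifiers above depend on the raw request body, meaning the exact bytes the provider signed. The sketch below illustrates why, using only Node's built-in crypto module: the signature matches the raw bytes but not a parsed and re-serialized copy. The secret, the payload, and the `sign` and `verify` helper names are assumptions made up for illustration; they are not part of any provider's SDK.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hex-encoded HMAC-SHA256 of the given bytes.
function sign(secret, bytes) {
  return crypto.createHmac('sha256', secret).update(bytes).digest('hex');
}

// Hypothetical helper: constant-time comparison that tolerates length mismatches.
function verify(secret, bytes, signatureHex) {
  const expected = Buffer.from(sign(secret, bytes));
  const received = Buffer.from(String(signatureHex));
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

const secret = 'whsec_example'; // assumption: illustrative secret
// Raw body exactly as sent on the wire (note the extra whitespace).
const rawBody = Buffer.from('{"id": "evt_1",  "amount": 100}');
const signature = sign(secret, rawBody);

// Parsing and re-serializing drops the whitespace, so the bytes differ.
const reserialized = Buffer.from(JSON.stringify(JSON.parse(rawBody.toString())));

const rawOk = verify(secret, rawBody, signature);
const reserializedOk = verify(secret, reserialized, signature);
console.log({ rawOk, reserializedOk });
```

In Express, one common way to keep the raw bytes is the `verify` option of `express.json()`, which receives the request buffer before parsing and can stash it on `req.rawBody`.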
Rule #3: Idempotent Processing
Webhook providers send events at-least-once. Network issues, timeouts, and retries mean you will receive duplicate events. Your processing must be idempotent — processing the same event twice should have no additional effect.
// Idempotent webhook processor
async function processWebhookEvent(eventId) {
  // Atomic: try to claim the event for processing.
  // 'failed' is claimable too; otherwise the queue's retry after an error
  // would find a non-pending row and silently skip the event forever.
  const result = await db.query(`
    UPDATE webhook_events
    SET status = 'processing', started_at = NOW()
    WHERE id = $1 AND status IN ('pending', 'failed')
    RETURNING *
  `, [eventId]);
  if (result.rows.length === 0) {
    // Event already completed or currently being processed
    logger.info({ eventId }, 'Event already claimed or completed, skipping');
    return;
  }
  const event = result.rows[0];
  try {
    // Idempotent business logic
    await processPayment(JSON.parse(event.payload));
    await db.query(
      "UPDATE webhook_events SET status = 'completed', completed_at = NOW() WHERE id = $1",
      [eventId]
    );
  } catch (err) {
    await db.query(
      "UPDATE webhook_events SET status = 'failed', error = $2, failed_at = NOW() WHERE id = $1",
      [eventId, err.message]
    );
    throw err; // Let the queue handle retry
  }
}
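To see why the atomic claim makes duplicate deliveries harmless, here is a minimal in-memory model of the same idea. It is a simplified sketch that covers only the happy path: the `Map` stands in for the `webhook_events` table, and `store`, `claim`, and `handle` are hypothetical names. In production the claim has to be a single atomic database statement, as in the `UPDATE ... RETURNING` query above.

```javascript
// In-memory stand-in for the webhook_events table (assumption for illustration).
const events = new Map();

// Store an event once; a duplicate delivery with the same id is ignored.
function store(id, payload) {
  if (!events.has(id)) events.set(id, { payload, status: 'pending' });
}

// The claim succeeds only for the first caller that sees status 'pending'.
function claim(id) {
  const row = events.get(id);
  if (!row || row.status !== 'pending') return null;
  row.status = 'processing';
  return row;
}

let chargesApplied = 0;

// Hypothetical processor: applies the side effect at most once per event id.
function handle(id) {
  const row = claim(id);
  if (!row) return 'skipped';
  chargesApplied += 1; // the business side effect
  row.status = 'completed';
  return 'processed';
}

store('evt_1', { amount: 100 });
store('evt_1', { amount: 100 }); // provider retry delivers the same event again
const first = handle('evt_1');
const second = handle('evt_1');
console.log(first, second, chargesApplied);
```

The second call finds nothing to claim, so the side effect runs once no matter how many times the event arrives.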
Sending Webhooks: Building a Reliable Delivery System
If you're building a platform that sends webhooks to customers, reliability is your responsibility. Users depend on your webhooks for critical business logic. Here's the architecture I recommend:
The Delivery Pipeline
// Step 1: Create the webhook event
const { randomUUID } = require('crypto');
async function emitWebhook(eventType, payload, tenantId) {
  const event = {
    id: randomUUID(),
    type: eventType,
    data: payload,
    created_at: new Date().toISOString(),
    api_version: '2026-03-01'
  };
  // Get all active webhook endpoints for this tenant
  const endpoints = await db.query(
    'SELECT * FROM webhook_endpoints WHERE tenant_id = $1 AND active = true',
    [tenantId]
  );
  // Create one delivery row per endpoint and enqueue each by its own id,
  // because deliverWebhook() looks deliveries up by deliveryId
  for (const endpoint of endpoints.rows) {
    const deliveryId = randomUUID();
    await db.query(`
      INSERT INTO webhook_deliveries (id, endpoint_id, event_id, payload, status, attempt_count)
      VALUES ($1, $2, $3, $4, 'pending', 0)
    `, [deliveryId, endpoint.id, event.id, JSON.stringify(event)]);
    await queue.publish('webhook-delivery', { deliveryId });
  }
}
// Step 2: Deliver with retry
async function deliverWebhook(deliveryId) {
  // db.query resolves to a result object; the row itself is in rows[0]
  const delivery = (await db.query('SELECT * FROM webhook_deliveries WHERE id = $1', [deliveryId])).rows[0];
  const endpoint = (await db.query('SELECT * FROM webhook_endpoints WHERE id = $1', [delivery.endpoint_id])).rows[0];
  // The payload was stored as a JSON string; send those exact bytes so the
  // signed content and the request body cannot drift apart
  const body = delivery.payload;
  // Sign the payload
  const signature = crypto
    .createHmac('sha256', endpoint.secret)
    .update(body)
    .digest('hex');
  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Webhook-Signature': `sha256=${signature}`,
        'X-Webhook-ID': delivery.id,
        // Note: this timestamp is not covered by the signature above. Include it
        // in the signed content if receivers should rely on it for replay checks.
        'X-Webhook-Timestamp': new Date().toISOString()
      },
      body,
      signal: AbortSignal.timeout(10000) // 10s timeout
    });
    if (response.ok) {
      await db.query(
        "UPDATE webhook_deliveries SET status = 'delivered', delivered_at = NOW(), http_status = $2 WHERE id = $1",
        [deliveryId, response.status]
      );
    } else {
      throw new Error(`HTTP ${response.status}`);
    }
  } catch (err) {
    await handleDeliveryFailure(deliveryId, err);
  }
}
Exponential Backoff Retry Schedule
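When delivery fails, retry with increasing delays. Exponential backoff is the common approach: Stripe, for example, retries live-mode deliveries with exponential backoff for up to three days. Not every provider retries automatically (GitHub, for instance, leaves redelivery of failed webhooks to you), so check each provider's documentation. A typical schedule looks like this: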
| Attempt | Delay | Time Since First Attempt |
|---|---|---|
| 1 | Immediate | 0 |
| 2 | 5 minutes | 5 min |
| 3 | 30 minutes | 35 min |
| 4 | 2 hours | 2h 35min |
| 5 | 8 hours | 10h 35min |
| 6 | 24 hours | 34h 35min |
| 7+ | Disabled after 72h | - |
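The "Time Since First Attempt" column is just a running sum of the delays, which is easy to check in code. The sketch below recomputes it from the delay values in the table; `DELAYS_SECONDS`, `formatDuration`, and `cumulativeSchedule` are hypothetical names used only for this example.

```javascript
// Delays before attempts 1..6, in seconds (values taken from the table above).
const DELAYS_SECONDS = [0, 300, 1800, 7200, 28800, 86400];

// Format a duration in seconds as e.g. "35 min" or "2h 35min".
function formatDuration(totalSeconds) {
  const minutes = Math.round(totalSeconds / 60);
  const h = Math.floor(minutes / 60);
  const m = minutes % 60;
  if (h === 0) return `${m} min`;
  return m === 0 ? `${h}h` : `${h}h ${m}min`;
}

// A running sum of the delays gives the time elapsed since the first attempt.
function cumulativeSchedule(delays) {
  let elapsed = 0;
  return delays.map((delay, i) => {
    elapsed += delay;
    return { attempt: i + 1, elapsedSeconds: elapsed, label: formatDuration(elapsed) };
  });
}

const schedule = cumulativeSchedule(DELAYS_SECONDS);
console.log(schedule.map((s) => `${s.attempt}: ${s.label}`).join('\n'));
```

With these delays the sixth and final attempt lands roughly a day and a half after the first one, which is the window a receiving endpoint has to recover before the delivery is given up on.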
// Retry with exponential backoff
const RETRY_DELAYS = [0, 300, 1800, 7200, 28800, 86400]; // seconds
async function handleDeliveryFailure(deliveryId, error) {
  // db.query resolves to a result object; the row itself is in rows[0]
  const delivery = (await db.query('SELECT * FROM webhook_deliveries WHERE id = $1', [deliveryId])).rows[0];
  const attemptCount = delivery.attempt_count + 1;
  if (attemptCount >= RETRY_DELAYS.length) {
    // Max retries exceeded: mark as failed and alert the owner
    await db.query(
      "UPDATE webhook_deliveries SET status = 'failed', attempt_count = $2 WHERE id = $1",
      [deliveryId, attemptCount]
    );
    await alertEndpointOwner(delivery.endpoint_id, 'Webhook delivery permanently failed');
    return;
  }
  const nextRetryDelay = RETRY_DELAYS[attemptCount];
  // A bind parameter cannot sit inside a quoted interval literal,
  // so multiply a one-second interval by the parameter instead
  await db.query(
    "UPDATE webhook_deliveries SET status = 'retrying', attempt_count = $2, next_retry_at = NOW() + ($3::int * interval '1 second') WHERE id = $1",
    [deliveryId, attemptCount, nextRetryDelay]
  );
  // Schedule retry
  await queue.publishDelayed('webhook-delivery', { deliveryId }, nextRetryDelay * 1000);
}
Security Best Practices
Webhooks open an HTTP endpoint on your server that receives data from external sources. This introduces security risks that must be mitigated:
- Always verify signatures. Never process a webhook without verifying its cryptographic signature. Use crypto.timingSafeEqual for comparison to prevent timing attacks.
- Validate the payload. Even after signature verification, validate the payload structure. A compromised signing key could send valid signatures with malicious payloads.
- Use HTTPS only. Webhook URLs should always be HTTPS. Reject HTTP URLs in your webhook registration API.
- Implement rate limiting. Protect your webhook endpoint from abuse. If an attacker discovers your URL, they could flood it with requests.
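- Timestamp validation. Reject webhooks with timestamps older than 5 minutes. This prevents replay attacks where an attacker resends a captured webhook, provided the timestamp is covered by the signature. If it is not, the attacker can simply rewrite it.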
- IP allowlisting (optional). Some providers publish their IP ranges (Stripe, GitHub). Allowlisting these IPs adds an extra layer of protection.
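The timestamp rule in the list above can be sketched as follows. This is an illustration under assumptions, not any provider's exact scheme: it signs the string `<timestamp>.<body>`, similar in spirit to how Stripe includes a timestamp in its signed payload, and the names `signWithTimestamp`, `verifyFresh`, and `TOLERANCE_MS` are hypothetical.

```javascript
const crypto = require('crypto');

const TOLERANCE_MS = 5 * 60 * 1000; // accept timestamps within 5 minutes of now

// Hypothetical scheme: sign "<timestamp>.<body>" so the timestamp cannot be altered.
function signWithTimestamp(secret, timestampMs, body) {
  return crypto.createHmac('sha256', secret)
    .update(`${timestampMs}.${body}`)
    .digest('hex');
}

// Returns true only if the timestamp is fresh AND the signature matches.
function verifyFresh(secret, timestampMs, body, signatureHex, nowMs = Date.now()) {
  const ts = Number(timestampMs);
  // Reject stale timestamps (and ones implausibly far in the future).
  if (!Number.isFinite(ts) || Math.abs(nowMs - ts) > TOLERANCE_MS) return false;
  const expected = Buffer.from(signWithTimestamp(secret, ts, body));
  const received = Buffer.from(String(signatureHex));
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

const secret = 'whsec_example'; // assumption: illustrative secret
const body = '{"id":"evt_1"}';
const now = 1_700_000_000_000; // fixed clock so the example is deterministic
const sig = signWithTimestamp(secret, now, body);

const fresh = verifyFresh(secret, now, body, sig, now + 60_000);          // 1 minute later
const replayed = verifyFresh(secret, now, body, sig, now + 10 * 60_000);  // 10 minutes later
const tampered = verifyFresh(secret, now + 1, body, sig, now + 60_000);   // timestamp rewritten
console.log({ fresh, replayed, tampered });
```

Because the timestamp is part of the signed string, an attacker who replays an old request cannot refresh the timestamp without invalidating the signature.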
My Opinionated Webhook Rules
1. The webhook endpoint should do exactly two things: verify the signature and store the event. Everything else happens asynchronously. This is the golden rule that prevents the death spiral I described at the beginning.
2. Build a webhook event log from day one. Store every webhook event you receive, whether you process it successfully or not. This is your audit trail, your debugging tool, and your data reconciliation safety net.
3. If you're building a platform, consider a managed webhook service. Svix and Hookdeck provide managed webhook infrastructure with retries, monitoring, and debugging tools, and ngrok is useful for exposing a local endpoint while you test. Building reliable webhook delivery from scratch takes months of engineering time.
4. Always implement the "reconciliation endpoint." Alongside your webhook handler, build an API endpoint that lets you fetch events you might have missed. Webhooks should be the primary notification mechanism, but the reconciliation endpoint is your safety net.
5. Monitor webhook processing lag. The time between when an event is received and when it's fully processed is a critical metric. If it exceeds your SLA (typically < 60 seconds), you need to scale your processing pipeline.
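For rule 5, processing lag can be derived from two timestamps per event: when it was received and when processing completed. The sketch below computes nearest-rank percentiles over a handful of made-up sample records and compares the 95th percentile with a 60-second target. The record shape, the sample values, and the names `lagPercentile` and `SLA_MS` are assumptions for illustration.

```javascript
const SLA_MS = 60_000; // the 60-second target mentioned in rule 5

// Made-up sample data: timestamps in milliseconds.
const processed = [
  { id: 'evt_1', receivedAt: 0, completedAt: 1_200 },
  { id: 'evt_2', receivedAt: 1_000, completedAt: 1_800 },
  { id: 'evt_3', receivedAt: 2_000, completedAt: 97_000 },
  { id: 'evt_4', receivedAt: 3_000, completedAt: 6_000 },
  { id: 'evt_5', receivedAt: 4_000, completedAt: 6_500 },
];

// Nearest-rank percentile of processing lag (completedAt - receivedAt).
function lagPercentile(records, p) {
  const lags = records
    .map((r) => r.completedAt - r.receivedAt)
    .sort((a, b) => a - b);
  if (lags.length === 0) return null;
  const rank = Math.ceil((p / 100) * lags.length); // 1-based rank
  return lags[Math.min(lags.length, Math.max(1, rank)) - 1];
}

const p50 = lagPercentile(processed, 50);
const p95 = lagPercentile(processed, 95);
const breachesSla = p95 !== null && p95 > SLA_MS;
console.log({ p50, p95, breachesSla });
```

Tracking a high percentile rather than the average keeps one slow event, like `evt_3` here, from being hidden by many fast ones.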
Action Plan: Production Webhooks in 3 Weeks
Week 1: Receiving Webhooks
- Create the webhook endpoint with signature verification
- Implement the "store and process async" pattern
- Build the idempotent event processor with deduplication
- Set up the webhook event log table
Week 2: Sending Webhooks (if applicable)
- Design the webhook delivery pipeline with retry logic
- Implement HMAC signature generation
- Build the endpoint registration and management API
- Set up exponential backoff retry schedule
Week 3: Monitoring and Hardening
- Add monitoring for delivery success rate, processing lag, and retry counts
- Set up alerts for failed deliveries and endpoint downtime
- Build a webhook testing tool (for endpoint owners to test their integrations)
- Document the webhook API with event types, payload formats, and retry behavior
Sources and Further Reading
- Svix — Webhook Failure Statistics 2024
- Postman — 2024 State of API Report
- Stripe — Webhook Documentation
- GitHub — Webhooks Documentation
- Svix — Managed Webhook Delivery
- Hookdeck — Webhook Infrastructure
- webhooks.fyi — Webhook Best Practices Directory
- Standard Webhooks — Open Specification
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
