Why Webhooks Fail
Last year, I watched a payment processing team lose 14 hours of webhook events because their retry logic
was a bare setTimeout with a fixed 30-second delay. The receiving service went down for a
database migration, came back up after 20 minutes, and got slammed with 40,000 queued retries at the exact
same instant. The service went down again. Then the retries fired again. It took a manual intervention to
break the cycle.
Webhooks fail constantly, and for reasons you cannot predict. Network partitions sever connections mid-flight. Server restarts during deployments return 502s for 10-30 seconds. Rate limiters reject bursts that arrive too fast. DNS propagation leaves stale records for minutes. TLS certificates expire at midnight on a Saturday. A downstream dependency returns 500s because its database is having a bad day.
The failure is not the problem. Every distributed system fails. The problem is what happens next.
Exponential Backoff with Jitter
The textbook answer is exponential backoff: wait 1 second, then 2, then 4, then 8. Double the delay each time. This works reasonably well for a single client retrying a single request. It falls apart when you have thousands of webhooks failing at the same time.
Imagine 5,000 webhook deliveries fail at 14:00:00 because the receiving server restarted. With pure exponential backoff, all 5,000 retry at 14:00:01. They all fail again. Then all 5,000 retry at 14:00:03. Fail again. Then 14:00:07. You have created a thundering herd, synchronized waves of traffic that hit the recovering server in coordinated bursts. I have seen this take down services that were otherwise perfectly healthy.
The fix is jitter. Instead of retrying at exactly baseDelay * 2^attempt, you randomize the
delay within a range. There are two common approaches. Full jitter picks a random value
between 0 and the calculated delay. This gives the widest spread but occasionally retries very quickly.
Equal jitter takes half the calculated delay as a fixed minimum and randomizes the other
half, which is a good middle ground that prevents both instant retries and excessively long waits. I prefer full
jitter for webhook systems because the wider distribution does a better job of desynchronizing traffic.
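A toy simulation makes the desynchronization concrete. The numbers below are purely illustrative: 5,000 clients that all failed at the same moment compute their first retry time, with and without full jitter.

```typescript
// Toy simulation: 5,000 clients all fail at t=0. Without jitter, every
// client schedules its first retry at exactly baseDelayMs; with full
// jitter, retries spread across the whole [0, baseDelayMs) interval.
const clients = 5_000
const baseDelayMs = 1_000

const noJitter = new Set<number>()
const fullJitter = new Set<number>()
for (let i = 0; i < clients; i++) {
  noJitter.add(baseDelayMs)                                // one synchronized wave
  fullJitter.add(Math.floor(Math.random() * baseDelayMs))  // spread over ~1,000 ms
}

console.log(noJitter.size)    // 1 distinct retry instant: the thundering herd
console.log(fullJitter.size)  // hundreds of distinct instants
```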
interface RetryConfig {
  baseDelayMs: number
  maxDelayMs: number
  maxAttempts: number
  jitterStrategy: 'full' | 'equal'
}

function calculateRetryDelay(attempt: number, config: RetryConfig): number {
  const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt)
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs)

  if (config.jitterStrategy === 'full') {
    // Full jitter: uniform random between 0 and cappedDelay
    return Math.floor(Math.random() * cappedDelay)
  }

  // Equal jitter: half fixed, half random
  const halfDelay = cappedDelay / 2
  return Math.floor(halfDelay + Math.random() * halfDelay)
}
async function deliverWithRetry(
  webhookUrl: string,
  payload: string,
  config: RetryConfig
): Promise<{ success: boolean; attempts: number; lastStatus?: number }> {
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      const response = await fetch(webhookUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: payload,
        signal: AbortSignal.timeout(10_000),
      })

      if (response.ok) {
        return { success: true, attempts: attempt + 1, lastStatus: response.status }
      }

      if (!isRetryableStatus(response.status)) {
        return { success: false, attempts: attempt + 1, lastStatus: response.status }
      }

      // Handle 429 with Retry-After header
      if (response.status === 429) {
        const retryAfter = parseRetryAfterHeader(response.headers.get('Retry-After'))
        if (retryAfter) {
          await sleep(retryAfter)
          continue
        }
      }
    } catch (err) {
      // Network errors and timeouts are always retryable
    }

    if (attempt < config.maxAttempts - 1) {
      const delay = calculateRetryDelay(attempt, config)
      await sleep(delay)
    }
  }

  return { success: false, attempts: config.maxAttempts }
}
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms))
}
A reasonable starting configuration: 2-minute base delay, 1-hour max delay, 8 max attempts, full jitter. The worst-case delays between attempts then sum to roughly three hours, and full jitter usually lands the actual window well under that. Note that with a short base delay, say 1 second, 8 attempts burn through in about two minutes, far too fast to ride out a real outage.
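It is worth sanity-checking any configuration by summing the worst-case, pre-jitter delays, because the base, cap, and attempt count interact in non-obvious ways. A sketch, assuming a 2-minute base, a 1-hour cap, and 8 attempts:

```typescript
// Worst-case (no jitter) sleep before each retry, for an assumed config:
// 2-minute base delay, 1-hour cap, 8 attempts (7 sleeps between them).
const baseDelayMs = 120_000    // 2 minutes
const maxDelayMs = 3_600_000   // 1 hour
const maxAttempts = 8

const scheduleMs: number[] = []
for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
  scheduleMs.push(Math.min(baseDelayMs * Math.pow(2, attempt), maxDelayMs))
}
const totalMs = scheduleMs.reduce((a, b) => a + b, 0)

console.log(scheduleMs.map(ms => `${ms / 60_000}min`).join(', '))
// 2min, 4min, 8min, 16min, 32min, 60min, 60min
console.log(`${(totalMs / 3_600_000).toFixed(1)}h worst case`)
// 3.0h worst case
```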
Which Status Codes Deserve a Retry
Not every failure should trigger a retry. A common mistake is retrying everything that is not a 2xx.
This wastes resources and can cause real harm. Imagine retrying a POST /charges webhook
to an endpoint that returns 400 because the payload format changed. You will retry forever and never succeed.
5xx status codes (500, 502, 503, 504) are almost always retryable. They indicate server-side problems that are typically transient: overloaded processes, gateway timeouts, service restarts.
4xx status codes are almost never retryable. A 400 means the request is malformed. A 401 means authentication failed. A 404 means the endpoint does not exist. A 410 means it was deliberately removed. Retrying these is pointless because the same request will get the same response.
The exception is 429 Too Many Requests. This means the server understood the request but
is asking you to slow down. Always check the Retry-After header first. If present, wait
exactly that long. If absent, fall back to your normal backoff schedule but add extra delay, since the server is
already under pressure.
function isRetryableStatus(status: number): boolean {
  if (status === 429) return true
  if (status >= 500 && status < 600) return true
  return false
}

function parseRetryAfterHeader(header: string | null): number | null {
  if (!header) return null

  // Could be seconds: "120"
  const seconds = parseInt(header, 10)
  if (!isNaN(seconds)) return seconds * 1000

  // Could be a date: "Wed, 21 Oct 2025 07:28:00 GMT"
  const date = new Date(header)
  if (!isNaN(date.getTime())) {
    const delayMs = date.getTime() - Date.now()
    return delayMs > 0 ? delayMs : null
  }

  return null
}
Idempotency Keys: Making Retries Safe
Here is a scenario that has burned every team I have worked with at least once. You send a
payment.completed webhook. The receiver processes it, credits the user's account, and starts
sending back a 200 response. The connection drops before you receive the response. From your side, the
delivery failed. You retry. The receiver processes the payment again. The user gets credited twice.
Retries inherently create the possibility of duplicate processing. The sender cannot distinguish between "the receiver never got it" and "the receiver got it but I never got the acknowledgment." The solution is idempotency keys: a unique identifier attached to each webhook event that lets the receiver detect and skip duplicates.
The sender should include this key in a header like X-Webhook-Id or in the payload body.
The receiver stores every processed key and checks incoming webhooks against that store before doing any work.
In practice, I recommend storing keys with a TTL of 7 days. That is long enough to cover any retry window, short
enough to avoid unbounded storage growth.
async function handleWebhook(req: Request): Promise<Response> {
  const idempotencyKey = req.headers.get('X-Webhook-Id')
  if (!idempotencyKey) {
    return new Response('Missing idempotency key', { status: 400 })
  }

  // Check if already processed (use Redis with TTL or a DB table)
  const alreadyProcessed = await redis.get(`webhook:processed:${idempotencyKey}`)
  if (alreadyProcessed) {
    // Return 200 so the sender stops retrying
    return new Response('Already processed', { status: 200 })
  }

  // Process the webhook
  const body = await req.json()
  await processEvent(body)

  // Mark as processed with a 7-day TTL
  await redis.set(`webhook:processed:${idempotencyKey}`, '1', 'EX', 604_800)

  return new Response('OK', { status: 200 })
}
A subtle but critical detail: set the processed flag after successful processing, not before. If you mark it as processed and then crash during processing, the event is lost permanently. If you process first and crash before marking, the worst case is a duplicate, which your business logic should handle gracefully anyway.
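The check-then-set pattern above also has a small race: two concurrent deliveries of the same event can both pass the `get` before either one sets the flag. One way to close it is a short-lived claim taken atomically before processing and released on failure, so a later retry can still get through. A sketch with an in-memory Map standing in for the store; the helper names are illustrative, and the durable "done" mark is still written only after processing succeeds:

```typescript
// Sketch: atomically claim the idempotency key before processing so two
// concurrent deliveries cannot both run. A Map stands in for Redis here;
// with a real store the claim would be a single atomic set-if-absent.
const store = new Map<string, 'processing' | 'done'>()

async function handleOnce(
  key: string,
  process: () => Promise<void>
): Promise<'processed' | 'duplicate'> {
  if (store.has(key)) return 'duplicate'   // already claimed or done
  store.set(key, 'processing')             // atomic claim (set-if-absent)
  try {
    await process()
    store.set(key, 'done')                 // durable mark only after success
    return 'processed'
  } catch (err) {
    store.delete(key)                      // release claim so a retry can run
    throw err
  }
}
```

The short-lived claim only prevents concurrent duplicates; a hard crash mid-processing still loses the claim until its TTL expires, which is why the claim should expire much sooner than the 7-day processed mark.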
Dead Letter Queues
Every retry policy has a limit. After 8 attempts spread over a few hours, if the endpoint is still down, you have to stop. But that webhook event still matters. A payment still happened. An order was still placed. Dropping it silently is not acceptable.
A dead letter queue (DLQ) captures webhook events that have exhausted all retry attempts. Instead of vanishing, they land in a separate queue where they can be inspected, debugged, and reprocessed once the underlying issue is resolved. The DLQ acts as a safety net that decouples "giving up on automatic delivery" from "losing the event forever."
In practice, a DLQ consumer should do three things: persist the failed event with full context (original payload, all attempt timestamps, the last error), send an alert to the team responsible for the webhook integration, and expose a mechanism for manual or bulk replay. I have seen teams build simple admin dashboards that let operators filter DLQ events by endpoint, error type, or time range, then replay them with a single click.
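A minimal shape for a DLQ entry, as a sketch; the field names are illustrative rather than taken from any particular queue library. The point is to capture enough context to debug and replay without consulting other systems:

```typescript
interface AttemptRecord {
  at: string       // ISO timestamp of the attempt
  status?: number  // HTTP status, if a response was received
  error?: string   // network or timeout error, if not
}

interface DeadLetterEntry {
  eventId: string
  endpoint: string
  payload: string            // original payload, verbatim, for replay
  attempts: AttemptRecord[]  // full attempt history
  lastError: string
  deadLetteredAt: string
}

function toDeadLetter(
  eventId: string,
  endpoint: string,
  payload: string,
  attempts: AttemptRecord[]
): DeadLetterEntry {
  // The final attempt determines the headline error shown to operators.
  const last = attempts[attempts.length - 1]
  return {
    eventId,
    endpoint,
    payload,
    attempts,
    lastError: last?.error ?? `HTTP ${last?.status}`,
    deadLetteredAt: new Date().toISOString(),
  }
}
```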
Circuit Breakers
Retry logic handles transient failures. But what about sustained failures? If an endpoint has been returning 503s for the last 200 consecutive attempts, continuing to send traffic is wasteful and potentially harmful. You are consuming your own resources and adding load to an already struggling service.
A circuit breaker tracks failure rates per endpoint and transitions between three states. Closed means everything is normal and requests flow through. Open means the endpoint is considered down, so all requests are immediately failed without making a network call. Half-open allows a single probe request through to test if the endpoint has recovered. If the probe succeeds, the circuit closes. If it fails, the circuit opens again.
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed'
  private failureCount = 0
  private lastFailureTime = 0
  private readonly failureThreshold: number
  private readonly resetTimeoutMs: number

  constructor(failureThreshold = 5, resetTimeoutMs = 60_000) {
    this.failureThreshold = failureThreshold
    this.resetTimeoutMs = resetTimeoutMs
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime >= this.resetTimeoutMs) {
        this.state = 'half-open'
      } else {
        throw new Error('Circuit is open, endpoint is unavailable')
      }
    }

    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (err) {
      this.onFailure()
      throw err
    }
  }

  private onSuccess(): void {
    this.failureCount = 0
    this.state = 'closed'
  }

  private onFailure(): void {
    this.failureCount++
    this.lastFailureTime = Date.now()
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'open'
    }
  }
}
The tricky part is choosing thresholds. Too aggressive (open after 2 failures) and you will stop delivering webhooks during normal blips, since a single timeout can trigger the breaker. Too lenient (open after 50 failures) and you waste resources hammering a dead endpoint. I start with a threshold of 5 consecutive failures and a 60-second reset timeout for most webhook delivery systems, then adjust based on observed traffic patterns.
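Breaker state is per endpoint, so a delivery system typically keeps a lazily populated registry. A sketch, with `MinimalBreaker` as a cut-down stand-in for the full class so the snippet stands on its own:

```typescript
// One breaker per endpoint, created on first use. MinimalBreaker is a
// stand-in that only tracks consecutive failures against a threshold.
class MinimalBreaker {
  private failures = 0
  constructor(private readonly threshold = 5) {}
  get isOpen(): boolean { return this.failures >= this.threshold }
  recordSuccess(): void { this.failures = 0 }
  recordFailure(): void { this.failures++ }
}

const breakers = new Map<string, MinimalBreaker>()

function breakerFor(endpoint: string): MinimalBreaker {
  let breaker = breakers.get(endpoint)
  if (!breaker) {
    breaker = new MinimalBreaker()
    breakers.set(endpoint, breaker)
  }
  return breaker
}
```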
Retry Budgets
Per-endpoint circuit breakers solve the problem of hammering a single dead endpoint. But what about system-wide retry storms? If 30% of your endpoints start failing simultaneously, maybe due to a shared infrastructure issue on the receiving side or a regional cloud outage, your retry traffic can exceed your normal delivery traffic. Your system spends more resources on retries than on actual new webhook deliveries.
A retry budget caps retry traffic as a percentage of total traffic. Google's SRE book recommends limiting retries to 10% of total requests over a rolling window. If your system delivers 10,000 webhooks per minute and is already retrying 1,000 of them, new retries are dropped (and routed to the DLQ) until the ratio falls back below the threshold.
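A sketch of a budget over a fixed window, using the 10% ratio described above; the class and method names are illustrative:

```typescript
// Caps retries at a fraction of total traffic within a window. Call
// resetWindow() on a timer (e.g. once a minute) to roll the window forward.
class RetryBudget {
  private firstAttempts = 0
  private retries = 0

  constructor(private readonly maxRetryRatio = 0.1) {}

  recordFirstAttempt(): void {
    this.firstAttempts++
  }

  // True if the retry may proceed; false means route the event to the DLQ.
  tryAcquireRetry(): boolean {
    const totalAfter = this.firstAttempts + this.retries + 1
    if ((this.retries + 1) / totalAfter > this.maxRetryRatio) {
      return false
    }
    this.retries++
    return true
  }

  resetWindow(): void {
    this.firstAttempts = 0
    this.retries = 0
  }
}
```

A production version would track counts in sliding buckets rather than a hard reset, but the admission logic is the same.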
This is a coarse-grained protection mechanism that prevents cascading failure in your own infrastructure. Without a retry budget, a large-scale downstream outage can turn your webhook delivery system into a self-inflicted DDoS against yourself. The retry budget says: "We will make a reasonable effort to redeliver, but we will not sacrifice system health to do it."
How Stripe, GitHub, and Shopify Handle Retries
Looking at how major platforms handle this is instructive, because each has made different tradeoffs.
Stripe
Stripe retries failed webhooks up to 16 times over approximately 3 days. The spacing follows an exponential schedule: retries happen at roughly 1 minute, 5 minutes, 30 minutes, 2 hours, and so on, with the intervals widening until the final attempt at about 72 hours after the initial failure. Stripe considers any 2xx response a success. Non-2xx responses and timeouts (Stripe waits up to 20 seconds for a response) trigger retries. After all 16 attempts fail, the endpoint is disabled and Stripe sends an email notification. Stripe includes an event ID in every webhook, making idempotent processing straightforward.
GitHub
GitHub takes a more conservative approach. Failed webhook deliveries are retried up to 3 times within a few hours. GitHub waits 10 seconds for a response before timing out. After repeated failures across multiple events (not just a single event), GitHub marks the webhook as failing and stops attempting delivery. The webhook remains configured but inactive. Delivery logs are available in the repository's webhook settings for 8 days, effectively serving as a viewable (but not automatically replayable) DLQ. GitHub also provides a manual "Redeliver" button per event in the UI.
Shopify
Shopify retries webhook deliveries up to 19 times over a 48-hour period. If the endpoint fails to respond with a 2xx status for 19 consecutive attempts, Shopify deletes the webhook subscription entirely, a noticeably more aggressive stance than Stripe or GitHub, which merely disable the endpoint. Shopify waits 5 seconds for a response. This means if your server is consistently slow (but not failing), Shopify will remove your webhook. I have seen teams surprised by this behavior during high-traffic events like flash sales.
The patterns are consistent across all three: exponential backoff with widening intervals, a hard ceiling on total attempts, and an escalation action (disable or delete) after persistent failure. Where they differ is in tolerance. Stripe is the most patient. Shopify is the least forgiving. GitHub sits in between but provides the best visibility into failure history.
Putting It All Together
A production webhook delivery system is not just a retry loop. It is a layered set of protections: jittered exponential backoff to spread load, status-code-aware retry decisions to avoid wasting effort, idempotency keys to make retries safe, circuit breakers to protect individual endpoints, retry budgets to protect the system as a whole, and dead letter queues to catch everything that falls through.
Start simple. Exponential backoff with full jitter and a dead letter queue gets you 90% of the way there. Add circuit breakers once you have enough endpoints that a single bad one can affect delivery to others. Add retry budgets when you reach a scale where correlated failures become a realistic risk. But build with idempotency from day one, because retrofitting it later is painful and error-prone.