When 10,000 Events Arrive in 30 Seconds
A few years ago, I watched a SaaS team import 50,000 customer records through a bulk CSV upload. Each record
triggered a customer.created webhook. Their downstream accounting integration received all 50,000
POST requests within about 40 seconds, exhausted its database connection pool in under 10 seconds, and fell over.
The integration was offline for two hours. Every webhook after the first 300 or so was lost because there was
no retry logic on either side.
API rate limiting is a solved problem. You slap a limit on inbound requests, return 429 when someone exceeds it, and move on. Webhooks are different because traffic flows in the opposite direction: the sender pushes data to the receiver. Neither side has full control. The sender does not know how much capacity the receiver has. The receiver cannot predict when a burst will arrive. Both sides need independent strategies, and those strategies need to coordinate through HTTP semantics when things go wrong.
Sender-Side: The Token Bucket
The token bucket algorithm is the standard approach for smoothing outbound webhook delivery. The idea is simple: you have a bucket that holds tokens. Each webhook delivery costs one token. Tokens refill at a fixed rate. When the bucket is empty, deliveries are queued until tokens become available.
Two parameters control behavior. Capacity is the maximum number of tokens the bucket can hold, and this determines your burst allowance. Refill rate is how many tokens are added per second, which determines your sustained throughput. A bucket with capacity 50 and refill rate 10/second allows a burst of 50 immediate deliveries, then sustains 10 per second after that. In practice, I set capacity to roughly 5x the refill rate, which absorbs short spikes without letting truly massive bursts through.
class TokenBucket {
  private tokens: number
  private lastRefill: number
  private readonly capacity: number
  private readonly refillRate: number // tokens per second

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity
    this.refillRate = refillRate
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  tryConsume(): boolean {
    this.refill()
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }

  getWaitTimeMs(): number {
    this.refill()
    if (this.tokens >= 1) return 0
    const tokensNeeded = 1 - this.tokens
    return Math.ceil((tokensNeeded / this.refillRate) * 1000)
  }

  private refill(): void {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    const newTokens = elapsed * this.refillRate
    this.tokens = Math.min(this.capacity, this.tokens + newTokens)
    this.lastRefill = now
  }
}
The tryConsume method is non-blocking and returns false immediately if no tokens are available.
The getWaitTimeMs method tells the caller how long to wait before a token will be available. In a
webhook delivery worker, you would call tryConsume before each delivery attempt, and if it returns
false, schedule the delivery for getWaitTimeMs() milliseconds later. Never drop webhooks because
of rate limiting. Queue them.
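To make that worker loop concrete, here is one way the pieces fit together. Everything other than the bucket's two methods is a hypothetical stand-in: `deliver` for your HTTP client, `schedule` for your delay queue.

```typescript
// Sketch of a delivery worker step built on a token bucket. `RateLimiter`
// mirrors the bucket's public surface; `deliver` and `schedule` are
// hypothetical stand-ins for your transport and your delay queue.
interface RateLimiter {
  tryConsume(): boolean
  getWaitTimeMs(): number
}

interface Delivery {
  endpointUrl: string
  payload: unknown
}

function processDelivery(
  bucket: RateLimiter,
  delivery: Delivery,
  deliver: (d: Delivery) => Promise<void>,
  schedule: (d: Delivery, delayMs: number) => void
): Promise<void> | void {
  if (bucket.tryConsume()) {
    // A token was available: deliver immediately.
    return deliver(delivery)
  }
  // No token: never drop the webhook -- requeue it for when a token refills.
  schedule(delivery, bucket.getWaitTimeMs())
}
```

The point of the sketch is the branch: a failed `tryConsume` always turns into a scheduled retry, never a discard.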
Per-Endpoint Limits
A global token bucket protects your own infrastructure, but it treats all consumers equally. That is rarely correct. A small startup running a single-instance Express server cannot handle the same throughput as a Fortune 500 company with an auto-scaling Kubernetes cluster. I have seen webhook platforms burn their smaller customers by delivering at a rate tuned for the largest.
Per-endpoint rate limits let you assign different delivery speeds to different consumers. The implementation is a Map of token buckets, keyed by endpoint ID. Each endpoint gets its own capacity and refill rate, configurable either through an API or through defaults based on the consumer's plan tier.
interface EndpointLimitConfig {
  capacity: number
  refillRate: number
}

class PerEndpointLimiter {
  private buckets = new Map<string, TokenBucket>()
  private configs = new Map<string, EndpointLimitConfig>()
  private readonly defaultConfig: EndpointLimitConfig = {
    capacity: 50,
    refillRate: 10,
  }

  setEndpointConfig(endpointId: string, config: EndpointLimitConfig): void {
    this.configs.set(endpointId, config)
    // Reset the bucket with the new config
    this.buckets.set(
      endpointId,
      new TokenBucket(config.capacity, config.refillRate)
    )
  }

  tryConsume(endpointId: string): boolean {
    const bucket = this.getOrCreateBucket(endpointId)
    return bucket.tryConsume()
  }

  getWaitTimeMs(endpointId: string): number {
    const bucket = this.getOrCreateBucket(endpointId)
    return bucket.getWaitTimeMs()
  }

  private getOrCreateBucket(endpointId: string): TokenBucket {
    let bucket = this.buckets.get(endpointId)
    if (!bucket) {
      const config = this.configs.get(endpointId) ?? this.defaultConfig
      bucket = new TokenBucket(config.capacity, config.refillRate)
      this.buckets.set(endpointId, bucket)
    }
    return bucket
  }
}
A reasonable default is 50 capacity / 10 per second for standard-tier consumers. Enterprise endpoints might get 500 capacity / 100 per second. Free-tier endpoints might be throttled to 10 capacity / 2 per second. The key point is that these limits should be visible and configurable by the consumer, because nobody likes being throttled silently.
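One way to wire those tiers up, sketched with the numbers above as defaults. The tier names and the `limitsForTier` helper are illustrative, not part of any particular platform's API.

```typescript
// Map plan tiers to per-endpoint limiter configs. The tier names and numbers
// mirror the examples in the text; adjust them for your own plans.
interface EndpointLimitConfig {
  capacity: number
  refillRate: number
}

const TIER_LIMITS: Record<string, EndpointLimitConfig> = {
  free: { capacity: 10, refillRate: 2 },
  standard: { capacity: 50, refillRate: 10 },
  enterprise: { capacity: 500, refillRate: 100 },
}

function limitsForTier(tier: string): EndpointLimitConfig {
  // Fall back to the standard tier for unknown plan names.
  return TIER_LIMITS[tier] ?? TIER_LIMITS.standard
}
```

Calling `limitsForTier` when an endpoint is registered, and passing the result to `setEndpointConfig`, keeps the plan-to-limit mapping in one place.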
Receiver-Side: Handling 429 Responses
When you are the receiver, consuming webhooks from a third-party service, you sometimes hit their rate limit.
Stripe, GitHub, and Shopify all enforce delivery rate limits on their end. When this happens, they respond
with HTTP 429 and usually include a Retry-After header telling you when to try again.
A common mistake is ignoring the Retry-After header and falling back to a fixed delay. The sender
gave you specific information about when they will accept requests again. Use it. The header comes in two formats:
a number of seconds (Retry-After: 30) or an HTTP-date (Retry-After: Sun, 01 Feb 2026 12:00:00 GMT).
Your parsing code needs to handle both.
function parseRetryAfter(header: string | null): number | null {
  if (!header) return null
  // Try parsing as delta-seconds first (the most common form).
  // A strict digit check also handles "Retry-After: 0" correctly.
  if (/^\d+$/.test(header.trim())) {
    return parseInt(header, 10) * 1000
  }
  // Otherwise try parsing as an HTTP-date
  const date = new Date(header)
  if (!isNaN(date.getTime())) {
    const delayMs = date.getTime() - Date.now()
    return delayMs > 0 ? delayMs : 100 // minimum 100ms if the date is in the past
  }
  return null
}

async function handleRateLimitedResponse(
  response: Response,
  retryFn: () => Promise<Response>
): Promise<Response> {
  if (response.status !== 429) return response
  const retryAfterMs = parseRetryAfter(
    response.headers.get('Retry-After')
  )
  const waitMs = retryAfterMs ?? 60_000 // default to 60s if no header
  await new Promise(resolve => setTimeout(resolve, waitMs))
  return retryFn()
}
One thing I always add: a maximum wait cap. If a Retry-After header says 3600 seconds (one hour),
I will not block a worker thread for an hour. Instead, I push that event into a delayed queue and free the
worker to process other deliveries. The cap depends on your architecture, but 5 minutes is a good starting
point for synchronous processing.
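A minimal sketch of that cap, assuming a delayed queue exists elsewhere in the system. `planRetry`, `WaitDecision`, and the five-minute constant are my own names for illustration, not a standard API.

```typescript
// Decide whether to honor a Retry-After wait inline or hand the event to a
// delayed queue. MAX_INLINE_WAIT_MS is the 5-minute cap suggested in the
// text; the delayed-queue mechanism itself is assumed to exist elsewhere.
const MAX_INLINE_WAIT_MS = 5 * 60 * 1000

type WaitDecision =
  | { action: 'wait'; waitMs: number }
  | { action: 'requeue'; deliverAt: number }

function planRetry(retryAfterMs: number, now: number = Date.now()): WaitDecision {
  if (retryAfterMs <= MAX_INLINE_WAIT_MS) {
    // Short enough to block this worker for.
    return { action: 'wait', waitMs: retryAfterMs }
  }
  // Too long to hold a worker: park the event with an absolute due time
  // and free the worker to process other deliveries.
  return { action: 'requeue', deliverAt: now + retryAfterMs }
}
```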
Backpressure: When the Receiver Pushes Back
Backpressure is the receiver telling the sender "slow down" in real time. The HTTP protocol gives us a clean
mechanism for this: return 429 with a Retry-After header. Any well-behaved webhook sender will
respect this signal and reduce its delivery rate.
On the receiver side, implementing backpressure means tracking your own processing capacity and rejecting webhooks when you are approaching your limit. The simplest approach is a bounded queue. Incoming webhooks go into the queue. A pool of workers drains the queue at a controlled pace. When the queue reaches a high-water mark, new incoming webhooks get a 429 response. When it drains below a low-water mark, you start accepting again.
I have found that setting the high-water mark at 80% of queue capacity and the low-water mark at 50% works well. This gives you a buffer zone that prevents rapid oscillation between accepting and rejecting. Without the gap, you get a pathological pattern where you reject one request, the queue drops to 79%, you accept the next request, it goes to 80%, you reject again, back and forth on every single request.
The Retry-After value you return should reflect your actual recovery time. If your queue is
80% full and your workers process 100 events per second, and the queue holds 1000 events, you have 800 events
ahead. That is 8 seconds of drain time to get back to a comfortable level. Return Retry-After: 10
to give yourself a margin.
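Putting the watermarks and the Retry-After calculation together, here is a sketch of a receiver-side gate. The class name and its drain-to-low-water estimate are illustrative choices; your own Retry-After math may differ.

```typescript
// Sketch of receiver-side backpressure: a bounded queue with high/low
// watermarks (80% / 50%, as discussed above) that tells callers when to
// return 429 and what Retry-After to send. `drainRatePerSec` is how fast
// your workers empty the queue.
class BackpressureGate {
  private depth = 0
  private rejecting = false

  constructor(
    private readonly capacity: number,
    private readonly drainRatePerSec: number,
    private readonly highWater = 0.8,
    private readonly lowWater = 0.5
  ) {}

  // Called when a webhook arrives. Returns null if accepted, or the
  // Retry-After value (in seconds) to send with a 429 if rejected.
  offer(): number | null {
    if (this.depth >= this.capacity * this.highWater) this.rejecting = true
    if (this.depth <= this.capacity * this.lowWater) this.rejecting = false
    if (this.rejecting) {
      // Estimate the time to drain back to the low-water mark, plus margin.
      const excess = this.depth - this.capacity * this.lowWater
      return Math.ceil(excess / this.drainRatePerSec) + 2
    }
    this.depth += 1
    return null
  }

  // Called by a worker after it finishes processing one event.
  done(): void {
    if (this.depth > 0) this.depth -= 1
  }
}
```

The gap between the two watermarks is what prevents the per-request oscillation described above: once the gate starts rejecting, it keeps rejecting until the queue has drained all the way down to the low-water mark.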
Batching as an Alternative to Rate Limiting
Sometimes the better answer is not to limit individual event delivery but to combine multiple events into
a single request. Instead of sending 500 order.updated webhooks during a bulk price change,
send one webhook with 500 events in the payload. This reduces connection overhead, TCP handshakes, and
the total number of requests the receiver needs to handle.
A batch collector accumulates events and flushes them based on two triggers: a size threshold (send when you have N events) and a time threshold (send after N seconds even if you have not hit the size limit). The time threshold prevents events from sitting in the buffer indefinitely during low-traffic periods.
class WebhookBatchCollector {
  private buffer: Array<{ type: string; data: unknown; timestamp: number }> = []
  private timer: ReturnType<typeof setTimeout> | null = null
  private readonly maxBatchSize: number
  private readonly flushIntervalMs: number
  private readonly onFlush: (events: Array<{ type: string; data: unknown; timestamp: number }>) => Promise<void>

  constructor(
    maxBatchSize: number,
    flushIntervalMs: number,
    onFlush: (events: Array<{ type: string; data: unknown; timestamp: number }>) => Promise<void>
  ) {
    this.maxBatchSize = maxBatchSize
    this.flushIntervalMs = flushIntervalMs
    this.onFlush = onFlush
  }

  add(eventType: string, data: unknown): void {
    this.buffer.push({
      type: eventType,
      data,
      timestamp: Date.now(),
    })
    if (this.buffer.length >= this.maxBatchSize) {
      // Fire-and-forget: onFlush must catch its own errors, or a failed
      // batch becomes an unhandled rejection.
      void this.flush()
      return
    }
    if (!this.timer) {
      this.timer = setTimeout(() => void this.flush(), this.flushIntervalMs)
    }
  }

  private async flush(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer)
      this.timer = null
    }
    if (this.buffer.length === 0) return
    const batch = this.buffer.splice(0)
    await this.onFlush(batch)
  }
}

// Usage: flush every 100 events or every 5 seconds
const collector = new WebhookBatchCollector(100, 5000, async (events) => {
  await fetch(endpointUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ events }),
  })
})
Batching does require receiver-side changes. The consumer needs to iterate over the array of events instead of processing a single event. In my experience, this is a worthwhile trade. A receiver that processes 100 events in a single request is dramatically more efficient than one that handles 100 separate HTTP connections, even if the total data volume is identical. The overhead is in the connections, not the payloads.
A word of caution on batch sizes: do not go above 500 events or 1MB per batch. Large payloads hit request size limits, increase timeout risk, and make partial failure handling messy. If one event in a batch of 1000 fails validation, does the whole batch fail? Do you need per-event error reporting? Keep batches small enough that retrying the entire batch on failure is acceptable.
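One defensive pattern is to enforce both caps on the sender before anything goes over the wire. This `splitIntoBatches` helper is a sketch: it sizes events by their JSON encoding, and a single oversized event still gets its own batch rather than being dropped.

```typescript
// Split events into batches capped by count and by serialized byte size
// (500 events / 1 MB, per the guidance above). Sizes are measured on each
// event's JSON encoding via TextEncoder.
function splitIntoBatches<T>(
  events: T[],
  maxCount = 500,
  maxBytes = 1_000_000
): T[][] {
  const encoder = new TextEncoder()
  const batches: T[][] = []
  let current: T[] = []
  let currentBytes = 0
  for (const event of events) {
    const size = encoder.encode(JSON.stringify(event)).length
    const wouldOverflow =
      current.length >= maxCount ||
      (current.length > 0 && currentBytes + size > maxBytes)
    if (wouldOverflow) {
      batches.push(current)
      current = []
      currentBytes = 0
    }
    current.push(event)
    currentBytes += size
  }
  if (current.length > 0) batches.push(current)
  return batches
}
```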
Monitoring Rate Limit Usage
Rate limiting without monitoring is flying blind. You need to know four things at all times: how often limits are being hit, how deep your delivery queues are, what your end-to-end delivery latency looks like, and which endpoints are consistently operating near their limits.
Track the ratio of rate-limited deliveries to total deliveries over 5-minute windows. If more than 5% of your deliveries are being delayed by rate limits, either your limits are too aggressive or your consumers need more capacity. Track p50 and p99 delivery latency. The gap between them tells you how much variance your rate limiting is introducing. A p50 of 200ms with a p99 of 30 seconds means some webhooks are sitting in a queue for a long time.
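A sketch of the per-window math, using nearest-rank percentiles. The function names are illustrative; a real system would feed this from whatever its metrics store records per delivery attempt.

```typescript
// Summarize one monitoring window: the ratio of rate-limited deliveries
// and the p50/p99 delivery latencies, using nearest-rank percentiles on a
// sorted copy of the latency samples.
interface WindowStats {
  throttledRatio: number
  p50Ms: number
  p99Ms: number
}

function percentile(sorted: number[], p: number): number {
  // Nearest-rank: p in [0, 1] over an ascending-sorted, non-empty array.
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1)
  return sorted[Math.max(0, idx)]
}

function summarizeWindow(latenciesMs: number[], throttledCount: number): WindowStats {
  const total = latenciesMs.length
  if (total === 0) return { throttledRatio: 0, p50Ms: 0, p99Ms: 0 }
  const sorted = [...latenciesMs].sort((a, b) => a - b)
  return {
    throttledRatio: throttledCount / total,
    p50Ms: percentile(sorted, 0.5),
    p99Ms: percentile(sorted, 0.99),
  }
}
```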
Queue depth is your early warning system. A steadily growing queue means events are arriving faster than you can deliver them. This is normal during bursts: the queue should grow and then drain. If the queue grows monotonically for more than 10 minutes, something is wrong. Either the consumer is down (and you should be circuit-breaking), or your rate limit is set too low for the sustained event volume.
Per-endpoint monitoring matters more than aggregate monitoring. A global average can look healthy while a single endpoint is drowning. I recommend alerting when any individual endpoint has more than 100 queued deliveries or when its average delivery latency exceeds 60 seconds. These thresholds will vary by system, but the principle holds: watch individual endpoints, not just totals.
Choosing the Right Strategy
These approaches are not mutually exclusive. In practice, a production system uses several of them together.
Token buckets smooth sender-side delivery. Per-endpoint limits respect individual consumer capacity. Receivers
return 429 with Retry-After when they are overloaded. Batching reduces total request volume for
high-throughput event types.
Start with the sender-side token bucket and per-endpoint limits. These handle 80% of rate limiting needs with minimal complexity. Add receiver-side backpressure if you are consuming webhooks from third-party services that burst unpredictably. Introduce batching when your event volume is high enough that the per-request HTTP overhead becomes a bottleneck, typically above 1000 events per minute to a single endpoint.
The mistake I see most often is treating rate limiting as a set-and-forget configuration. Event volumes change. Consumer capacity changes. Seasonal traffic patterns shift. Review your limits quarterly, tie them to monitoring data, and give consumers the ability to request higher limits when they need them. Rate limiting should be a conversation between sender and receiver, not a wall.