
Rate Limiting Webhooks: Protecting Both Sides of the Connection

Rate limiting for webhooks works differently than for APIs: token buckets, per-endpoint limits, backpressure, and handling 429 responses from both the sender's and receiver's perspectives.

WebhookVault Team · 8 min read

When 10,000 Events Arrive in 30 Seconds

A few years ago, I watched a SaaS team import 50,000 customer records through a bulk CSV upload. Each record triggered a customer.created webhook. Their downstream accounting integration received all 50,000 POST requests within about 40 seconds, exhausted its database connection pool in under 10 seconds, and fell over. The integration was offline for two hours. Every webhook after the first 300 or so was lost because there was no retry logic on either side.

API rate limiting is a solved problem. You slap a limit on inbound requests, return 429 when someone exceeds it, and move on. Webhooks are different because traffic flows in the opposite direction: the sender pushes data to the receiver. Neither side has full control. The sender does not know how much capacity the receiver has. The receiver cannot predict when a burst will arrive. Both sides need independent strategies, and those strategies need to coordinate through HTTP semantics when things go wrong.

Sender-Side: The Token Bucket

The token bucket algorithm is the standard approach for smoothing outbound webhook delivery. The idea is simple: you have a bucket that holds tokens. Each webhook delivery costs one token. Tokens refill at a fixed rate. When the bucket is empty, deliveries are queued until tokens become available.

Two parameters control behavior. Capacity is the maximum number of tokens the bucket can hold, and this determines your burst allowance. Refill rate is how many tokens are added per second, which determines your sustained throughput. A bucket with capacity 50 and refill rate 10/second allows a burst of 50 immediate deliveries, then sustains 10 per second after that. In practice, I set capacity to roughly 5x the refill rate, which absorbs short spikes without letting truly massive bursts through.

class TokenBucket {
  private tokens: number
  private lastRefill: number
  private readonly capacity: number
  private readonly refillRate: number // tokens per second

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity
    this.refillRate = refillRate
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  tryConsume(): boolean {
    this.refill()

    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }

    return false
  }

  getWaitTimeMs(): number {
    this.refill()

    if (this.tokens >= 1) return 0
    const tokensNeeded = 1 - this.tokens
    return Math.ceil((tokensNeeded / this.refillRate) * 1000)
  }

  private refill(): void {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    const newTokens = elapsed * this.refillRate

    this.tokens = Math.min(this.capacity, this.tokens + newTokens)
    this.lastRefill = now
  }
}

The tryConsume method is non-blocking and returns false immediately if no tokens are available. The getWaitTimeMs method tells the caller how long to wait before a token will be available. In a webhook delivery worker, you would call tryConsume before each delivery attempt, and if it returns false, schedule the delivery for getWaitTimeMs() milliseconds later. Never drop webhooks because of rate limiting. Queue them.
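That worker loop can be sketched as follows. The `RateLimiter` interface matches the `TokenBucket` methods above; the `Delivery` shape and the `deliver` callback are illustrative placeholders, not part of any real queue library.

```typescript
// A sketch of a delivery worker that queues instead of dropping when the
// bucket is empty. RateLimiter mirrors TokenBucket's public methods;
// Delivery and the deliver callback are hypothetical.

interface RateLimiter {
  tryConsume(): boolean
  getWaitTimeMs(): number
}

interface Delivery {
  endpointUrl: string
  payload: unknown
}

class DeliveryWorker {
  constructor(
    private readonly limiter: RateLimiter,
    private readonly deliver: (d: Delivery) => Promise<void>
  ) {}

  // Drains the queue in order, pausing (never dropping) when rate limited.
  async process(queue: Delivery[]): Promise<void> {
    while (queue.length > 0) {
      if (!this.limiter.tryConsume()) {
        // No token available: sleep until one will be, then try again.
        await new Promise(resolve =>
          setTimeout(resolve, this.limiter.getWaitTimeMs())
        )
        continue
      }
      const next = queue.shift()!
      await this.deliver(next)
    }
  }
}
```

In a real system the queue would be durable (a database table or message broker) rather than an in-memory array, so that pending deliveries survive a process restart.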

Per-Endpoint Limits

A global token bucket protects your own infrastructure, but it treats all consumers equally. That is rarely correct. A small startup running a single-instance Express server cannot handle the same throughput as a Fortune 500 company with an auto-scaling Kubernetes cluster. I have seen webhook platforms burn their smaller customers by delivering at a rate tuned for the largest.

Per-endpoint rate limits let you assign different delivery speeds to different consumers. The implementation is a Map of token buckets, keyed by endpoint ID. Each endpoint gets its own capacity and refill rate, configurable either through an API or through defaults based on the consumer's plan tier.

interface EndpointLimitConfig {
  capacity: number
  refillRate: number
}

class PerEndpointLimiter {
  private buckets = new Map<string, TokenBucket>()
  private configs = new Map<string, EndpointLimitConfig>()

  private readonly defaultConfig: EndpointLimitConfig = {
    capacity: 50,
    refillRate: 10,
  }

  setEndpointConfig(endpointId: string, config: EndpointLimitConfig): void {
    this.configs.set(endpointId, config)
    // Reset the bucket with new config
    this.buckets.set(
      endpointId,
      new TokenBucket(config.capacity, config.refillRate)
    )
  }

  tryConsume(endpointId: string): boolean {
    const bucket = this.getOrCreateBucket(endpointId)
    return bucket.tryConsume()
  }

  getWaitTimeMs(endpointId: string): number {
    const bucket = this.getOrCreateBucket(endpointId)
    return bucket.getWaitTimeMs()
  }

  private getOrCreateBucket(endpointId: string): TokenBucket {
    let bucket = this.buckets.get(endpointId)
    if (!bucket) {
      const config = this.configs.get(endpointId) ?? this.defaultConfig
      bucket = new TokenBucket(config.capacity, config.refillRate)
      this.buckets.set(endpointId, bucket)
    }
    return bucket
  }
}

A reasonable default is 50 capacity / 10 per second for standard-tier consumers. Enterprise endpoints might get 500 capacity / 100 per second. Free-tier endpoints might be throttled to 10 capacity / 2 per second. The key point is that these limits should be visible and configurable by the consumer, because nobody likes being throttled silently.
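One way to wire those tiers into the `PerEndpointLimiter` defaults is a simple lookup table. The numbers below come from the text; the tier names and the mapping itself are an assumed sketch, not a prescribed schema.

```typescript
// Illustrative plan-tier-to-limit mapping using the figures from the text.
// Tier names and the Record shape are assumptions.

type PlanTier = 'free' | 'standard' | 'enterprise'

interface EndpointLimitConfig {
  capacity: number
  refillRate: number // tokens per second
}

const TIER_LIMITS: Record<PlanTier, EndpointLimitConfig> = {
  free: { capacity: 10, refillRate: 2 },
  standard: { capacity: 50, refillRate: 10 },
  enterprise: { capacity: 500, refillRate: 100 },
}

function limitsForTier(tier: PlanTier): EndpointLimitConfig {
  return TIER_LIMITS[tier]
}
```

Exposing this table through your API (or documentation) is the cheapest way to make the limits visible to consumers.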

Receiver-Side: Handling 429 Responses

When you are on the receiving side, the requests your webhook handlers make back to the third-party service, to fetch full objects or confirm state, can hit that service's rate limit. Stripe, GitHub, and Shopify all enforce rate limits on their end. When you exceed them, they respond with HTTP 429 and usually include a Retry-After header telling you when to try again.

A common mistake is ignoring the Retry-After header and falling back to a fixed delay. The sender gave you specific information about when they will accept requests again. Use it. The header comes in two formats: a number of seconds (Retry-After: 30) or an HTTP-date (Retry-After: Thu, 01 Feb 2026 12:00:00 GMT). Your parsing code needs to handle both.

function parseRetryAfter(header: string | null): number | null {
  if (!header) return null

  // Try parsing as delay-seconds first (most common format).
  // Note: "Retry-After: 0" is valid and means "retry immediately".
  const seconds = parseInt(header, 10)
  if (!isNaN(seconds) && seconds >= 0) {
    return seconds * 1000
  }

  // Try parsing as HTTP-date
  const date = new Date(header)
  if (!isNaN(date.getTime())) {
    const delayMs = date.getTime() - Date.now()
    return delayMs > 0 ? delayMs : 100 // minimum 100ms if date is in the past
  }

  return null
}

async function handleRateLimitedResponse(
  response: Response,
  retryFn: () => Promise<Response>
): Promise<Response> {
  if (response.status !== 429) return response

  const retryAfterMs = parseRetryAfter(
    response.headers.get('Retry-After')
  )
  const waitMs = retryAfterMs ?? 60_000 // default to 60s if no header

  await new Promise(resolve => setTimeout(resolve, waitMs))
  return retryFn()
}

One thing I always add: a maximum wait cap. If a Retry-After header says 3600 seconds (one hour), I will not block a worker thread for an hour. Instead, I push that event into a delayed queue and free the worker to process other deliveries. The cap depends on your architecture, but 5 minutes is a good starting point for synchronous processing.
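A minimal sketch of that decision, assuming a 5-minute cap and some delayed-queue mechanism on the requeue path (the function name and return shape are illustrative):

```typescript
// Sketch: cap how long a worker may block on Retry-After.
// Waits beyond the cap are diverted to a delayed queue so the
// worker stays free. MAX_SYNC_WAIT_MS and the return shape are assumptions.

const MAX_SYNC_WAIT_MS = 5 * 60 * 1000 // 5 minutes

type RetryStrategy =
  | { mode: 'wait'; waitMs: number }      // block the worker briefly
  | { mode: 'requeue'; delayMs: number }  // hand off to a delayed queue

function decideRetryStrategy(retryAfterMs: number | null): RetryStrategy {
  const waitMs = retryAfterMs ?? 60_000 // default to 60s if no header
  if (waitMs <= MAX_SYNC_WAIT_MS) {
    return { mode: 'wait', waitMs }
  }
  // Too long to hold a worker thread: schedule for later instead.
  return { mode: 'requeue', delayMs: waitMs }
}
```

The requeue branch would typically map onto whatever your queue supports natively: delayed jobs in BullMQ or Sidekiq, visibility timeouts in SQS, and so on.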

Backpressure: When the Receiver Pushes Back

Backpressure is the receiver telling the sender "slow down" in real time. The HTTP protocol gives us a clean mechanism for this: return 429 with a Retry-After header. Any well-behaved webhook sender will respect this signal and reduce its delivery rate.

On the receiver side, implementing backpressure means tracking your own processing capacity and rejecting webhooks when you are approaching your limit. The simplest approach is a bounded queue. Incoming webhooks go into the queue. A pool of workers drains the queue at a controlled pace. When the queue reaches a high-water mark, new incoming webhooks get a 429 response. When it drains below a low-water mark, you start accepting again.

I have found that setting the high-water mark at 80% of queue capacity and the low-water mark at 50% works well. This gives you a buffer zone that prevents rapid oscillation between accepting and rejecting. Without the gap, you get a pathological pattern where you reject one request, the queue drops to 79%, you accept the next request, it goes to 80%, you reject again, back and forth on every single request.

The Retry-After value you return should reflect your actual recovery time. If your queue is 80% full and your workers process 100 events per second, and the queue holds 1000 events, you have 800 events ahead. That is 8 seconds of drain time to get back to a comfortable level. Return Retry-After: 10 to give yourself a margin.
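Putting the water marks and the Retry-After estimate together, a bounded queue might look like this. The 80%/50% marks and the drain-time arithmetic come from the text; the class shape and the two-second margin are assumptions.

```typescript
// Sketch of receiver-side backpressure: a bounded queue with high/low
// water marks. offer() returns null on accept, or a Retry-After value
// in seconds on reject, derived from the current drain time.

class BackpressureQueue<T> {
  private items: T[] = []
  private rejecting = false

  constructor(
    private readonly capacity: number,
    private readonly drainRatePerSec: number,
    private readonly highWater = 0.8,
    private readonly lowWater = 0.5
  ) {}

  offer(item: T): number | null {
    const fill = this.items.length / this.capacity
    // Hysteresis: flip states only at the water marks, not on every request.
    if (this.rejecting && fill <= this.lowWater) this.rejecting = false
    if (!this.rejecting && fill >= this.highWater) this.rejecting = true

    if (this.rejecting) {
      // Estimate seconds until the queue drains, plus a small margin.
      const drainSecs = this.items.length / this.drainRatePerSec
      return Math.ceil(drainSecs) + 2
    }
    this.items.push(item)
    return null
  }

  poll(): T | undefined {
    return this.items.shift()
  }
}
```

An HTTP handler wraps this trivially: on a non-null return value, respond with status 429 and that number in the Retry-After header; otherwise enqueue and respond 202.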

Batching as an Alternative to Rate Limiting

Sometimes the better answer is not to limit individual event delivery but to combine multiple events into a single request. Instead of sending 500 order.updated webhooks during a bulk price change, send one webhook with 500 events in the payload. This reduces connection overhead, TCP handshakes, and the total number of requests the receiver needs to handle.

A batch collector accumulates events and flushes them based on two triggers: a size threshold (send when you have N events) and a time threshold (send after N seconds even if you have not hit the size limit). The time threshold prevents events from sitting in the buffer indefinitely during low-traffic periods.

interface BatchedEvent {
  type: string
  data: unknown
  timestamp: number
}

class WebhookBatchCollector {
  private buffer: BatchedEvent[] = []
  private timer: ReturnType<typeof setTimeout> | null = null
  private readonly maxBatchSize: number
  private readonly flushIntervalMs: number
  private readonly onFlush: (events: BatchedEvent[]) => Promise<void>

  constructor(
    maxBatchSize: number,
    flushIntervalMs: number,
    onFlush: (events: BatchedEvent[]) => Promise<void>
  ) {
    this.maxBatchSize = maxBatchSize
    this.flushIntervalMs = flushIntervalMs
    this.onFlush = onFlush
  }

  add(eventType: string, data: unknown): void {
    this.buffer.push({
      type: eventType,
      data,
      timestamp: Date.now(),
    })

    if (this.buffer.length >= this.maxBatchSize) {
      // Fire-and-forget; onFlush should handle its own errors so a
      // failed delivery does not surface as an unhandled rejection.
      void this.flush()
      return
    }

    if (!this.timer) {
      this.timer = setTimeout(() => void this.flush(), this.flushIntervalMs)
    }
  }

  private async flush(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer)
      this.timer = null
    }

    if (this.buffer.length === 0) return

    const batch = this.buffer.splice(0)
    await this.onFlush(batch)
  }
}

// Usage: flush every 100 events or every 5 seconds
const collector = new WebhookBatchCollector(100, 5000, async (events) => {
  await fetch(endpointUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ events }),
  })
})

Batching does require receiver-side changes. The consumer needs to iterate over the array of events instead of processing a single event. In my experience, this is a worthwhile trade. A receiver that processes 100 events in a single request is dramatically more efficient than one that handles 100 separate HTTP connections, even if the total data volume is identical. The overhead is in the connections, not the payloads.

A word of caution on batch sizes: do not go above 500 events or 1MB per batch. Large payloads hit request size limits, increase timeout risk, and make partial failure handling messy. If one event in a batch of 1000 fails validation, does the whole batch fail? Do you need per-event error reporting? Keep batches small enough that retrying the entire batch on failure is acceptable.

Monitoring Rate Limit Usage

Rate limiting without monitoring is flying blind. You need to know four things at all times: how often limits are being hit, how deep your delivery queues are, what your end-to-end delivery latency looks like, and which endpoints are consistently operating near their limits.

Track the ratio of rate-limited deliveries to total deliveries over 5-minute windows. If more than 5% of your deliveries are being delayed by rate limits, either your limits are too aggressive or your consumers need more capacity. Track p50 and p99 delivery latency. The gap between them tells you how much variance your rate limiting is introducing. A p50 of 200ms with a p99 of 30 seconds means some webhooks are sitting in a queue for a long time.
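A sliding-window ratio tracker for that first metric might be sketched like this. The 5-minute window and 5% threshold mirror the text; the class shape and method names are assumptions.

```typescript
// Sketch: track the share of rate-limited deliveries over a sliding
// window and flag when it crosses an alert threshold.

class RateLimitRatioTracker {
  private events: Array<{ at: number; limited: boolean }> = []

  constructor(private readonly windowMs = 5 * 60 * 1000) {}

  // Record one delivery attempt; `limited` means it was delayed or rejected
  // by a rate limit. `now` is injectable to make the class testable.
  record(limited: boolean, now = Date.now()): void {
    this.events.push({ at: now, limited })
    const cutoff = now - this.windowMs
    while (this.events.length > 0 && this.events[0].at < cutoff) {
      this.events.shift()
    }
  }

  limitedRatio(): number {
    if (this.events.length === 0) return 0
    const limited = this.events.filter(e => e.limited).length
    return limited / this.events.length
  }

  shouldAlert(threshold = 0.05): boolean {
    return this.limitedRatio() > threshold
  }
}
```

In production you would feed this from your delivery pipeline and export the ratio as a gauge to your metrics system rather than keeping it in process memory.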

Queue depth is your early warning system. A steadily growing queue means events are arriving faster than you can deliver them. This is normal during bursts: the queue should grow and then drain. If the queue grows monotonically for more than 10 minutes, something is wrong. Either the consumer is down (and you should be circuit-breaking), or your rate limit is set too low for the sustained event volume.

Per-endpoint monitoring matters more than aggregate monitoring. A global average can look healthy while a single endpoint is drowning. I recommend alerting when any individual endpoint has more than 100 queued deliveries or when its average delivery latency exceeds 60 seconds. These thresholds will vary by system, but the principle holds: watch individual endpoints, not just totals.

Choosing the Right Strategy

These approaches are not mutually exclusive. In practice, a production system uses several of them together. Token buckets smooth sender-side delivery. Per-endpoint limits respect individual consumer capacity. Receivers return 429 with Retry-After when they are overloaded. Batching reduces total request volume for high-throughput event types.

Start with the sender-side token bucket and per-endpoint limits. These handle 80% of rate limiting needs with minimal complexity. Add receiver-side backpressure if you are consuming webhooks from third-party services that burst unpredictably. Introduce batching when your event volume is high enough that the per-request HTTP overhead becomes a bottleneck, typically above 1000 events per minute to a single endpoint.

The mistake I see most often is treating rate limiting as a set-and-forget configuration. Event volumes change. Consumer capacity changes. Seasonal traffic patterns shift. Review your limits quarterly, tie them to monitoring data, and give consumers the ability to request higher limits when they need them. Rate limiting should be a conversation between sender and receiver, not a wall.