Your webhook endpoint returns 200 OK in 47ms. Nice. Clean. Fast.
Then Stripe sends you 50,000 subscription renewal webhooks at midnight on the first of the month. Your database connection pool maxes out, your API starts throwing 503s, and you're getting paged at 2 AM because somehow a payment webhook from 3 hours ago still hasn't processed.
Synchronous webhook processing works great until it doesn't.
The Synchronous Trap
Most teams start here because it's simple. Webhook comes in, you validate it, you do the thing, you return 200. Maybe 50 lines of code total. Works perfectly fine when you're processing a few hundred webhooks a day.
But webhooks have this nasty habit of arriving in bursts. Black Friday sales. End of month billing cycles. A partner API having a bad day and retrying everything at once. A single marketing campaign that goes viral and suddenly every third-party integration you have is screaming at your servers simultaneously.
The problem isn't handling one webhook. The problem is handling one webhook while 500 others are waiting in line, and your webhook provider has a 30-second timeout before they mark you as failed and start exponential backoff retries.
Queue-First Architecture
The fix is embarrassingly simple in concept. Accept the webhook, throw it on a queue, return 200. Process it later.
// the endpoint does almost nothing
// note: Stripe computes signatures over the raw body, so mount this
// route with express.raw(), not express.json()
app.post('/webhooks/stripe', express.raw({ type: 'application/json' }), async (req, res) => {
  const signature = req.headers['stripe-signature']
  // quick signature check, bail if invalid
  if (!verifyStripeSignature(req.body, signature)) {
    return res.status(401).send('Invalid signature')
  }
  // don't process. just queue.
  await queue.publish('stripe-webhooks', {
    payload: JSON.parse(req.body.toString()), // req.body is a raw Buffer here
    receivedAt: Date.now(),
    signature: signature
  })
  // respond immediately
  res.status(200).send('Accepted')
})
Your endpoint now responds in single-digit milliseconds. You could handle 10,000 webhooks per second and your response time wouldn't budge. The actual processing happens somewhere else entirely.
Why This Works Better Than You'd Think
Queues give you superpowers that synchronous processing can't match.
Backpressure. When your workers can't keep up, the queue just grows. No dropped requests, no timeouts, no angry retry storms. The work waits patiently until you can handle it.
Horizontal scaling. Need more processing capacity? Spin up more workers. They all pull from the same queue. No load balancer configuration, no sticky sessions, no coordination required.
Retry without re-delivery. If processing fails, the message goes back on the queue. You don't need to ask the webhook provider to send it again. You already have it.
Rate limiting yourself. Yeah, sounds weird. But sometimes you WANT to limit how fast you hit a downstream API. Processing from a queue lets you control the pace without rejecting incoming webhooks.
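In BullMQ this is a one-line worker option (`limiter: { max: 50, duration: 1000 }` caps a worker at 50 jobs per second). If you're consuming from a bare queue in your own loop, the pacing logic underneath is just a fixed-window check. A minimal sketch, with `maxPerWindow` and `windowMs` as made-up knobs:

```typescript
// fixed-window pacer: how long to wait before taking the next job,
// given how many we've already processed in the current window
function nextDelayMs(
  processedInWindow: number,
  windowStartMs: number,
  nowMs: number,
  maxPerWindow: number,
  windowMs: number
): number {
  // window expired: a fresh budget, no delay
  if (nowMs - windowStartMs >= windowMs) return 0
  // budget remains in this window: go ahead
  if (processedInWindow < maxPerWindow) return 0
  // budget spent: sleep until the window rolls over
  return windowMs - (nowMs - windowStartMs)
}
```

Your consume loop calls this before each job and sleeps for the returned delay. Incoming webhooks keep landing on the queue untouched; only the drain rate changes.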
Picking the Right Queue
Redis with Bull or BullMQ handles most use cases. Fast, simple, good visibility into what's happening. I've seen teams run this in production for years handling millions of jobs daily.
import { Queue, Worker } from 'bullmq'

const webhookQueue = new Queue('webhooks', {
  connection: { host: 'localhost', port: 6379 }
})

// worker can be in a completely different process, different server
const worker = new Worker('webhooks', async (job) => {
  const { payload, receivedAt } = job.data
  // actual processing happens here
  // take as long as you need
  await processWebhook(payload)
  // track how long the job waited
  const waitTime = Date.now() - receivedAt
  metrics.histogram('webhook.queue.wait_time', waitTime)
}, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 10 // 10 parallel jobs per worker
})
For serious scale or when you need durability guarantees, look at SQS, RabbitMQ, or Kafka. SQS is braindead simple if you're already on AWS. Kafka is overkill for most webhook use cases unless you're doing event sourcing or need to replay history.
The Worker Pattern That Actually Works
Don't just dump webhooks into a single queue. Different webhook types have different processing characteristics. Stripe payment webhooks are critical and need fast processing. A GitHub star notification can wait.
// route to different queues by priority
const QUEUE_ROUTING: Record<string, string> = {
  'payment_intent.succeeded': 'critical',
  'customer.subscription.deleted': 'critical',
  'invoice.payment_failed': 'critical',
  'customer.updated': 'standard',
  'product.created': 'bulk',
  // everything else
  'default': 'standard'
}

function routeWebhook(type: string): string {
  return QUEUE_ROUTING[type] || QUEUE_ROUTING['default']
}

// different queues, different workers, different scaling
const criticalWorkers = 20 // fast, always available
const standardWorkers = 5
const bulkWorkers = 2 // can be slow, nobody cares
Run more workers on critical queues. Give them more resources. Monitor their latency separately. You don't want a backlog of low-priority webhooks affecting your payment processing.
Handling Failures Without Losing Data
Jobs fail. Networks flake out, databases restart, downstream APIs go down for maintenance at the worst possible time. Your queue needs to handle this gracefully.
// in BullMQ, retries and backoff are job options set when you enqueue,
// not worker settings
await webhookQueue.add('webhook', payload, {
  attempts: 5,
  backoff: {
    type: 'exponential',
    delay: 1000 // 1s, 2s, 4s, 8s, 16s...
  }
})

const worker = new Worker('webhooks', processWebhook, {
  connection: redisConnection,
  concurrency: 10
})
worker.on('failed', async (job, err) => {
  // job can be undefined if the failure happened outside a job's context
  if (job && job.attemptsMade >= 5) {
    // out of retries: move to dead letter queue
    await deadLetterQueue.add('failed-webhook', {
      originalJob: job.data,
      failedAt: Date.now(),
      error: err.message,
      attempts: job.attemptsMade
    })
    alert(`Webhook permanently failed: ${job.data.payload.type}`)
  }
})
Dead letter queues are non-negotiable. When a webhook fails after multiple retries, you need it stored somewhere you can investigate later. I've debugged production issues by replaying webhooks from a DLQ weeks after the original failure.
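Replay doesn't need anything fancy: read dead letters, filter, put the original job back on the live queue. A sketch assuming the DLQ entry shape written by the failed handler above (`originalJob`, `error`, `failedAt`), with `enqueue` standing in for your queue's add call:

```typescript
interface DeadLetter {
  originalJob: { payload: any }
  error: string
  failedAt: number
}

// re-enqueue dead letters that match a filter; returns how many were replayed
async function replayDeadLetters(
  deadLetters: DeadLetter[],
  enqueue: (job: { payload: any }) => Promise<void>,
  shouldReplay: (d: DeadLetter) => boolean = () => true
): Promise<number> {
  let replayed = 0
  for (const dead of deadLetters) {
    if (!shouldReplay(dead)) continue
    // back onto the live queue; your idempotency layer absorbs any dupes
    await enqueue(dead.originalJob)
    replayed++
  }
  return replayed
}
```

Pair it with a filter like `d => d.error.includes('ECONNREFUSED')` to replay only the failures caused by a known outage, and leave the genuinely bad payloads in the DLQ for inspection.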
Idempotency Becomes Critical
Queues retry. Webhook providers retry. Sometimes your own monitoring replays events for debugging. The same webhook might get processed three, four, five times.
Every processor needs to be idempotent. You've probably heard this before. But with queues it's not optional, it's mandatory.
async function processPaymentWebhook(payload: StripePayment) {
  const eventId = payload.id
  // check if we've already processed this
  const processed = await redis.get(`processed:${eventId}`)
  if (processed) {
    console.log(`Skipping duplicate: ${eventId}`)
    return
  }
  // process the payment
  await updateUserSubscription(payload)
  await sendConfirmationEmail(payload)
  // mark as processed with TTL
  // keep for 7 days to handle late duplicates
  await redis.set(`processed:${eventId}`, '1', 'EX', 604800)
}
If you want to be paranoid, write the event ID both before and after processing: claim it at the start, confirm it at the end, and handle the edge case where processing succeeded but the final mark failed.
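The get-then-set above also has a small race window: two workers can both read "not processed" before either writes. Redis closes it in one command with `SET ... NX`, which only sets the key if it doesn't exist. A sketch against an ioredis-style client (`claimEvent` and the client shape are illustrative, not a real library API):

```typescript
// returns true if this call won the claim, false if the event was already claimed
async function claimEvent(
  redis: { set: (...args: (string | number)[]) => Promise<string | null> },
  eventId: string,
  ttlSeconds = 604800 // keep for 7 days to handle late duplicates
): Promise<boolean> {
  // SET key value EX ttl NX: one round trip, no get-then-set race
  const result = await redis.set(`processed:${eventId}`, '1', 'EX', ttlSeconds, 'NX')
  return result === 'OK'
}
```

Claim before processing, and if the handler then throws, delete the key so a retry can claim it again; claiming first means a concurrent duplicate can never slip between the check and the write.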
Monitoring the Queue Itself
A queue you can't observe is a bomb waiting to go off. You need visibility into queue depth, processing rates, failure rates, and latency.
// emit metrics from your worker
setInterval(async () => {
  const waiting = await webhookQueue.getWaitingCount()
  const active = await webhookQueue.getActiveCount()
  const failed = await webhookQueue.getFailedCount()

  metrics.gauge('webhook.queue.waiting', waiting)
  metrics.gauge('webhook.queue.active', active)
  metrics.gauge('webhook.queue.failed', failed)

  // alert if queue is backing up
  if (waiting > 1000) {
    alert(`Webhook queue backing up: ${waiting} waiting`)
  }
}, 10000)
Set alerts on queue growth. If waiting jobs keep climbing while active stays constant, you need more workers. If failed keeps growing, something's broken downstream.
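That "climbing while active stays flat" rule is easy to encode over your last few metric samples. A minimal sketch, with the sample shape made up to match the gauges above:

```typescript
interface QueueSample { waiting: number; active: number }

// true when the backlog grew across the window but worker throughput didn't:
// the classic signature of being under-provisioned
function needsMoreWorkers(samples: QueueSample[]): boolean {
  if (samples.length < 2) return false
  const first = samples[0]
  const last = samples[samples.length - 1]
  return last.waiting > first.waiting && last.active >= first.active
}
```

Feed it the last five or ten samples from the interval above and alert on `true`; a single noisy sample won't trip it, only a sustained trend.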
When NOT to Use Queues
Sometimes synchronous is actually fine. If you're processing fewer than a hundred webhooks per hour and they're all simple DB writes, the queue overhead isn't worth it. Just make sure your endpoint is fast and your database can handle bursts.
Also avoid queues when you need immediate confirmation to the webhook sender. Some payment gateways want you to return specific data in the response. You can't queue those. But you CAN queue any follow-up work that doesn't need to be in the response.
// partial queuing: validate and respond, queue the heavy stuff
app.post('/webhooks/payment', async (req, res) => {
  const result = await validateAndRecordPayment(req.body) // fast
  // queue the slow stuff
  await queue.publish('post-payment-tasks', {
    paymentId: result.id,
    tasks: ['send-email', 'update-analytics', 'notify-slack']
  })
  res.json({ received: true, paymentId: result.id })
})
The Architecture That Scales
After running webhook infrastructure for multiple high-volume systems, the pattern that works looks something like this:
Stateless receiver pods that do nothing but validate and queue. Scale these based on incoming request rate. They should be cheap and fast.
Multiple worker pools organized by priority. Critical webhooks get dedicated workers that are always available. Standard webhooks share a pool. Bulk operations run during off-peak hours.
A dead letter queue with alerting. When things end up here, someone should know.
Metrics on everything. Queue depth, processing time, failure rates, retry counts. You can't fix what you can't see.
The initial setup takes maybe a day longer than synchronous processing. But when that burst of 50,000 webhooks hits, you'll be glad you spent the time. Your endpoint returns 200 in 5ms, your workers chew through the backlog over the next hour, and you sleep through the night.
Queue-based webhook processing isn't fancy. It's just engineering basics applied correctly. Accept fast, process slow, never lose data. That's it.