
Webhook Integration Patterns: From Direct Processing to Event-Driven Architecture

Most webhook handlers start as a single endpoint with inline processing. When that breaks down, you need queues, fan-out, gateways, and sagas. Patterns for scaling webhook consumers.

WebhookVault Team · 8 min read

Direct Processing: The Starting Point

Every webhook integration starts the same way. You stand up an endpoint, parse the payload, do some work, and return a 200. It looks something like this:

app.post('/webhooks/payments', async (req, res) => {
  const event = req.body

  // Verify signature and reject on mismatch
  if (!verifySignature(req.headers['x-signature'], req.body)) {
    return res.status(401).json({ error: 'invalid signature' })
  }

  // Process inline
  const order = await db.orders.findById(event.data.orderId)
  order.status = 'paid'
  await db.orders.save(order)

  // Send confirmation email
  await emailService.send(order.customerEmail, 'order-confirmed', order)

  // Update analytics
  await analytics.track('payment.completed', { amount: event.data.amount })

  res.status(200).json({ received: true })
})

This works beautifully for the first few months. Then you add a third-party inventory sync that takes 3 seconds. Then the email service starts timing out under load. Then someone adds a Slack notification that occasionally hangs. Suddenly your 200ms handler takes 12 seconds, and the webhook provider starts retrying because you blew past their 5-second timeout window.

I've seen teams hit this wall around 500-1000 webhooks per hour. The symptoms are predictable: duplicate processing from retries, cascading failures when one downstream service goes down, and a growing backlog of unprocessed events. The fix isn't to optimize the inline code. It's to change the architecture.

Queue-Based Architecture

The single most effective pattern change is separating receipt from processing. Your webhook endpoint does exactly two things: validate the request and push it onto a queue. Everything else happens asynchronously.

import { Queue, Worker } from 'bullmq'

const webhookQueue = new Queue('webhooks', {
  connection: { host: 'redis', port: 6379 },
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: 1000,
    removeOnFail: 5000,
  },
})

// Endpoint: acknowledge fast, enqueue
app.post('/webhooks/payments', async (req, res) => {
  if (!verifySignature(req.headers['x-signature'], req.body)) {
    return res.status(401).json({ error: 'invalid signature' })
  }

  await webhookQueue.add('payment.received', {
    payload: req.body,
    headers: req.headers,
    receivedAt: Date.now(),
  })

  res.status(200).json({ received: true })
})

// Worker: process at your own pace
const worker = new Worker('webhooks', async (job) => {
  const { payload } = job.data

  switch (payload.type) {
    case 'payment.completed':
      await handlePaymentCompleted(payload)
      break
    case 'payment.failed':
      await handlePaymentFailed(payload)
      break
    default:
      console.log('Unhandled event type:', payload.type)
  }
}, { connection: { host: 'redis', port: 6379 }, concurrency: 10 })

The tradeoff is real: you've added Redis as a dependency, introduced a worker process that needs monitoring, and made debugging harder because errors happen away from the HTTP request. But you've gained something valuable. Your webhook endpoint now responds in under 50ms regardless of what happens downstream. Failed jobs retry independently. You can scale workers horizontally without touching the endpoint. And if your entire processing pipeline goes down, the queue holds your events until it comes back.

A common mistake is treating the queue as a simple buffer. It's not. You need dead letter queues for permanently failed jobs, monitoring on queue depth, and alerting when processing latency exceeds your SLA. In practice, I recommend setting up a dashboard that shows jobs processed per minute, average processing time, and failure rate. If any of those metrics drift, you want to know before your users do.
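To make "monitor the queue" concrete, here's a minimal sketch: a pure health check over the job counts BullMQ exposes via `getJobCounts()`. The threshold numbers and the `alertOncall` function are illustrative assumptions, not part of any library:

```typescript
// Thresholds are illustrative; derive yours from your SLA
interface QueueCounts {
  waiting: number
  failed: number
}

function assessQueueHealth(
  counts: QueueCounts,
  limits = { maxWaiting: 5000, maxFailed: 100 }
): { ok: boolean; reasons: string[] } {
  const reasons: string[] = []
  if (counts.waiting > limits.maxWaiting) {
    reasons.push(`queue depth ${counts.waiting} over ${limits.maxWaiting}`)
  }
  if (counts.failed > limits.maxFailed) {
    reasons.push(`${counts.failed} failed jobs over ${limits.maxFailed}`)
  }
  return { ok: reasons.length === 0, reasons }
}

// Poll BullMQ on an interval (sketch; alertOncall is hypothetical):
// const counts = await webhookQueue.getJobCounts('waiting', 'failed')
// const health = assessQueueHealth({ waiting: counts.waiting, failed: counts.failed })
// if (!health.ok) alertOncall(health.reasons)
```

Keeping the threshold logic pure like this also makes it trivial to unit test, which matters more than it sounds for alerting code you only exercise during incidents.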

The Fan-Out Pattern

Some webhooks need to trigger multiple independent actions. An e-commerce order webhook might need to update inventory, send a confirmation email, notify the warehouse, update analytics, and sync to the CRM. These actions don't depend on each other, and a failure in one shouldn't block the others.

type WebhookHandler = (payload: Record<string, unknown>) => Promise<void>

const handlers: Record<string, WebhookHandler[]> = {
  'order.created': [
    updateInventory,
    sendConfirmationEmail,
    notifyWarehouse,
    trackAnalytics,
    syncToCRM,
  ],
  'order.cancelled': [
    restoreInventory,
    sendCancellationEmail,
    trackAnalytics,
  ],
}

async function dispatch(eventType: string, payload: Record<string, unknown>) {
  const eventHandlers = handlers[eventType] || []

  const results = await Promise.allSettled(
    eventHandlers.map(handler => handler(payload))
  )

  const failures = results.filter(r => r.status === 'rejected')
  if (failures.length > 0) {
    console.error(`${failures.length}/${results.length} handlers failed for ${eventType}`)
    // Enqueue failed handlers for retry, not the whole batch
    for (let i = 0; i < results.length; i++) {
      if (results[i].status === 'rejected') {
        await retryQueue.add('retry-handler', {
          handlerName: eventHandlers[i].name,
          payload,
          eventType,
        })
      }
    }
  }
}

The key detail here is Promise.allSettled instead of Promise.all. If the email service is down, you still want inventory updated and analytics tracked. Each handler runs in isolation, and only the failures get retried. This sounds obvious, but I've reviewed codebases where a single failing handler caused every downstream action to be skipped and the webhook to return a 500, triggering a full retry of everything.

The Webhook Gateway

Once you're consuming webhooks from 3 or more providers, the differences in payload format, authentication, and event naming become a maintenance problem. Stripe sends payment_intent.succeeded. PayPal sends PAYMENT.CAPTURE.COMPLETED. Shopify sends orders/paid. Your business logic shouldn't care about any of that.

A webhook gateway sits at the edge and normalizes everything into a unified internal event format.

interface InternalEvent {
  id: string
  source: 'stripe' | 'paypal' | 'shopify'
  type: string                    // your internal event name
  occurredAt: string
  data: Record<string, unknown>
  raw: Record<string, unknown>    // original payload for debugging
}

const normalizers: Record<string, (payload: any) => InternalEvent | null> = {
  stripe: (payload) => {
    const typeMap: Record<string, string> = {
      'payment_intent.succeeded': 'payment.completed',
      'payment_intent.payment_failed': 'payment.failed',
      'customer.subscription.deleted': 'subscription.cancelled',
    }
    const internalType = typeMap[payload.type]
    if (!internalType) return null

    return {
      id: payload.id,
      source: 'stripe',
      type: internalType,
      occurredAt: new Date(payload.created * 1000).toISOString(),
      data: {
        amount: payload.data.object.amount / 100,
        currency: payload.data.object.currency,
        customerId: payload.data.object.customer,
      },
      raw: payload,
    }
  },

  paypal: (payload) => {
    const typeMap: Record<string, string> = {
      'PAYMENT.CAPTURE.COMPLETED': 'payment.completed',
      'PAYMENT.CAPTURE.DENIED': 'payment.failed',
    }
    const internalType = typeMap[payload.event_type]
    if (!internalType) return null

    return {
      id: payload.id,
      source: 'paypal',
      type: internalType,
      occurredAt: payload.create_time,
      data: {
        amount: parseFloat(payload.resource.amount.value),
        currency: payload.resource.amount.currency_code,
        customerId: payload.resource.payer?.payer_id,
      },
      raw: payload,
    }
  },
}

// Gateway endpoint
app.post('/webhooks/:provider', async (req, res) => {
  const normalizer = normalizers[req.params.provider]
  if (!normalizer) return res.status(404).end()

  const event = normalizer(req.body)
  if (!event) return res.status(200).json({ skipped: true })

  await eventBus.publish(event)
  res.status(200).json({ received: true })
})

After this transformation, every downstream consumer works with the same InternalEvent shape. Adding a new payment provider means writing one normalizer function, not touching every handler in your system. The raw field preserves the original payload for debugging, which you'll be grateful for when a normalizer has a bug and you need to reprocess events.
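The `eventBus` in the gateway could be anything from Kafka to a simple in-process dispatcher. As a minimal illustration of the consumer side (the `SimpleEventBus` class is hypothetical, not a library API), a dispatcher that reuses the fan-out pattern's failure isolation:

```typescript
// Trimmed version of the InternalEvent shape, enough for dispatching
interface BusEvent {
  type: string
  data: Record<string, unknown>
}

type BusListener = (event: BusEvent) => Promise<void>

class SimpleEventBus {
  private listeners = new Map<string, BusListener[]>()

  subscribe(type: string, listener: BusListener) {
    this.listeners.set(type, [...(this.listeners.get(type) ?? []), listener])
  }

  async publish(event: BusEvent) {
    const matched = this.listeners.get(event.type) ?? []
    // allSettled so one failing consumer doesn't block the others
    await Promise.allSettled(matched.map(listener => listener(event)))
  }
}
```

Because consumers subscribe to internal event names like `payment.completed`, none of them ever see a Stripe or PayPal payload shape.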

The Saga Pattern

Some workflows span multiple services and take minutes or hours to complete. A subscription signup might involve: create account, provision resources, charge payment, send welcome email. Each step depends on the previous one succeeding, and if step 3 fails, you need to undo steps 1 and 2.

This is where the saga pattern fits. A saga orchestrator tracks the state of a multi-step workflow and knows how to compensate (roll back) when a step fails.

interface SagaStep {
  name: string
  execute: (context: Record<string, unknown>) => Promise<Record<string, unknown>>
  compensate: (context: Record<string, unknown>) => Promise<void>
}

async function executeSaga(
  sagaId: string,
  steps: SagaStep[],
  initialContext: Record<string, unknown>
) {
  let context = { ...initialContext }
  const completed: SagaStep[] = []

  for (const step of steps) {
    try {
      const result = await step.execute(context)
      context = { ...context, ...result }
      completed.push(step)
      await saveSagaState(sagaId, step.name, 'completed', context)
    } catch (error) {
      await saveSagaState(sagaId, step.name, 'failed', context)

      // Compensate in reverse order
      for (const completedStep of completed.reverse()) {
        try {
          await completedStep.compensate(context)
          await saveSagaState(sagaId, completedStep.name, 'compensated', context)
        } catch (compensateError) {
          // Compensation failed, needs manual intervention
          await saveSagaState(sagaId, completedStep.name, 'compensation-failed', context)
          await alertOps(sagaId, completedStep.name, compensateError)
        }
      }
      throw error
    }
  }
  return context
}

// Usage: triggered by a webhook
const subscriptionSignup: SagaStep[] = [
  {
    name: 'create-account',
    execute: async (ctx) => {
      const account = await accountService.create(ctx.email as string)
      return { accountId: account.id }
    },
    compensate: async (ctx) => { await accountService.delete(ctx.accountId as string) },
  },
  {
    name: 'provision-resources',
    execute: async (ctx) => {
      const resources = await provisionService.create(ctx.accountId as string)
      return { resourceIds: resources.map(r => r.id) }
    },
    compensate: async (ctx) => {
      for (const id of ctx.resourceIds as string[]) {
        await provisionService.delete(id)
      }
    },
  },
  {
    name: 'charge-payment',
    execute: async (ctx) => {
      const charge = await billingService.charge(ctx.accountId as string, ctx.plan as string)
      return { chargeId: charge.id }
    },
    compensate: async (ctx) => { await billingService.refund(ctx.chargeId as string) },
  },
]

The hardest part of sagas is compensation failure. If creating the account succeeds, provisioning succeeds, but charging fails, you need to deprovision and delete the account. But what if deprovisioning also fails? You're now in an inconsistent state that requires human intervention. Always persist saga state to a database and build an admin view that shows stuck sagas. In my experience, about 0.1-0.5% of sagas hit compensation failures, and having visibility into those cases saves hours of debugging.
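A minimal version of that persistence layer might look like the following. This is a sketch: the store is in-memory for illustration (production would be a database table keyed by `sagaId`), and `findStuckSagas` is the query your admin view would run:

```typescript
interface SagaStateRecord {
  sagaId: string
  step: string
  status: 'completed' | 'failed' | 'compensated' | 'compensation-failed'
  context: Record<string, unknown>
  updatedAt: number
}

// In-memory store for illustration; production uses a durable table
const sagaStates: SagaStateRecord[] = []

async function saveSagaState(
  sagaId: string,
  step: string,
  status: SagaStateRecord['status'],
  context: Record<string, unknown>
) {
  sagaStates.push({ sagaId, step, status, context, updatedAt: Date.now() })
}

// A saga is stuck if its most recent record is a compensation failure
function findStuckSagas(): string[] {
  const latest = new Map<string, SagaStateRecord['status']>()
  for (const record of sagaStates) latest.set(record.sagaId, record.status)
  return [...latest].filter(([, s]) => s === 'compensation-failed').map(([id]) => id)
}
```

Storing the full `context` with each record is deliberate: when an operator resolves a stuck saga by hand, they need the `accountId` and `chargeId` that were in flight, not just the step name.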

Event Sourcing Intersection

There's an interesting overlap between webhooks and event sourcing. If you store every raw webhook payload as an immutable event, you can reconstruct the current state of any entity by replaying its events. An order's state becomes the sum of order.created, payment.completed, shipment.dispatched, and delivery.confirmed events rather than a mutable row in a database.
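As a sketch of that replay (the event and status names here are assumptions for illustration), the current order state is just a fold over the stored events in timestamp order:

```typescript
// Illustrative event shape; real events would carry payload data too
interface OrderEvent {
  type: 'order.created' | 'payment.completed' | 'shipment.dispatched' | 'delivery.confirmed'
  occurredAt: string // ISO timestamp
}

type OrderStatus = 'pending' | 'paid' | 'shipped' | 'delivered'

// Current state is a left fold over the event history
function replayOrder(events: OrderEvent[]): OrderStatus {
  const transitions: Record<OrderEvent['type'], OrderStatus> = {
    'order.created': 'pending',
    'payment.completed': 'paid',
    'shipment.dispatched': 'shipped',
    'delivery.confirmed': 'delivered',
  }
  return [...events]
    .sort((a, b) => a.occurredAt.localeCompare(b.occurredAt))
    .reduce<OrderStatus>((_state, event) => transitions[event.type], 'pending')
}
```

Note the sort before the fold: webhooks routinely arrive out of order, so replay must order by the event's own timestamp, not by arrival time.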

This approach shines in two scenarios. First, audit trails: financial and healthcare systems often require a complete history of every state change, and webhook events provide that naturally. Second, replay capability: when you discover a bug in your processing logic, you can fix the code and replay the stored events to correct the data. Without event sourcing, you'd need to ask the provider to resend months of webhooks, which most won't do.

But event sourcing adds real complexity. Deriving current state from hundreds of events is slower than reading a single database row. You'll need snapshots to keep query times reasonable. Schema evolution is painful when your event shapes change over 18 months. I'd only reach for this pattern when you have a genuine regulatory or business need for complete event history. For most applications, storing the raw webhook in an append-only log alongside your mutable state gives you 80% of the benefit at 20% of the cost.

Error Isolation

When you have multiple handlers processing webhooks, one slow or failing handler can create backpressure that affects everything else. If your email handler is stuck retrying against a down SMTP server, it shouldn't delay inventory updates or analytics tracking.

The fix is separate queues with independent retry policies. Payment processing gets aggressive retries (5 attempts, 30-second intervals) because financial accuracy matters. Email notifications get gentler retries (3 attempts, 5-minute intervals) because a delayed email is acceptable. Analytics gets fire-and-forget semantics with no retries, since missing a single data point doesn't break anything.
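Translated into BullMQ job options, those tiers might look like this. The queue names and the choice of fixed backoff are assumptions; only the attempt counts and intervals come from the policy above:

```typescript
// Per-tier retry policies, mirroring the tiers described above
const retryPolicies = {
  // Aggressive: financial accuracy matters
  payments: { attempts: 5, backoff: { type: 'fixed' as const, delay: 30_000 } },
  // Gentler: a delayed email is acceptable
  emails: { attempts: 3, backoff: { type: 'fixed' as const, delay: 300_000 } },
  // Fire-and-forget: no retries
  analytics: { attempts: 1 },
}

// Wiring (sketch): one queue per tier, each with its own workers and DLQ
// const paymentQueue = new Queue('payments', { connection, defaultJobOptions: retryPolicies.payments })
// const emailQueue   = new Queue('emails',   { connection, defaultJobOptions: retryPolicies.emails })
```

Keeping the policies in one config object makes the tradeoffs reviewable in a single diff when someone inevitably wants to tune them.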

Each queue gets its own worker pool, its own dead letter queue, and its own alerting threshold. When the email service goes down at 2 AM, your payment processing keeps running without interruption. The email dead letter queue fills up, you get an alert in the morning, and you reprocess those jobs after the email service recovers. No customer-facing impact.

Scaling Considerations

Once you've moved to queue-based processing, scaling horizontally is straightforward in theory. Spin up more workers and they'll compete for messages from the queue. In practice, ordering constraints make this harder than it sounds.

Consider this scenario: you receive order.created and order.updated for the same order within 200ms. If two workers pick these up concurrently, the update might execute before the create, causing a foreign key violation or a lost update. Three approaches handle this.

Partitioning by entity: route all events for the same order to the same partition, for example by hashing the order ID to one of N named queues, each drained by a single worker (BullMQ Pro also offers job groups for this). This preserves ordering per entity while allowing parallelism across entities.

Optimistic concurrency: let workers process in any order, but use version numbers to detect conflicts. If the update handler sees that the order doesn't exist yet, it waits and retries. This approach requires careful implementation but scales better because you don't need sticky routing.

Idempotent handlers: design every handler so that processing the same event twice (or processing events out of order) produces the correct final state. This is the gold standard but requires significant design effort upfront. Use upserts instead of inserts, store "last processed event timestamp" and skip older events, and make state transitions explicitly check preconditions.
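A minimal sketch of that timestamp guard, with an in-memory map standing in for a database upsert:

```typescript
interface OrderRecord {
  id: string
  status: string
  lastEventAt: number // timestamp of the newest event applied
}

// In-memory store for illustration; production would be an atomic DB upsert
const orderStore = new Map<string, OrderRecord>()

// Applies an event unless it is stale or a duplicate; returns whether it applied
function applyOrderEvent(id: string, status: string, occurredAt: number): boolean {
  const existing = orderStore.get(id)
  if (existing && existing.lastEventAt >= occurredAt) {
    return false // older or duplicate event: safe to ignore
  }
  orderStore.set(id, { id, status, lastEventAt: occurredAt })
  return true
}
```

With this guard in place, replaying the same webhook twice, or processing `order.updated` before `order.created`, converges on the same final state either way. In a real database, the check-and-set needs to be a single atomic statement (a conditional upsert), not a read followed by a write.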

For most teams processing under 10,000 webhooks per minute, partitioning by entity with 5-10 worker instances is the sweet spot. It's simple to reason about, gives you good throughput, and avoids the complexity of optimistic concurrency or fully idempotent designs. When you outgrow that, you'll know, because your queue depth will start climbing during peak hours and won't recover between peaks.

Choosing Your Pattern

These patterns aren't mutually exclusive. A mature webhook integration typically combines several of them: a gateway normalizes incoming events, a queue decouples receipt from processing, fan-out dispatches to multiple handlers, and error isolation prevents cascading failures. You don't need all of this on day one. Start with direct processing, move to queue-based when response times suffer, add fan-out when your handler does more than three things, and introduce a gateway when your third payment provider integration starts feeling painful. Each pattern solves a specific problem, and adding it before you have that problem just creates unnecessary complexity.