Debugging

Your Webhooks Are Failing and Nobody Knows

Most webhook failures never trigger an alert. They just vanish. Here is how to catch the silent ones before your customers do.

WebhookVault Team · 7 min read

The Webhook That Worked Fine (Until It Didn't)

Picture this. Your payment provider sends a webhook for a successful charge. Your endpoint returns 200. The money lands in your account. Everything looks perfect in the dashboard.

Except the customer's subscription status never updated. The webhook payload contained a new field your parser didn't expect, your code swallowed the error, returned 200 anyway, and moved on. The customer emails support three days later wondering why they lost access to their account despite paying.

This happens constantly.

Why 200 OK Is the Most Dangerous Response Code

Teams obsess over 500 errors. They set up PagerDuty alerts, Slack notifications, the whole incident response playbook. But the webhook failures that actually hurt are the ones returning 200.

A 200 tells the sender everything went great. No retry. No alert on their end. No second chance. If your handler caught an exception somewhere deep in the processing logic and your top-level catch block still returned a success response, that event is gone. Stripe won't resend it. GitHub won't retry. Shopify considers it delivered.

You essentially told the sender "got it, all good" while quietly dropping the event on the floor.

// This pattern kills webhooks silently
app.post('/webhooks/payment', async (req, res) => {
  try {
    const event = parseWebhookPayload(req.body)
    await processPayment(event)
    await updateSubscription(event) // throws because of new field
    await sendConfirmationEmail(event)
    res.status(200).send('ok')
  } catch (err) {
    console.error('webhook error:', err) // logged, forgotten
    res.status(200).send('ok') // WHY IS THIS STILL 200
  }
})

I've seen this exact pattern in production at three different fintech companies. The reasoning is always the same: "we don't want the provider to keep retrying and flooding us." Which, sure. But you're trading retry storms for silent data loss. That's not a good trade.

Partial Processing Is the Real Enemy

The scariest webhook failures aren't total failures. They're partial ones. Your handler does five things with an incoming event. Three succeed. Two fail. What's the correct HTTP response?

If you return 500, the sender retries the whole event. Now those three successful operations run again. Duplicate charges, duplicate emails, duplicate records. If you return 200, those two failed operations never complete.

Neither option works.

The fix is making each operation idempotent and tracking completion at the step level. Store which steps completed for each event ID. On retry, skip the ones that already succeeded.

async function handleWebhook(event) {
  const eventId = event.id
  const completed = await getCompletedSteps(eventId)

  const steps = [
    { name: 'validate_signature', fn: () => validateSignature(event) },
    { name: 'update_subscription', fn: () => updateSubscription(event) },
    { name: 'charge_invoice', fn: () => chargeInvoice(event) },
    { name: 'send_receipt', fn: () => sendReceipt(event) },
  ]

  for (const step of steps) {
    if (completed.includes(step.name)) continue

    try {
      await step.fn()
      await markStepCompleted(eventId, step.name)
    } catch (err) {
      await logFailedStep(eventId, step.name, err)
      // return 500 so the sender retries, completed steps get skipped
      throw new WebhookProcessingError(step.name, err)
    }
  }
}

More code? Yes. More database writes? Also yes. But now partial failures are recoverable instead of invisible.
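The step-tracking helpers don't have to be complicated. A minimal sketch, using an in-memory map as a stand-in; a production version would back this with a table keyed on (event ID, step name) so state survives restarts:

```javascript
// In-memory step tracker. getCompletedSteps and markStepCompleted match
// the handler above; the Map is a stand-in for a durable table keyed on
// (event_id, step_name).
const stepStore = new Map() // eventId -> Set of completed step names

async function getCompletedSteps(eventId) {
  return [...(stepStore.get(eventId) || new Set())]
}

async function markStepCompleted(eventId, stepName) {
  if (!stepStore.has(eventId)) stepStore.set(eventId, new Set())
  stepStore.get(eventId).add(stepName)
}
```

Whatever the backing store, the contract is the same: a retry for an event ID sees exactly the steps that already succeeded.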

The Logging Problem Nobody Talks About

You're logging webhook errors. Great. Where?

If the answer is "application logs," you've got maybe 72 hours before those rotate out on most setups. Webhook issues often surface days or weeks later when a customer finally notices something's off. By then your logs are gone, and you're left guessing what happened.

Webhook events need their own storage. Separate table, separate index, whatever. But they need to persist longer than your standard log retention. And they need to be queryable by event type, source, timestamp, and processing status.

CREATE TABLE webhook_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source VARCHAR(50) NOT NULL,        -- 'stripe', 'github', etc
  event_type VARCHAR(100) NOT NULL,
  event_id VARCHAR(255) NOT NULL,      -- provider's event ID
  payload JSONB NOT NULL,
  status VARCHAR(20) DEFAULT 'received', -- received, processing, completed, failed
  error_message TEXT,
  attempts INT DEFAULT 0,
  received_at TIMESTAMPTZ DEFAULT NOW(),
  processed_at TIMESTAMPTZ,
  UNIQUE(source, event_id)             -- dedup built in
);

Index on status and received_at. Run a daily query for anything stuck in 'received' or 'processing' for more than an hour. That query alone will catch failures you never knew existed.
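That stuck-event check is a single query against the table above, using the status values from the schema:

```sql
-- events stuck in an intermediate state for more than an hour
SELECT source, event_type, event_id, status, attempts, received_at
FROM webhook_events
WHERE status IN ('received', 'processing')
  AND received_at < NOW() - INTERVAL '1 hour'
ORDER BY received_at;
```

Anything this returns is an event the sender believes was delivered and your system never finished.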

Timeout Failures Are Sneaky

Most webhook senders give you somewhere between 5 and 30 seconds to respond before marking the delivery as failed. Stripe treats a slow response as a failed delivery and retries. GitHub times out deliveries after 10 seconds. Shopify cuts you off at 5.

Five seconds.

If your handler does any external API call, any database write that hits a slow query, any third-party service lookup, you can easily blow past 5 seconds. The sender sees a timeout, marks it as failed, schedules a retry. Your handler might still be running. Now you've got the original request completing in the background AND a retry coming in.

The pattern that works: accept fast, process later. Return 200 immediately after validating the signature and storing the raw payload. Process everything else asynchronously.

app.post('/webhooks/shopify', async (req, res) => {
  // step 1: validate (fast, no external calls)
  if (!verifyShopifyHmac(req)) {
    return res.status(401).send('invalid signature')
  }

  // step 2: store raw event (single DB write); if this fails, return 500
  // so the sender retries — losing the raw payload means losing the event
  try {
    await storeRawEvent('shopify', req.body)
  } catch (err) {
    return res.status(500).send('storage failed')
  }

  // step 3: acknowledge immediately
  res.status(200).send('accepted')

  // step 4: actual processing happens via queue/worker
  // NOT here, not in this request lifecycle
})

Your background worker picks up events from the store and processes them with proper retry logic, error handling, and no timeout pressure. This is how every high-volume webhook consumer I've worked with eventually ends up doing it, usually after getting burned by timeouts first.
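What storeRawEvent does is worth spelling out, because it's where the dedup from the schema earlier pays off. A sketch, assuming a node-postgres-style client with a query(text, values) method; ON CONFLICT DO NOTHING makes redelivered events a no-op via the UNIQUE(source, event_id) constraint:

```javascript
// Sketch of storeRawEvent against the webhook_events table shown earlier.
// `db` is assumed to be a node-postgres-style client (query returns { rows }).
let db // injected at startup, e.g. a pg Pool

async function storeRawEvent(source, payload) {
  const { rows } = await db.query(
    `INSERT INTO webhook_events (source, event_type, event_id, payload)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (source, event_id) DO NOTHING
     RETURNING id`,
    [source, payload.type, payload.id, JSON.stringify(payload)]
  )
  // empty RETURNING set means the event was already stored (duplicate delivery)
  return rows.length > 0
}
```

The return value tells the caller whether this delivery was new, so redeliveries never enqueue a second round of processing.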

Monitoring That Actually Catches Things

Alerts on error rates are table stakes. You also need to monitor for the absence of events.

If Stripe normally sends you 50 to 100 webhook events per hour and suddenly sends zero for two hours, something is broken. Maybe on their end. Maybe your endpoint is unreachable and they've started queueing retries. Maybe someone deployed a firewall rule that blocks their IPs.

Set up a dead man's switch. If you haven't received any webhooks from a critical source in X minutes, alert. The threshold depends on your volume, but even low-traffic systems should flag a 24-hour gap.

// simple check, run it every 15 minutes
// hours without events before alerting; example values, tune per source
const THRESHOLDS = { stripe: 2, github: 6, shopify: 2 }

async function checkWebhookHealth() {
  const sources = ['stripe', 'github', 'shopify']

  for (const source of sources) {
    // parameterized query; interpolating source into the string invites injection
    const { rows } = await db.query(
      'SELECT MAX(received_at) AS last_seen FROM webhook_events WHERE source = $1',
      [source]
    )

    const lastSeen = rows[0].last_seen
    if (!lastSeen) continue // no events recorded yet for this source

    const hoursSinceLastEvent =
      (Date.now() - new Date(lastSeen).getTime()) / 3600000

    if (hoursSinceLastEvent > THRESHOLDS[source]) {
      await alert(`No webhooks from ${source} in ${hoursSinceLastEvent.toFixed(1)} hours`)
    }
  }
}

Combine this with response code monitoring and processing duration tracking. If your average processing time jumps from 200ms to 2 seconds, you want to know before timeouts start hitting.

Stop Trusting Your Happy Path

Most teams test webhook handling by sending a well-formed event and confirming it processes correctly. That covers maybe 20% of what can go wrong.

Test with malformed payloads. Test with missing fields. Test with duplicate event IDs. Test with events arriving out of order. Send an invoice.paid before invoice.created and see what your system does. Send the same event three times in rapid succession.
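The duplicate-delivery case doesn't need the provider at all: run the same event through your handler twice and assert the side effect happened once. A minimal sketch, where handleEvent and the charges array are hypothetical stand-ins for your handler and its side effects:

```javascript
// Stand-in handler with an idempotency guard keyed on the provider's
// event ID. `charges` represents a real side effect (a DB write, an email).
const processedEventIds = new Set()
const charges = []

async function handleEvent(event) {
  if (processedEventIds.has(event.id)) return // already handled: no-op
  processedEventIds.add(event.id)
  charges.push(event.amount)
}
```

Send the same event twice and assert charges.length is 1; the same harness extends to out-of-order and malformed payloads.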

The webhook providers will do all of these things to you in production. Not because they're broken, but because distributed systems are inherently messy. Events get redelivered. Order isn't guaranteed. Payloads evolve without warning.

Build your error handling for the messy reality, not the clean documentation. Your future self will thank you at 2 AM when something weird happens and your system just handles it instead of paging you.