
Debugging Webhook Failures: A Systematic Approach to Finding What Went Wrong

When webhooks break in production, you need a system for finding the root cause fast. A step-by-step methodology for debugging delivery, parsing, auth, and timing issues.

WebhookVault Team · 8 min read

Webhooks Fail Silently, and That Is the Problem

A REST API call fails and you get an error response in your terminal. A database query throws an exception and your error tracker captures it. But a webhook? It fails somewhere between the sender and your handler, and nobody tells you. Orders go unprocessed. Subscriptions stay active after cancellation. Data drifts out of sync. By the time someone notices, hours have passed.

I have spent more time debugging webhook failures than I care to admit, and the single biggest lesson is this: do not guess. Work through the layers systematically, from the outside in. Most teams waste time staring at application code when the webhook never even reached their server.

The Debugging Checklist: Outside In

Every webhook delivery crosses at least 5 layers before your application code runs: DNS resolution, TLS handshake, network routing (firewalls, load balancers), HTTP parsing, and finally your handler. A failure at any layer looks roughly the same from the sender's perspective: a timeout or a non-200 status code.

The mental framework I prefer is to start at the outermost layer and verify each one before moving inward. If you skip ahead to inspecting your signature verification logic when the real problem is an expired TLS certificate, you will waste an afternoon. Verify layer by layer: network, TLS, delivery, parsing, authentication, then application logic. Each layer takes 2-3 minutes to check. The whole process rarely takes more than 15 minutes when you follow the order.

Is It Even Arriving?

The first question is always: did the HTTP request reach your server at all? Check your access logs. If you are behind Cloudflare, AWS ALB, or nginx, check those logs first since your application will never see a request that gets dropped at the edge. Most cloud providers expose access logs through their dashboards or via CLI tools.

If you see no evidence of the request in any log, the problem is upstream. Check your DNS records, verify the webhook URL is correct in the provider's dashboard, and confirm your firewall rules allow inbound traffic on the correct port. A surprising number of webhook failures come down to a misconfigured security group or a DNS record pointing to a stale IP after a migration.

The fastest way to verify your endpoint is reachable is to hit it yourself:

curl -X POST https://your-domain.com/api/webhooks \
  -H "Content-Type: application/json" \
  -d '{"test": true}' \
  -v \
  --connect-timeout 5

The -v flag gives you the full connection trace: DNS resolution, TLS handshake, headers sent and received. If this times out, your application code is not the problem.

SSL/TLS Issues

TLS problems are sneaky because they often work fine in your browser but fail for automated HTTP clients. An expired certificate, a self-signed cert in production, or a TLS 1.0 endpoint when the sender requires 1.2 minimum will all result in connection failures before any HTTP data is exchanged. The sender sees a connection reset or handshake failure and logs a vague error.

To diagnose, use openssl to inspect the certificate chain directly:

openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | openssl x509 -noout -dates -subject

This shows the certificate's validity dates and subject. If the notAfter date is in the past, you have found your problem. I have seen teams debug signature verification for days when the real issue was a Let's Encrypt certificate that auto-renewed into a different path and the reverse proxy was still serving the old one. Check TLS first. It takes 10 seconds.

Payload Parsing Failures

The request arrived, TLS is fine, but your handler throws an error or returns a 400. The most common cause at this layer is a mismatch between what the sender sends and what your framework expects. Three patterns come up repeatedly: wrong Content-Type header, body size limits, and malformed JSON.

Some webhook providers send JSON with Content-Type: application/x-www-form-urlencoded. Others send form-encoded data when you expect JSON. Your framework's body parser silently produces an empty object or a garbled string, and your code fails downstream with a confusing "cannot read property of undefined" error.

The fix is to preserve the raw body before any parsing occurs. In Express, this middleware captures the raw bytes regardless of content type:

import express from 'express'

const app = express()

// Capture raw body for ALL content types
app.use(express.json({
  limit: '5mb',
  verify: (req: express.Request, _res, buf: Buffer) => {
    (req as any).rawBody = buf
  }
}))

app.use(express.raw({
  type: '*/*',
  limit: '5mb',
  verify: (req: express.Request, _res, buf: Buffer) => {
    if (!(req as any).rawBody) {
      (req as any).rawBody = buf
    }
  }
}))

app.post('/webhooks', (req, res) => {
  const rawBody = (req as any).rawBody as Buffer
  console.log('Content-Type:', req.headers['content-type'])
  console.log('Raw body length:', rawBody?.length ?? 0)
  console.log('Parsed body type:', typeof req.body)

  // Now you can inspect both the raw and parsed versions
  res.status(200).json({ received: true })
})

A common mistake is setting the body size limit too low. The default in Express is 100KB, which sounds generous until a provider sends an event with a nested object containing line items, metadata, and embedded resources. Stripe's invoice.payment_succeeded events regularly exceed 100KB for invoices with many line items. Set your limit to at least 1MB, preferably 5MB, and add monitoring for oversized payloads.

Authentication Mismatches

Signature verification failures are the single most common webhook debugging issue I have seen in production. The symptoms are clear: you compute an HMAC, compare it to the header, and they do not match. But the root cause is rarely "wrong secret." More often, it is body mutation.

Here is what happens: your framework parses the JSON body, then you serialize it back to a string for HMAC computation. But JSON.stringify(JSON.parse(original)) does not produce the same string as the original. Whitespace changes, key ordering can shift, and Unicode escaping may differ. The HMAC is computed on the exact bytes the sender transmitted, not on your re-serialized version.
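You can demonstrate the mutation in a few lines. The secret here is a made-up example value; the point is that the round-tripped string loses the original whitespace, so the HMACs diverge:

```typescript
import crypto from 'crypto'

// The sender transmitted these exact bytes (note the extra whitespace)
const original = '{"amount": 4999,  "currency": "usd"}'

// Parsing and re-serializing produces a different string
const roundTripped = JSON.stringify(JSON.parse(original))

const hmac = (body: string) =>
  crypto.createHmac('sha256', 'whsec_example').update(body).digest('hex')

console.log(roundTripped)                          // {"amount":4999,"currency":"usd"}
console.log(hmac(original) === hmac(roundTripped)) // false
```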

This is why the raw body middleware from the previous section matters. Always compute signatures against the raw request bytes. When debugging a signature mismatch, add verbose logging that shows exactly what is being compared:

import crypto from 'crypto'
import type { Request, Response, NextFunction } from 'express'

function webhookSignatureDebug(secret: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const rawBody = (req as any).rawBody as Buffer
    const receivedSig = req.headers['x-webhook-signature'] as string

    if (!rawBody || !receivedSig) {
      console.error('Missing raw body or signature header', {
        hasRawBody: !!rawBody,
        signatureHeader: receivedSig ? '[present]' : '[missing]',
        headers: Object.keys(req.headers).filter(h => h.startsWith('x-'))
      })
      return res.status(401).json({ error: 'Missing signature' })
    }

    const computed = crypto
      .createHmac('sha256', secret)
      .update(rawBody)
      .digest('hex')

    const computedBuf = Buffer.from(computed, 'hex')
    const receivedBuf = Buffer.from(receivedSig, 'hex')

    // timingSafeEqual throws if the buffers differ in length (e.g. a truncated
    // or non-hex header), so check the length before comparing
    const isValid =
      computedBuf.length === receivedBuf.length &&
      crypto.timingSafeEqual(computedBuf, receivedBuf)

    if (!isValid) {
      console.error('Signature mismatch', {
        computedPrefix: computed.substring(0, 12) + '...',
        receivedPrefix: receivedSig.substring(0, 12) + '...',
        bodyLength: rawBody.length,
        bodyHash: crypto.createHash('sha256').update(rawBody).digest('hex').substring(0, 16),
        encoding: req.headers['content-encoding'] ?? 'none'
      })
      return res.status(401).json({ error: 'Invalid signature' })
    }

    next()
  }
}

Notice that the logging never outputs the full signature or the secret. It logs just enough (the first 12 characters, the body hash, the body length) to identify the mismatch pattern without exposing sensitive data. Other common causes of signature failures: the provider includes a timestamp prefix in the signed content (Stripe does timestamp.body), encoding differences between UTF-8 and Latin-1, and webhook secret rotation where the old secret is still being used by the provider.
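For the timestamp-prefix case, the fix is to sign the concatenation rather than the body alone. This is a hedged sketch of the general scheme; the function name and secret are illustrative, and real providers document their exact format (Stripe, for instance, signs timestamp and payload joined by a period):

```typescript
import crypto from 'crypto'

// Sign "<timestamp>.<raw body>" rather than the body alone
function signTimestamped(rawBody: Buffer, timestampSec: number, secret: string): string {
  const signedPayload = `${timestampSec}.${rawBody.toString('utf-8')}`
  return crypto.createHmac('sha256', secret).update(signedPayload).digest('hex')
}

// Verifying against the body alone will never match a timestamped signature
const body = Buffer.from('{"test":true}')
const withTimestamp = signTimestamped(body, 1706100000, 'whsec_example')
const bodyOnly = crypto.createHmac('sha256', 'whsec_example').update(body).digest('hex')
console.log(withTimestamp === bodyOnly) // false
```

If your computed signature never matches no matter what you try, check the provider's docs for a timestamp or version prefix before suspecting the secret.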

Race Conditions and Ordering

Webhook providers do not guarantee delivery order. A customer.subscription.updated event can arrive before the customer.created event that it references. If your handler tries to look up a customer that does not exist yet, it fails. This happens more often than you might expect, especially under load when the provider dispatches events from different workers.

The pattern I prefer for handling out-of-order events is a combination of idempotency checks and deferred retries. When a referenced resource does not exist yet, stash the event and retry after a short delay:

import type { WebhookEvent } from './types'

interface DeferredEvent {
  event: WebhookEvent
  attempts: number
  nextRetry: number
}

const deferred: Map<string, DeferredEvent> = new Map()
const MAX_ATTEMPTS = 5
const RETRY_DELAY_MS = 3000

async function handleEvent(event: WebhookEvent): Promise<boolean> {
  // Check idempotency: skip if already processed
  const existing = await db.webhookEvents.findByProviderId(event.id)
  if (existing) return true

  try {
    await processEvent(event)
    await db.webhookEvents.markProcessed(event.id)
    return true
  } catch (err) {
    if (err instanceof ResourceNotFoundError) {
      // Referenced resource doesn't exist yet, defer and retry
      const key = event.id
      const prev = deferred.get(key)
      const attempts = (prev?.attempts ?? 0) + 1

      if (attempts >= MAX_ATTEMPTS) {
        console.error('Max retries exceeded for deferred event', {
          eventId: event.id,
          eventType: event.type
        })
        return false
      }

      deferred.set(key, {
        event,
        attempts,
        nextRetry: Date.now() + RETRY_DELAY_MS * attempts
      })
      return true // Acknowledge so the provider doesn't retry
    }
    throw err
  }
}

// Periodically re-attempt deferred events whose delay has elapsed
setInterval(() => {
  const now = Date.now()
  for (const [key, entry] of deferred) {
    if (entry.nextRetry <= now) {
      deferred.delete(key)
      void handleEvent(entry.event)
    }
  }
}, 1000)

The key detail is returning a 200 status even for deferred events. If you return a 4xx or 5xx, the provider will retry and you end up processing the same out-of-order event repeatedly. Acknowledge receipt, handle it asynchronously, and let your retry logic sort out the ordering. In production, replace the in-memory map with a durable queue like SQS or a database-backed job table.

Timeout Issues

Most webhook senders wait between 5 and 30 seconds for a response. Stripe waits 20 seconds. GitHub waits 10 seconds. If your handler takes longer than that, the sender marks the delivery as failed and schedules a retry, even though your handler is still running. Now you have a handler that completes successfully but the sender keeps retrying, leading to duplicate processing.

The fix is the same pattern used by every high-volume webhook consumer: acknowledge immediately, process later.

import type { Request, Response } from 'express'

app.post('/webhooks', async (req: Request, res: Response) => {
  const rawBody = (req as any).rawBody as Buffer
  const signature = req.headers['x-webhook-signature'] as string

  // Step 1: Verify signature (fast, <5ms)
  if (!verifySignature(rawBody, signature)) {
    return res.status(401).json({ error: 'Invalid signature' })
  }

  // Step 2: Store the raw event (fast, single DB write)
  const eventId = await db.webhookEvents.insert({
    payload: rawBody.toString('utf-8'),
    headers: JSON.stringify(req.headers),
    receivedAt: new Date(),
    status: 'pending'
  })

  // Step 3: Acknowledge immediately
  res.status(200).json({ received: true, eventId })

  // Step 4: Queue for async processing
  await queue.publish('webhook.process', { eventId })
})

This pattern keeps your response time under 50ms regardless of how long the actual processing takes. The tricky part is error handling in the async processor: you need dead-letter queues, retry limits, and alerting for events stuck in "pending" state. But that is a separate concern from the webhook endpoint itself.
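The retry policy for that async processor is worth pinning down explicitly. A minimal sketch, with all constants illustrative: exponential backoff with a cap, and a dead-letter decision once the attempt budget is exhausted:

```typescript
// Illustrative retry policy for the async webhook processor
const MAX_ATTEMPTS = 5
const BASE_DELAY_MS = 2_000
const MAX_DELAY_MS = 60_000

// Returns the delay before the next attempt,
// or null to move the event to a dead-letter state
function nextRetryDelay(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}
```

A null return is the alerting trigger: the event stays stored with its raw payload, so once you fix the underlying bug you can replay it rather than losing it.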

A Logging Strategy That Actually Helps

Most webhook logging falls into two extremes: either teams log nothing (and are blind when things fail) or they log everything (and cannot find the signal in the noise). The right approach is structured logging with a clear schema that makes webhook events searchable and debuggable without leaking sensitive data.

Here is what to log for every webhook request: a request ID (generate one if the sender does not provide it), the HTTP method, the event type from the payload, a SHA-256 hash of the body (so you can correlate with the sender's logs without storing the payload in logs), the response status code, and the processing duration in milliseconds. Here is what to never log: webhook signing secrets, full authorization headers, personally identifiable information from the payload, and raw credit card or bank account data.

import crypto from 'crypto'
import type { Request, Response, NextFunction } from 'express'

function webhookLogger(req: Request, res: Response, next: NextFunction) {
  const startTime = process.hrtime.bigint()
  const requestId = (req.headers['x-request-id'] as string) ?? crypto.randomUUID()

  // Capture response status
  const originalSend = res.send.bind(res)
  res.send = function (body: any) {
    const durationMs = Number(process.hrtime.bigint() - startTime) / 1_000_000

    console.log(JSON.stringify({
      type: 'webhook_request',
      requestId,
      method: req.method,
      path: req.path,
      contentType: req.headers['content-type'],
      bodyHash: (req as any).rawBody
        ? crypto.createHash('sha256').update((req as any).rawBody).digest('hex')
        : null,
      bodySize: (req as any).rawBody?.length ?? 0,
      statusCode: res.statusCode,
      durationMs: Math.round(durationMs * 100) / 100,
      timestamp: new Date().toISOString()
    }))

    return originalSend(body)
  }

  next()
}

I prefer JSON-structured logs over plain text because they are trivially searchable in any log aggregation tool. When a webhook fails, you search by body hash to find the exact request, check the duration to see if it timed out, and inspect the status code to narrow down the failure layer. This single middleware has saved me more debugging time than any other technique.

Local Replay for Debugging

Once you have captured a failed webhook's headers and body (from your logs or a webhook inspection tool), replay it locally. This is the fastest path from "broken in production" to "reproduced on my machine." Grab the original headers and body and replay with cURL:

# Replay with original headers and body
curl -X POST http://localhost:3000/api/webhooks \
  -H "Content-Type: application/json" \
  -H "X-Webhook-Signature: abc123def456..." \
  -H "X-Webhook-Timestamp: 1706100000" \
  -d @failed-webhook-body.json \
  -v

# Or inline the body
curl -X POST http://localhost:3000/api/webhooks \
  -H "Content-Type: application/json" \
  -d '{"type":"payment.completed","data":{"id":"pay_123","amount":4999}}' \
  -v

For signature-verified endpoints, you will need to either temporarily disable verification in your local environment or recompute the signature with your test secret. I prefer having a SKIP_WEBHOOK_VERIFICATION environment variable that is only set in development. Never deploy this to production.
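A safe way to wire up that escape hatch is to require both conditions at once, so the flag can never take effect in production. A minimal sketch using the article's suggested variable name:

```typescript
// Verification is skipped only when BOTH conditions hold:
// we are not in production AND the flag is explicitly set
function shouldVerifySignature(env: Record<string, string | undefined>): boolean {
  const isProduction = env.NODE_ENV === 'production'
  const skipRequested = env.SKIP_WEBHOOK_VERIFICATION === 'true'
  return isProduction || !skipRequested
}
```

Even if the flag leaks into a production deploy, verification still runs.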

The replay workflow completes the debugging loop: observe the failure in production, capture the payload, reproduce locally, fix the code, and verify the fix by replaying the same payload. No guesswork, no "it works on my machine" uncertainty. The same bytes that caused the failure are the bytes you use to confirm the fix.

Putting It Together

Webhook debugging does not have to feel like archaeology. With a layered approach (network, TLS, parsing, auth, application logic), structured logging, and a replay workflow, most failures can be diagnosed in under 15 minutes. The investment in raw body preservation, structured logging, and async processing pays for itself the first time a payment webhook fails at 2 AM and you can trace the exact cause from your phone without SSH-ing into a server.

Build these patterns into your webhook infrastructure from the start. Retrofitting observability onto a webhook handler that was written as a quick 20-line endpoint is painful. Starting with the right foundation means the difference between "webhook debugging is a nightmare" and "webhook debugging is a 10-minute routine."