Debugging

When Webhooks Return 200 OK But Nothing Happens

Silent webhook failures are the worst kind. Your endpoint returns success, logs look clean, but business logic never runs. Here's how to catch what your monitoring misses.

WebhookVault Team · 7 min read

The Webhook That Works... Sort Of

Your webhook endpoint returns 200. The provider's dashboard shows "delivered." Your access logs show the request arrived. But the customer's subscription never activated. The order never shipped. The user never got notified.

This is the nightmare scenario. Not an error that gets logged and alerted on, but a silent success that masks a business logic failure. I have seen teams lose thousands in revenue because a webhook handler acknowledged every request but stopped processing them after a database migration changed a column name. The logs showed green across the board for three days.

Why Silent Failures Happen

Most webhook handlers follow this pattern: receive request, verify signature, parse body, return 200 immediately, queue for async processing. That's the right architecture for keeping response times low. But it creates a gap between "HTTP request succeeded" and "business logic executed."
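That pattern can be sketched in a few lines. This is illustrative, not any framework's API: `verifySignature` uses a standard HMAC-SHA256 scheme (many providers, like Stripe and GitHub, use a variant of this), and the in-memory array stands in for a durable queue such as SQS or BullMQ.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// In-memory stand-in for a real job queue (SQS, BullMQ, etc.)
const queue: Array<{ id: string; body: string }> = []

function verifySignature(body: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(body).digest('hex')
  const a = Buffer.from(expected)
  const b = Buffer.from(signature)
  // timingSafeEqual throws on length mismatch, so guard first
  return a.length === b.length && timingSafeEqual(a, b)
}

// Returns the HTTP status the endpoint would respond with.
// Note: a 200 here only means "received and queued", not "processed".
function handleWebhook(id: string, body: string, signature: string, secret: string): number {
  if (!verifySignature(body, signature, secret)) return 401
  queue.push({ id, body }) // acknowledge first, process later
  return 200
}
```

The comment on the return value is the whole point of this post: once the 200 goes out, nothing about the response tells the sender whether the queued work ever ran.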

Here's where things go wrong. Your queue processor throws an exception parsing a field that used to be a string but is now an object. Or it tries to insert a record with a foreign key that doesn't exist yet because of event ordering issues. Or it hits a rate limit on a downstream API. The webhook endpoint has already returned 200, so the sender thinks everything is fine. Your error tracker catches the exception in the background worker, but nobody connects it back to the webhook delivery.

The other common culprit is conditional logic that silently skips processing. Something like this:

async function processWebhook(event) {
  const user = await db.users.findById(event.data.userId)
  
  if (!user || !user.isActive) {
    // Skip processing for inactive users
    return { processed: false }
  }
  
  // ... rest of handler
}

Looks reasonable. But then a bug introduces a typo in the user lookup query, so user is always null. Or a migration changes isActive to status === 'active' and nobody updates this code. Every webhook returns success, zero errors get logged, and nothing happens.

What Traditional Monitoring Misses

Your APM tool shows webhook requests succeeding. Your error tracker shows no exceptions. Your uptime monitor sees 200 responses. Everything looks healthy. But your business metrics tell a different story: subscriptions created per day dropped by 40%, fulfillment SLA violations spiked, support tickets about "missing confirmations" tripled.

The disconnect is that most monitoring tools measure infrastructure health, not business outcomes. They tell you whether your code ran, not whether it did the right thing. A webhook handler that returns 200 but fails to create the database record is technically successful from an HTTP perspective.

Business Outcome Monitoring

The fix is to track business-level outcomes alongside technical metrics. For every webhook event type, define what "success" actually means in terms of side effects, then monitor whether those side effects happen.

Example: for a subscription.created webhook, success means:

  • A record appears in the subscriptions table within 30 seconds
  • The user receives a confirmation email within 60 seconds
  • The subscription shows as active in the admin dashboard

Your monitoring should verify all three. If the webhook endpoint returns 200 but no subscription record gets created, that's a failure even if no exception was thrown.
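One minimal way to express that kind of verification is a generic checker that polls each expected side effect until its deadline, assuming you can phrase every side effect as an async predicate. The names and shape here are a sketch, not a library API:

```typescript
interface OutcomeCheck {
  name: string
  deadlineMs: number
  check: () => Promise<boolean>   // e.g. "does the subscription row exist?"
}

// Polls each check until it passes or its deadline expires.
// Returns the names of the side effects that never materialized.
async function verifyOutcomes(checks: OutcomeCheck[], intervalMs = 1000): Promise<string[]> {
  const failed: string[] = []
  for (const { name, deadlineMs, check } of checks) {
    const start = Date.now()
    let ok = await check()
    while (!ok && Date.now() - start < deadlineMs) {
      await new Promise(resolve => setTimeout(resolve, intervalMs))
      ok = await check()
    }
    if (!ok) failed.push(name)
  }
  return failed
}
```

For the subscription.created example, you would pass one entry per side effect: a predicate that queries the subscriptions table with a 30-second deadline, and one that checks the email provider's send log with a 60-second deadline. A non-empty return value is a business-level failure regardless of what HTTP status the endpoint returned.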

The implementation looks like this: after processing a webhook event, emit a structured event that captures both the HTTP result and the business outcome. Store it somewhere queryable (a database table, a metrics backend like Prometheus, or a log aggregator).

import type { WebhookEvent, ProcessingResult } from './types'

interface WebhookOutcome {
  eventId: string
  eventType: string
  receivedAt: Date
  httpStatus: number
  processingStatus: 'success' | 'failure' | 'skipped' | 'pending'
  businessOutcome?: {
    recordCreated: boolean
    notificationSent: boolean
    downstreamApiCalled: boolean
  }
  failureReason?: string
  processingDurationMs: number
}

async function trackWebhookOutcome(
  event: WebhookEvent,
  httpStatus: number,
  result: ProcessingResult
): Promise<void> {
  const outcome: WebhookOutcome = {
    eventId: event.id,
    eventType: event.type,
    receivedAt: new Date(),
    httpStatus,
    processingStatus: result.status,
    businessOutcome: result.businessOutcome,
    failureReason: result.error?.message,
    processingDurationMs: result.durationMs
  }

  await db.webhookOutcomes.insert(outcome)
  
  // Also emit as metric for real-time dashboards
  metrics.increment('webhook.processed', {
    eventType: event.type,
    status: result.status,
    httpStatus: httpStatus.toString()
  })
  
  if (result.status === 'failure') {
    metrics.increment('webhook.business_failure', {
      eventType: event.type,
      reason: result.error?.code ?? 'unknown'
    })
  }
}

Now you can query for discrepancies. Show me all events where httpStatus = 200 but processingStatus = 'failure'. Or events where businessOutcome.recordCreated = false. Or events that are stuck in pending status for more than 5 minutes.
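If the outcomes come back from an API or live in memory rather than SQL, the same discrepancy filters are a few lines of TypeScript. The `Outcome` shape here is a pared-down version of the interface above, for illustration:

```typescript
interface Outcome {
  eventId: string
  httpStatus: number
  processingStatus: 'success' | 'failure' | 'skipped' | 'pending'
  receivedAt: Date
}

// Events acknowledged with 200 whose background processing failed
function silentFailures(outcomes: Outcome[]): Outcome[] {
  return outcomes.filter(o => o.httpStatus === 200 && o.processingStatus === 'failure')
}

// Events still pending after maxAgeMs — usually a dead worker or stuck queue
function stuckPending(outcomes: Outcome[], maxAgeMs: number, now = Date.now()): Outcome[] {
  return outcomes.filter(o =>
    o.processingStatus === 'pending' && now - o.receivedAt.getTime() > maxAgeMs
  )
}
```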

End-to-End Verification with Synthetic Events

Even with business outcome tracking, there's a gap: you only know about failures for events that actually arrive. If the webhook provider's dispatcher breaks and stops sending events entirely, you have no way to detect it until customers complain.

The answer is synthetic monitoring. Every hour, trigger a test event on the sender's side (many providers have APIs for this), then verify that your system processes it correctly within a reasonable timeout. If the synthetic event doesn't result in the expected database record or API call, alert immediately.

import axios from 'axios'

interface SyntheticCheckResult {
  timestamp: Date
  eventType: string
  eventId: string
  delivered: boolean
  processed: boolean
  durationMs: number
  failureReason?: string
}

async function runSyntheticWebhookCheck(
  eventType: string
): Promise<SyntheticCheckResult> {
  const startTime = Date.now()
  const syntheticId = `synthetic_${Date.now()}_${Math.random().toString(36).slice(2)}`
  
  try {
    // Step 1: Trigger test event via provider's API
    await axios.post('https://api.provider.com/webhooks/trigger', {
      eventType,
      testMode: true,
      customId: syntheticId
    }, {
      headers: { 'Authorization': `Bearer ${process.env.PROVIDER_API_KEY}` }
    })

    // Step 2: Wait for processing (with timeout)
    const maxWaitMs = 30000
    const checkIntervalMs = 1000
    let elapsed = 0
    let processed = false

    while (elapsed < maxWaitMs) {
      await new Promise(resolve => setTimeout(resolve, checkIntervalMs))
      elapsed += checkIntervalMs

      const outcome = await db.webhookOutcomes.findByEventId(syntheticId)
      if (outcome && outcome.processingStatus === 'success') {
        processed = true
        break
      }
    }

    const durationMs = Date.now() - startTime

    return {
      timestamp: new Date(),
      eventType,
      eventId: syntheticId,
      delivered: true,
      processed,
      durationMs,
      failureReason: processed ? undefined : 'Processing timeout'
    }
  } catch (error) {
    return {
      timestamp: new Date(),
      eventType,
      eventId: syntheticId,
      delivered: false,
      processed: false,
      durationMs: Date.now() - startTime,
      failureReason: (error as Error).message
    }
  }
}

// Run every hour via cron
async function runAllSyntheticChecks() {
  const criticalEventTypes = [
    'subscription.created',
    'payment.succeeded',
    'user.signup'
  ]

  const results = await Promise.all(
    criticalEventTypes.map(type => runSyntheticWebhookCheck(type))
  )

  const failures = results.filter(r => !r.processed)
  
  if (failures.length > 0) {
    await alerting.sendPage({
      severity: 'high',
      title: 'Webhook synthetic check failures',
      body: `${failures.length} webhook types failing end-to-end verification`,
      details: failures
    })
  }
}

The key is running these checks in production against the real webhook flow, not in a separate test environment. You want to verify that the entire pipeline works: DNS, TLS, routing, authentication, queuing, processing, database writes. A test environment can't catch production-specific failures like a misconfigured firewall rule or a database that ran out of connections.

Reconciliation: The Nuclear Option

Even with synthetic checks and business outcome tracking, events can slip through the cracks. Network partitions, transient database errors, out-of-order delivery, race conditions in your processing logic. For critical data, you need a reconciliation layer that periodically verifies your database matches the source of truth.

The pattern is simple: every few hours, fetch a list of entities from the provider's API and compare against your local database. For any mismatches, either backfill the missing data or flag for manual review.

interface ReconciliationReport {
  timestamp: Date
  resourceType: string
  totalRemote: number
  totalLocal: number
  missing: string[]
  unexpected: string[]
  reconciled: number
}

async function reconcileSubscriptions(): Promise<ReconciliationReport> {
  // Fetch all active subscriptions from provider
  const remoteSubscriptions = await fetchAllSubscriptionsFromProvider()
  const remoteIds = new Set(remoteSubscriptions.map(s => s.id))

  // Fetch all subscriptions from local DB
  const localSubscriptions = await db.subscriptions.findAll({ status: 'active' })
  const localIds = new Set(localSubscriptions.map(s => s.providerId))

  // Find discrepancies
  const missing = Array.from(remoteIds).filter(id => !localIds.has(id))
  const unexpected = Array.from(localIds).filter(id => !remoteIds.has(id))

  // Backfill missing subscriptions
  let reconciled = 0
  for (const missingId of missing) {
    const remoteSub = remoteSubscriptions.find(s => s.id === missingId)
    if (!remoteSub) continue

    await db.subscriptions.create({
      providerId: remoteSub.id,
      userId: remoteSub.customerId,
      status: remoteSub.status,
      planId: remoteSub.planId,
      createdAt: new Date(remoteSub.created * 1000),
      reconciledAt: new Date()
    })
    reconciled++
  }

  const report: ReconciliationReport = {
    timestamp: new Date(),
    resourceType: 'subscriptions',
    totalRemote: remoteIds.size,
    totalLocal: localIds.size,
    missing,
    unexpected,
    reconciled
  }

  // Store report for audit trail
  await db.reconciliationReports.insert(report)

  // Alert if significant drift
  if (missing.length > 5 || unexpected.length > 5) {
    await alerting.notify({
      title: 'Subscription reconciliation drift detected',
      body: `Missing: ${missing.length}, Unexpected: ${unexpected.length}`,
      severity: 'medium'
    })
  }

  return report
}

Reconciliation is expensive, both in API calls and compute time. Run it during off-peak hours, rate-limit your API requests, and only reconcile critical resources. For a payment provider, reconcile subscriptions and payments. For a CRM, reconcile contacts and deals. For an e-commerce platform, reconcile orders and inventory.
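For the rate-limiting piece, serializing the provider calls with a pause between them is often enough. A minimal sketch; the delay value would come from the provider's documented limits:

```typescript
// Runs fn over items one at a time, pausing delayMs between calls,
// so a reconciliation sweep stays under the provider's rate limit.
async function throttledMap<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  delayMs: number
): Promise<R[]> {
  const results: R[] = []
  for (let i = 0; i < items.length; i++) {
    results.push(await fn(items[i]))
    if (i < items.length - 1) {
      await new Promise(resolve => setTimeout(resolve, delayMs))
    }
  }
  return results
}
```

A deliberately dumb design: no concurrency, no token bucket. For a job that runs a few times a day during off-peak hours, predictable slowness beats clever parallelism that trips a 429.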

The goal isn't to reconcile in real-time (that defeats the point of webhooks). It's to have a safety net that catches drift over hours or days, before it compounds into a data integrity disaster.

Alert Fatigue vs. Blind Spots

The danger with all this monitoring is alert fatigue. If you alert on every failed webhook, you'll drown in noise. Payment webhooks fail all the time due to transient network issues, and the retry mechanism handles it. But if you don't alert on failures, you miss the silent ones that never recover.

The heuristic I use: alert on patterns, not individual failures. One failed webhook is noise. Five consecutive failures for the same event type is a pattern. A 10% drop in successful processing rate over an hour is a pattern. A synthetic check failing twice in a row is a pattern.

interface AlertThresholds {
  consecutiveFailures: number
  failureRatePercent: number
  syntheticCheckFailures: number
  reconciliationDrift: number
}

const thresholds: AlertThresholds = {
  consecutiveFailures: 5,
  failureRatePercent: 10,
  syntheticCheckFailures: 2,
  reconciliationDrift: 10
}

async function evaluateAlertConditions() {
  const now = Date.now()
  const oneHourAgo = now - 3600000

  // Check consecutive failures per event type
  const recentOutcomes = await db.webhookOutcomes.findAll({
    createdAt: { gte: new Date(oneHourAgo) }
  })

  const byEventType = groupBy(recentOutcomes, o => o.eventType)

  for (const [eventType, outcomes] of Object.entries(byEventType)) {
    const recent = outcomes.slice(-thresholds.consecutiveFailures)
    if (recent.length >= thresholds.consecutiveFailures &&
        recent.every(o => o.processingStatus === 'failure')) {
      await alerting.sendPage({
        title: `Webhook failures: ${eventType}`,
        body: `${thresholds.consecutiveFailures} consecutive failures detected`,
        severity: 'high'
      })
    }
  }

  // Check failure rate
  const totalProcessed = recentOutcomes.length
  const totalFailed = recentOutcomes.filter(o => o.processingStatus === 'failure').length
  const failureRate = totalProcessed > 0 ? (totalFailed / totalProcessed) * 100 : 0

  if (failureRate >= thresholds.failureRatePercent) {
    await alerting.sendPage({
      title: 'High webhook failure rate',
      body: `${failureRate.toFixed(1)}% failure rate over past hour`,
      severity: 'high'
    })
  }
}

function groupBy<T>(items: T[], keyFn: (item: T) => string | number): Record<string, T[]> {
  return items.reduce((acc, item) => {
    const key = String(keyFn(item))
    if (!acc[key]) acc[key] = []
    acc[key].push(item)
    return acc
  }, {} as Record<string, T[]>)
}

Debugging Tools for Silent Failures

When you suspect a silent failure, start with these queries. They surface the gap between "webhook received" and "business outcome achieved."

-- Find webhooks that returned 200 but failed processing
SELECT event_id, event_type, received_at, failure_reason
FROM webhook_outcomes
WHERE http_status = 200
  AND processing_status = 'failure'
  AND received_at > NOW() - INTERVAL '24 hours'
ORDER BY received_at DESC;

-- Find event types with high skip rates
SELECT event_type,
       COUNT(*) as total,
       SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) as skipped,
       ROUND(100.0 * SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) / COUNT(*), 2) as skip_rate_pct
FROM webhook_outcomes
WHERE received_at > NOW() - INTERVAL '7 days'
GROUP BY event_type
HAVING SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 5
ORDER BY skip_rate_pct DESC;

-- Find events stuck in pending status for more than 5 minutes
SELECT event_id, event_type, received_at,
       EXTRACT(EPOCH FROM (NOW() - received_at)) as age_seconds
FROM webhook_outcomes
WHERE processing_status = 'pending'
  AND received_at < NOW() - INTERVAL '5 minutes'
ORDER BY age_seconds DESC;

The first query catches outright failures masked by a 200 response. The second catches conditional logic that's skipping more events than expected (often a sign that a lookup is broken). The third catches events that got queued but never processed, usually due to a dead worker or a stuck job queue.

Making Silent Failures Loud

The core principle: separate HTTP success from business success, then monitor both. Your webhook endpoint should return 200 to acknowledge receipt, but that doesn't mean processing succeeded. Track the full lifecycle, verify outcomes, run synthetic checks, and reconcile periodically. Only then can you trust that "no errors in the logs" actually means everything is working.

Silent failures are the worst kind because they erode trust slowly. Customers notice subscriptions that don't activate. Orders that don't ship. Notifications that never arrive. By the time you realize something is wrong, the damage is done. Build the observability layer that makes silent failures impossible to miss.