The Webhook That Works... Sort Of
Your webhook endpoint returns 200. The provider's dashboard shows "delivered." Your access logs show the request arrived. But the customer's subscription never activated. The order never shipped. The user never got notified.
This is the nightmare scenario. Not an error that gets logged and alerted on, but a silent success that masks a business logic failure. I have seen teams lose thousands in revenue because a webhook handler acknowledged every request but stopped processing them after a database migration changed a column name. The logs showed green across the board for three days.
Why Silent Failures Happen
Most webhook handlers follow this pattern: receive request, verify signature, parse body, return 200 immediately, queue for async processing. That's the right architecture for keeping response times low. But it creates a gap between "HTTP request succeeded" and "business logic executed."
Here's where things go wrong. Your queue processor throws an exception parsing a field that used to be a string but is now an object. Or it tries to insert a record with a foreign key that doesn't exist yet because of event ordering issues. Or it hits a rate limit on a downstream API. The webhook endpoint has already returned 200, so the sender thinks everything is fine. Your error tracker catches the exception in the background worker, but nobody connects it back to the webhook delivery.
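The gap can be made concrete with a minimal in-memory sketch of the ack-then-process pattern (the queue and handler here are illustrative stand-ins for a real message queue and worker):

```typescript
// Minimal sketch of the ack-then-process pattern. The in-memory queue and
// handler are stand-ins for a real queue and background worker.
type QueuedEvent = { id: string; type: string; payload: unknown }

const queue: QueuedEvent[] = []

// The HTTP handler: acknowledge receipt and enqueue. It returns 200 as soon
// as the event is queued -- before any business logic runs.
function receiveWebhook(event: QueuedEvent): number {
  queue.push(event)
  return 200 // the sender sees success from this point on
}

// The background worker: this is where the silent failure lives. If it
// throws, the 200 has already been sent and the sender will never retry.
function drainQueue(process: (e: QueuedEvent) => void): Error[] {
  const errors: Error[] = []
  while (queue.length > 0) {
    const event = queue.shift()!
    try {
      process(event)
    } catch (err) {
      errors.push(err as Error) // invisible to the sender
    }
  }
  return errors
}
```

`receiveWebhook` returns 200 even when `drainQueue` later fails; that gap between acknowledgment and execution is exactly what the rest of this post instruments.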
The other common culprit is conditional logic that silently skips processing. Something like this:
async function processWebhook(event) {
  const user = await db.users.findById(event.data.userId)
  if (!user || !user.isActive) {
    // Skip processing for inactive users
    return { processed: false }
  }
  // ... rest of handler
}
Looks reasonable. But then a typo slips into the user lookup query, so user is always null. Or a migration changes isActive to status === 'active' and nobody updates this code. Every webhook returns success, zero errors get logged, and nothing happens.
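One defensive fix is to make every skip observable instead of silently returning: record a distinct reason for each skip so a broken lookup shows up as a spike in one reason. A sketch (the `skipLog` array stands in for a real metrics backend):

```typescript
// Sketch: record every skipped event with a reason instead of silently
// returning. The skipLog array is a stand-in for a metrics backend.
interface SkipRecord { eventId: string; reason: string }

const skipLog: SkipRecord[] = []

function skip(eventId: string, reason: string): { processed: false; reason: string } {
  skipLog.push({ eventId, reason }) // a real system would emit a metric here
  return { processed: false, reason }
}

interface User { id: string; isActive: boolean }

function processWebhook(
  event: { id: string; data: { userId: string } },
  findUser: (id: string) => User | null
) {
  const user = findUser(event.data.userId)
  // Distinguish "lookup returned nothing" from "user is inactive" --
  // a broken query floods the first reason, which monitoring can catch.
  if (!user) return skip(event.id, 'user_not_found')
  if (!user.isActive) return skip(event.id, 'user_inactive')
  // ... rest of handler
  return { processed: true as const }
}
```

With distinct reasons, the broken-lookup bug above surfaces as a flood of user_not_found skips rather than a clean zero-error run.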
What Traditional Monitoring Misses
Your APM tool shows webhook requests succeeding. Your error tracker shows no exceptions. Your uptime monitor sees 200 responses. Everything looks healthy. But your business metrics tell a different story: subscriptions created per day dropped by 40%, fulfillment SLA violations spiked, support tickets about "missing confirmations" tripled.
The disconnect is that most monitoring tools measure infrastructure health, not business outcomes. They tell you whether your code ran, not whether it did the right thing. A webhook handler that returns 200 but fails to create the database record is technically successful from an HTTP perspective.
Business Outcome Monitoring
The fix is to track business-level outcomes alongside technical metrics. For every webhook event type, define what "success" actually means in terms of side effects, then monitor whether those side effects happen.
Example: for a subscription.created webhook, success means:
- A record appears in the subscriptions table within 30 seconds
- The user receives a confirmation email within 60 seconds
- The subscription shows as active in the admin dashboard
Your monitoring should verify all three. If the webhook endpoint returns 200 but no subscription record gets created, that's a failure even if no exception was thrown.
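One way to express those criteria is a declarative checklist per event type, pairing each expected side effect with a deadline and a verification function. A sketch, where the lookup functions (`subscriptionExists`, `confirmationEmailLogged`) are hypothetical stand-ins for real database and email-log queries:

```typescript
// Stand-in lookups -- a real implementation would query the database and
// the email provider's send log.
const createdSubscriptions = new Set<string>()
const sentEmails = new Set<string>()
function subscriptionExists(id: string): boolean { return createdSubscriptions.has(id) }
function confirmationEmailLogged(id: string): boolean { return sentEmails.has(id) }

// Sketch: declare what "success" means per event type as a list of
// verifiable side effects. deadlineMs is informational in this sketch;
// a real checker would schedule verification to run at the deadline.
interface OutcomeCheck {
  description: string
  deadlineMs: number
  verify: (eventId: string) => Promise<boolean>
}

const outcomeChecks: Record<string, OutcomeCheck[]> = {
  'subscription.created': [
    { description: 'subscription record exists', deadlineMs: 30_000,
      verify: async id => subscriptionExists(id) },
    { description: 'confirmation email sent', deadlineMs: 60_000,
      verify: async id => confirmationEmailLogged(id) },
  ],
}

// Run every check for an event; any false result is a business failure,
// even though the HTTP response was 200.
async function verifyOutcomes(eventType: string, eventId: string): Promise<string[]> {
  const failures: string[] = []
  for (const check of outcomeChecks[eventType] ?? []) {
    if (!(await check.verify(eventId))) failures.push(check.description)
  }
  return failures
}
```

The benefit of the declarative shape is that adding a new event type means adding data, not code: the checker loop never changes.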
The implementation looks like this: after processing a webhook event, emit a structured event that captures both the HTTP result and the business outcome. Store it somewhere queryable (a database table, a metrics backend like Prometheus, or a log aggregator).
import type { WebhookEvent, ProcessingResult } from './types'

interface WebhookOutcome {
  eventId: string
  eventType: string
  receivedAt: Date
  httpStatus: number
  processingStatus: 'success' | 'failure' | 'skipped' | 'pending'
  businessOutcome?: {
    recordCreated: boolean
    notificationSent: boolean
    downstreamApiCalled: boolean
  }
  failureReason?: string
  processingDurationMs: number
}
async function trackWebhookOutcome(
  event: WebhookEvent,
  httpStatus: number,
  result: ProcessingResult
): Promise<void> {
  const outcome: WebhookOutcome = {
    eventId: event.id,
    eventType: event.type,
    receivedAt: new Date(),
    httpStatus,
    processingStatus: result.status,
    businessOutcome: result.businessOutcome,
    failureReason: result.error?.message,
    processingDurationMs: result.durationMs
  }

  await db.webhookOutcomes.insert(outcome)

  // Also emit as a metric for real-time dashboards
  metrics.increment('webhook.processed', {
    eventType: event.type,
    status: result.status,
    httpStatus: httpStatus.toString()
  })

  if (result.status === 'failure') {
    metrics.increment('webhook.business_failure', {
      eventType: event.type,
      reason: result.error?.code ?? 'unknown'
    })
  }
}
Now you can query for discrepancies: all events where httpStatus = 200 but processingStatus = 'failure', events where businessOutcome.recordCreated is false, or events stuck in pending status for more than five minutes.
End-to-End Verification with Synthetic Events
Even with business outcome tracking, there's a gap: you only know about failures for events that actually arrive. If the webhook provider's dispatcher breaks and stops sending events entirely, you have no way to detect it until customers complain.
The answer is synthetic monitoring. Every hour, trigger a test event on the sender's side (many providers have APIs for this), then verify that your system processes it correctly within a reasonable timeout. If the synthetic event doesn't result in the expected database record or API call, alert immediately.
import axios from 'axios'

interface SyntheticCheckResult {
  timestamp: Date
  eventType: string
  eventId: string
  delivered: boolean
  processed: boolean
  durationMs: number
  failureReason?: string
}

async function runSyntheticWebhookCheck(
  eventType: string
): Promise<SyntheticCheckResult> {
  const startTime = Date.now()
  const syntheticId = `synthetic_${Date.now()}_${Math.random().toString(36).slice(2)}`

  try {
    // Step 1: Trigger a test event via the provider's API
    await axios.post('https://api.provider.com/webhooks/trigger', {
      eventType,
      testMode: true,
      customId: syntheticId
    }, {
      headers: { 'Authorization': `Bearer ${process.env.PROVIDER_API_KEY}` }
    })

    // Step 2: Poll for processing (with timeout)
    const maxWaitMs = 30000
    const checkIntervalMs = 1000
    let elapsed = 0
    let processed = false

    while (elapsed < maxWaitMs) {
      await new Promise(resolve => setTimeout(resolve, checkIntervalMs))
      elapsed += checkIntervalMs

      const outcome = await db.webhookOutcomes.findByEventId(syntheticId)
      if (outcome && outcome.processingStatus === 'success') {
        processed = true
        break
      }
    }

    const durationMs = Date.now() - startTime
    return {
      timestamp: new Date(),
      eventType,
      eventId: syntheticId,
      delivered: true,
      processed,
      durationMs,
      failureReason: processed ? undefined : 'Processing timeout'
    }
  } catch (error) {
    return {
      timestamp: new Date(),
      eventType,
      eventId: syntheticId,
      delivered: false,
      processed: false,
      durationMs: Date.now() - startTime,
      failureReason: (error as Error).message
    }
  }
}
// Run every hour via cron
async function runAllSyntheticChecks() {
  const criticalEventTypes = [
    'subscription.created',
    'payment.succeeded',
    'user.signup'
  ]

  const results = await Promise.all(
    criticalEventTypes.map(type => runSyntheticWebhookCheck(type))
  )

  const failures = results.filter(r => !r.processed)
  if (failures.length > 0) {
    await alerting.sendPage({
      severity: 'high',
      title: 'Webhook synthetic check failures',
      body: `${failures.length} webhook types failing end-to-end verification`,
      details: failures
    })
  }
}
The key is running these checks in production against the real webhook flow, not in a separate test environment. You want to verify that the entire pipeline works: DNS, TLS, routing, authentication, queuing, processing, database writes. A test environment can't catch production-specific failures like a misconfigured firewall rule or a database that ran out of connections.
Reconciliation: The Nuclear Option
Even with synthetic checks and business outcome tracking, events can slip through the cracks. Network partitions, transient database errors, out-of-order delivery, race conditions in your processing logic. For critical data, you need a reconciliation layer that periodically verifies your database matches the source of truth.
The pattern is simple: every few hours, fetch a list of entities from the provider's API and compare against your local database. For any mismatches, either backfill the missing data or flag for manual review.
interface ReconciliationReport {
  timestamp: Date
  resourceType: string
  totalRemote: number
  totalLocal: number
  missing: string[]
  unexpected: string[]
  reconciled: number
}
async function reconcileSubscriptions(): Promise<ReconciliationReport> {
  // Fetch all active subscriptions from the provider
  const remoteSubscriptions = await fetchAllSubscriptionsFromProvider()
  const remoteIds = new Set(remoteSubscriptions.map(s => s.id))

  // Fetch all active subscriptions from the local DB
  const localSubscriptions = await db.subscriptions.findAll({ status: 'active' })
  const localIds = new Set(localSubscriptions.map(s => s.providerId))

  // Find discrepancies
  const missing = Array.from(remoteIds).filter(id => !localIds.has(id))
  const unexpected = Array.from(localIds).filter(id => !remoteIds.has(id))

  // Backfill missing subscriptions
  let reconciled = 0
  for (const missingId of missing) {
    const remoteSub = remoteSubscriptions.find(s => s.id === missingId)
    if (!remoteSub) continue

    await db.subscriptions.create({
      providerId: remoteSub.id,
      userId: remoteSub.customerId,
      status: remoteSub.status,
      planId: remoteSub.planId,
      createdAt: new Date(remoteSub.created * 1000),
      reconciledAt: new Date()
    })
    reconciled++
  }

  const report: ReconciliationReport = {
    timestamp: new Date(),
    resourceType: 'subscriptions',
    totalRemote: remoteIds.size,
    totalLocal: localIds.size,
    missing,
    unexpected,
    reconciled
  }

  // Store the report for an audit trail
  await db.reconciliationReports.insert(report)

  // Alert on significant drift
  if (missing.length > 5 || unexpected.length > 5) {
    await alerting.notify({
      title: 'Subscription reconciliation drift detected',
      body: `Missing: ${missing.length}, Unexpected: ${unexpected.length}`,
      severity: 'medium'
    })
  }

  return report
}
Reconciliation is expensive, both in API calls and compute time. Run it during off-peak hours, rate-limit your API requests, and only reconcile critical resources. For a payment provider, reconcile subscriptions and payments. For a CRM, reconcile contacts and deals. For an e-commerce platform, reconcile orders and inventory.
The goal isn't to reconcile in real-time (that defeats the point of webhooks). It's to have a safety net that catches drift over hours or days, before it compounds into a data integrity disaster.
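The fetchAllSubscriptionsFromProvider call above glosses over pagination and rate limiting. A sketch of what it might look like, assuming a cursor-paginated provider API (the fetchPage signature and cursor shape are assumptions, not a real provider's contract):

```typescript
// Sketch: paginated fetch with a fixed delay between pages to stay under
// the provider's rate limit. The fetchPage signature is an assumption.
interface Subscription { id: string; customerId: string; status: string }
interface Page { items: Subscription[]; nextCursor: string | null }

async function fetchAllSubscriptionsFromProvider(
  fetchPage: (cursor: string | null) => Promise<Page>,
  delayMs = 200 // ~5 requests/second
): Promise<Subscription[]> {
  const all: Subscription[] = []
  let cursor: string | null = null
  do {
    const page = await fetchPage(cursor)
    all.push(...page.items)
    cursor = page.nextCursor
    // Pause between pages so a large account doesn't burn the rate limit
    if (cursor) await new Promise(r => setTimeout(r, delayMs))
  } while (cursor)
  return all
}
```

Injecting fetchPage keeps the pagination loop testable without hitting the provider; a fixed delay is the simplest throttle, though a token bucket works better if reconciliation shares the rate limit with other callers.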
Alert Fatigue vs. Blind Spots
The danger with all this monitoring is alert fatigue. If you alert on every failed webhook, you'll drown in noise. Payment webhooks fail all the time due to transient network issues, and the retry mechanism handles it. But if you don't alert on failures, you miss the silent ones that never recover.
The heuristic I use: alert on patterns, not individual failures. One failed webhook is noise. Five consecutive failures for the same event type is a pattern. A 10% drop in successful processing rate over an hour is a pattern. A synthetic check failing twice in a row is a pattern.
interface AlertThresholds {
  consecutiveFailures: number
  failureRatePercent: number
  syntheticCheckFailures: number
  reconciliationDrift: number
}

const thresholds: AlertThresholds = {
  consecutiveFailures: 5,
  failureRatePercent: 10,
  syntheticCheckFailures: 2,
  reconciliationDrift: 10
}
async function evaluateAlertConditions() {
  const now = Date.now()
  const oneHourAgo = now - 3600000

  // Check consecutive failures per event type
  const recentOutcomes = await db.webhookOutcomes.findAll({
    createdAt: { gte: new Date(oneHourAgo) }
  })
  const byEventType = groupBy(recentOutcomes, o => o.eventType)

  for (const [eventType, outcomes] of Object.entries(byEventType)) {
    const recent = outcomes.slice(-thresholds.consecutiveFailures)
    if (recent.length >= thresholds.consecutiveFailures &&
        recent.every(o => o.processingStatus === 'failure')) {
      await alerting.sendPage({
        title: `Webhook failures: ${eventType}`,
        body: `${thresholds.consecutiveFailures} consecutive failures detected`,
        severity: 'high'
      })
    }
  }

  // Check the overall failure rate (guard against an empty window)
  const totalProcessed = recentOutcomes.length
  if (totalProcessed === 0) return

  const totalFailed = recentOutcomes.filter(o => o.processingStatus === 'failure').length
  const failureRate = (totalFailed / totalProcessed) * 100

  if (failureRate >= thresholds.failureRatePercent) {
    await alerting.sendPage({
      title: 'High webhook failure rate',
      body: `${failureRate.toFixed(1)}% failure rate over past hour`,
      severity: 'high'
    })
  }
}
function groupBy<T, K extends string | number>(
  items: T[],
  keyFn: (item: T) => K
): Record<string, T[]> {
  return items.reduce((acc, item) => {
    const key = String(keyFn(item))
    if (!acc[key]) acc[key] = []
    acc[key].push(item)
    return acc
  }, {} as Record<string, T[]>)
}
Debugging Tools for Silent Failures
When you suspect a silent failure, start with these queries. They surface the gap between "webhook received" and "business outcome achieved."
-- Find webhooks that returned 200 but failed processing
SELECT event_id, event_type, received_at, failure_reason
FROM webhook_outcomes
WHERE http_status = 200
AND processing_status = 'failure'
AND received_at > NOW() - INTERVAL '24 hours'
ORDER BY received_at DESC;
-- Find event types with high skip rates
-- (the HAVING clause repeats the expression because Postgres does not
-- allow column aliases in HAVING)
SELECT event_type,
       COUNT(*) AS total,
       SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) AS skipped,
       ROUND(100.0 * SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) / COUNT(*), 2) AS skip_rate_pct
FROM webhook_outcomes
WHERE received_at > NOW() - INTERVAL '7 days'
GROUP BY event_type
HAVING 100.0 * SUM(CASE WHEN processing_status = 'skipped' THEN 1 ELSE 0 END) / COUNT(*) > 5
ORDER BY skip_rate_pct DESC;
-- Find events stuck in pending status for more than 5 minutes
SELECT event_id, event_type, received_at,
       EXTRACT(EPOCH FROM (NOW() - received_at)) AS age_seconds
FROM webhook_outcomes
WHERE processing_status = 'pending'
  AND received_at < NOW() - INTERVAL '5 minutes'
  AND received_at > NOW() - INTERVAL '1 hour'
ORDER BY age_seconds DESC;
The first query catches outright failures masked by a 200 response. The second catches conditional logic that's skipping more events than expected (often a sign that a lookup is broken). The third catches events that got queued but never processed, usually due to a dead worker or a stuck job queue.
Making Silent Failures Loud
The core principle: separate HTTP success from business success, then monitor both. Your webhook endpoint should return 200 to acknowledge receipt, but that doesn't mean processing succeeded. Track the full lifecycle, verify outcomes, run synthetic checks, and reconcile periodically. Only then can you trust that "no errors in the logs" actually means everything is working.
Silent failures are the worst kind because they erode trust slowly. Customers notice subscriptions that don't activate. Orders that don't ship. Notifications that never arrive. By the time you realize something is wrong, the damage is done. Build the observability layer that makes silent failures impossible to miss.