
Monitoring Webhook Health: Metrics, Alerts, and Dashboards That Actually Help

Generic monitoring misses webhook-specific problems. Track success rates, delivery latency, retry depth, and endpoint health scores to catch issues before your customers do.

WebhookVault Team · 8 min read

Generic Monitoring Misses Webhook Problems

I've watched teams set up Datadog dashboards with HTTP status code breakdowns, declare their webhook system "fully monitored," and then get blindsided by a silent failure that had been dropping events for 6 hours. The status codes all looked fine, showing 200s across the board. The problem was that an upstream provider had stopped sending events entirely, and nobody was tracking the absence of traffic. Generic application monitoring treats webhooks like any other HTTP endpoint. They are not. Webhook systems fail in ways that standard metrics do not capture: silent delivery gaps, degraded retry performance, slow endpoint drift, and correlated failures across consumer groups.

Google's Site Reliability Engineering book defines four golden signals for monitoring any distributed system: latency, traffic, errors, and saturation. These translate directly to webhooks, but with different meanings than you might assume. Latency is not request processing time. It is end-to-end delivery time from event creation to successful acknowledgment, including retries. Traffic is not requests per second, but events per second per endpoint, and both spikes and drops matter. Errors are not just 5xx responses. They include timeouts, connection resets, TLS failures, and permanent delivery failures after retry exhaustion. Saturation is not CPU or memory. It is queue depth and retry backlog, the pressure building behind your delivery pipeline.

If you only take one thing from this post: monitor the signals that are specific to webhook delivery, not just the ones that come free with your APM tool.

Success Rate: The Single Most Important Metric

Success rate is the metric that tells you whether your webhook system is doing its job. Everything else (latency, retry depth, queue length) is diagnostic. Success rate is the answer to the question your customers are actually asking: "Are my events getting delivered?"

Track it at three levels. Global success rate gives you a system-wide health indicator. Per-endpoint success rate tells you which consumers are having problems. Per-event-type success rate reveals whether specific event categories are failing. Maybe invoice.paid events work fine but subscription.updated events are getting rejected because of a payload schema change. I've seen teams stare at a healthy global rate of 98% while a single endpoint responsible for payment reconciliation was sitting at 40%, buried in the average.

interface SuccessWindow {
  success: number
  failure: number
  timestamp: number
}

class WebhookSuccessTracker {
  private windows: Map<string, SuccessWindow[]> = new Map()
  private readonly windowDurationMs = 15 * 60 * 1000 // 15 minutes
  private readonly maxWindows = 96 // 24 hours of 15-minute windows

  record(endpointId: string, succeeded: boolean): void {
    if (!this.windows.has(endpointId)) {
      this.windows.set(endpointId, [])
    }

    const windows = this.windows.get(endpointId)!
    const now = Date.now()
    const currentWindowStart = now - (now % this.windowDurationMs)

    let current = windows.find(w => w.timestamp === currentWindowStart)
    if (!current) {
      current = { success: 0, failure: 0, timestamp: currentWindowStart }
      windows.push(current)
      // Evict old windows
      while (windows.length > this.maxWindows) windows.shift()
    }

    if (succeeded) current.success++
    else current.failure++
  }

  getSuccessRate(endpointId: string, lookbackMs: number = this.windowDurationMs): number {
    const windows = this.windows.get(endpointId)
    if (!windows || windows.length === 0) return 1

    const cutoff = Date.now() - lookbackMs
    const relevant = windows.filter(w => w.timestamp >= cutoff)
    const total = relevant.reduce((sum, w) => sum + w.success + w.failure, 0)
    const successes = relevant.reduce((sum, w) => sum + w.success, 0)

    return total === 0 ? 1 : successes / total
  }
}

Alert when the success rate for any single endpoint drops below 95% over a 15-minute window. For critical endpoints (payment processors, order management systems), tighten that to 98%. A common mistake is alerting on individual failures. A single 503 does not warrant a page. Trends and sustained degradation do.
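Wiring those thresholds into code can be sketched as follows. The `classifySuccessRate` helper and its severity labels are illustrative, not part of the tracker above; feed it the output of `getSuccessRate` over a 15-minute window.

```typescript
type Severity = 'ok' | 'warning' | 'critical'

// Map a 15-minute success rate to an alert severity. The thresholds
// (95% for standard endpoints, 98% for critical ones) follow the
// guidance above; anything well below threshold escalates to a page.
function classifySuccessRate(rate: number, isCriticalEndpoint: boolean): Severity {
  const threshold = isCriticalEndpoint ? 0.98 : 0.95
  if (rate >= threshold) return 'ok'
  return rate < 0.70 ? 'critical' : 'warning'
}

console.log(classifySuccessRate(0.97, false)) // 'ok' for a standard endpoint
console.log(classifySuccessRate(0.97, true))  // 'warning' for a payment endpoint
```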

Latency: Why Averages Lie

Average delivery latency is a nearly useless metric for webhooks. An average of 200ms sounds healthy until you realize that 1% of your deliveries are taking 45 seconds and timing out. Those outliers represent real events that real users are waiting on: a payment confirmation, a shipping update, a subscription cancellation. The user whose webhook takes 45 seconds does not care that the average is 200ms.

Track P50, P95, and P99. P50 tells you the typical experience. P95 tells you how the slower deliveries behave. P99 tells you about tail latency, the worst-case scenarios that happen often enough to matter at scale. If you deliver 100,000 webhooks per day and your P99 is 30 seconds, that means 1,000 deliveries per day are taking half a minute or longer. At that volume, you will hear about it.

Computing percentiles from a stream of measurements without storing every value requires a sketch data structure. A simple approach is to maintain sorted buckets. For production systems, use a library like prom-client which implements histograms natively. But the concept matters more than the implementation: set your P95 delivery latency alert at 5 seconds and your P99 at 15 seconds. If you breach those thresholds, something has changed. Either the consumer is slowing down or your delivery pipeline is backing up.
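As a concrete illustration of the bucket idea, here is a minimal exact nearest-rank percentile helper. It is a sketch for modest per-window sample counts, not a replacement for a proper histogram library.

```typescript
// Nearest-rank percentile: sort the window of latency samples and return
// the smallest value with at least a fraction p of samples at or below it.
// Fine for thousands of samples; use a histogram sketch at higher volume.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil(p * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Illustrative latency samples in milliseconds, with two timeout-range outliers
const latencies = [120, 95, 200, 45000, 180, 150, 220, 130, 160, 30000]
percentile(latencies, 0.5)  // 160 — the typical delivery
percentile(latencies, 0.95) // 45000 — the tail the average hides
```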

Retry Depth Distribution

This is the metric most teams overlook entirely, and it is one of the most informative. Retry depth answers the question: "How many attempts does it take to successfully deliver a webhook?" In a healthy system, 85-90% of deliveries succeed on the first attempt. If you see that number drop to 70%, your endpoints are degraded even if the eventual success rate remains high.

Think of it as leading vs. lagging indicators. Success rate is a lagging indicator: by the time it drops significantly, events have already been lost. Retry depth is a leading indicator. A shift from "mostly first-attempt" to "mostly second or third attempt" signals trouble before it becomes a failure.

class RetryDepthHistogram {
  // Per-endpoint histogram: index i counts deliveries that needed i+1 attempts
  private buckets: Map<string, number[]> = new Map()

  record(endpointId: string, attempts: number): void {
    if (!this.buckets.has(endpointId)) {
      this.buckets.set(endpointId, new Array(10).fill(0))
    }
    const histogram = this.buckets.get(endpointId)!
    const bucket = Math.min(attempts - 1, histogram.length - 1)
    histogram[bucket]++
  }

  getDistribution(endpointId: string): { attempt: number; count: number; pct: string }[] {
    const histogram = this.buckets.get(endpointId)
    if (!histogram) return []

    const total = histogram.reduce((sum, count) => sum + count, 0)
    return histogram.map((count, i) => ({
      attempt: i + 1,
      count,
      pct: total > 0 ? ((count / total) * 100).toFixed(1) + '%' : '0%',
    }))
  }
}

Alert when more than 20% of deliveries for any endpoint require 3 or more attempts. That pattern almost always precedes a full outage. In practice, I've found that the retry depth distribution reveals intermittent problems (flaky load balancers, DNS resolution hiccups, aggressive rate limiting) that success rate alone never surfaces.
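The 20% threshold can be checked directly from the histogram buckets. This sketch assumes the same layout as `RetryDepthHistogram` above: index i holds the count of deliveries that needed i+1 attempts.

```typescript
// Fraction of deliveries that needed 3 or more attempts.
function deepRetryFraction(histogram: number[]): number {
  const total = histogram.reduce((sum, n) => sum + n, 0)
  if (total === 0) return 0
  const deep = histogram.slice(2).reduce((sum, n) => sum + n, 0) // attempts >= 3
  return deep / total
}

function shouldAlertOnRetryDepth(histogram: number[]): boolean {
  return deepRetryFraction(histogram) > 0.2
}

// 70 first-attempt, 10 second-attempt, 20 needing 3+ attempts:
shouldAlertOnRetryDepth([70, 10, 15, 5]) // false — exactly 20% does not trip the strict threshold
shouldAlertOnRetryDepth([60, 10, 25, 5]) // true — 30% of deliveries are deep retries
```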

Endpoint Health Scores

Individual metrics are useful for debugging. For an operational overview, the kind that lets an on-call engineer glance at a dashboard at 3 AM and know where the problems are, you need a composite score. An endpoint health score combines success rate, latency, and retry depth into a single number between 0 and 100.

The weighting matters. Success rate should dominate because it directly reflects whether events are reaching the consumer. Latency and retry depth are secondary signals that indicate degradation. I weight success rate at 60%, inverse latency at 20%, and retry efficiency at 20%. A perfect endpoint scores 100: every delivery succeeds on the first attempt with sub-second latency. A failing endpoint scores near 0.

interface EndpointMetrics {
  successRate: number       // 0 to 1
  p95LatencyMs: number      // in milliseconds
  avgRetryDepth: number     // average attempts per delivery
}

function calculateHealthScore(metrics: EndpointMetrics): number {
  // Success rate: 60% weight, linear scale
  const successScore = metrics.successRate * 100

  // Latency: 20% weight, inverse scale with 10s ceiling
  // 0ms = 100, 5000ms = 50, 10000ms+ = 0
  const latencyScore = Math.max(0, 100 - (metrics.p95LatencyMs / 100))

  // Retry depth: 20% weight
  // 1.0 attempts = 100, 3.0 attempts = 0
  const retryScore = Math.max(0, Math.min(100, (3 - metrics.avgRetryDepth) / 2 * 100))

  const score = (successScore * 0.6) + (latencyScore * 0.2) + (retryScore * 0.2)
  return Math.round(Math.max(0, Math.min(100, score)))
}

Sort your dashboard by health score ascending. The worst endpoints float to the top. In my experience, this single view replaces three or four separate metric panels and makes triage dramatically faster. A score below 70 warrants investigation. Below 50, something is actively broken.
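The triage view can be sketched like this. The endpoint names and scores are made up, and the band cutoffs (70 and 50) follow the guidance above.

```typescript
interface EndpointRow {
  name: string
  healthScore: number // 0-100, e.g. from calculateHealthScore
}

// Worst endpoints float to the top of the dashboard table.
function triage(rows: EndpointRow[]): EndpointRow[] {
  return [...rows].sort((a, b) => a.healthScore - b.healthScore)
}

function triageLabel(score: number): string {
  if (score < 50) return 'broken'
  if (score < 70) return 'investigate'
  return 'healthy'
}

const sorted = triage([
  { name: 'orders-api', healthScore: 92 },
  { name: 'billing-sync', healthScore: 41 },
  { name: 'crm-mirror', healthScore: 64 },
])
sorted[0].name                     // 'billing-sync' — the worst endpoint comes first
triageLabel(sorted[0].healthScore) // 'broken'
```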

Alerting Rules That Reduce Noise

Bad alerting is worse than no alerting. I've seen teams disable their webhook alerts because they fire 40 times a day on transient blips. Then a real outage happens, and nobody notices for 2 hours because the alerts were already muted. The problem is not the alerting system. It is the rules.

Write alerts that are specific, windowed, and actionable. "Error rate high" is none of those things. "Endpoint ep_abc123 success rate dropped below 90% over the last 15 minutes (currently 72%, 34 failures out of 120 attempts)." That tells you what is wrong, where, and how bad it is.

Here are four alert rules that cover 90% of webhook failure modes:

  • Success rate drop: any endpoint falls below 90% success over a 15-minute window. Severity: warning. Below 70%: severity critical.
  • Delivery latency spike: P95 latency exceeds 10 seconds for any endpoint over 5 minutes. This catches slow consumers before they start timing out entirely.
  • Retry storm: system-wide retry rate exceeds 25% of total delivery traffic over 10 minutes. Something big is failing and your retry backlog is growing.
  • Zero traffic: an endpoint that normally receives 50+ events per hour has received 0 in the last 30 minutes. Silent failures are the hardest to catch and often the most damaging.

Suppress duplicate alerts for the same endpoint within a 30-minute cooldown. Group related alerts: if five endpoints served by the same consumer all degrade simultaneously, that is one incident, not five. Route critical alerts to PagerDuty. Route warnings to Slack. Never page on a warning.
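The 30-minute cooldown can be implemented with a per-endpoint timestamp map. This is a minimal sketch; it omits the consumer-level grouping mentioned above.

```typescript
// Drop duplicate alerts for the same endpoint inside a cooldown window.
class AlertSuppressor {
  private lastFired: Map<string, number> = new Map()
  constructor(private cooldownMs: number = 30 * 60 * 1000) {}

  shouldFire(endpointId: string, now: number = Date.now()): boolean {
    const last = this.lastFired.get(endpointId)
    if (last !== undefined && now - last < this.cooldownMs) return false
    this.lastFired.set(endpointId, now)
    return true
  }
}

const suppressor = new AlertSuppressor()
suppressor.shouldFire('ep_abc123', 0)              // true: first alert fires
suppressor.shouldFire('ep_abc123', 10 * 60 * 1000) // false: inside the cooldown
suppressor.shouldFire('ep_abc123', 31 * 60 * 1000) // true: cooldown elapsed
```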

Building a Webhook Health Dashboard

A well-structured dashboard answers three questions in order: "Is the system healthy?" "Which endpoints have problems?" "What went wrong?" Lay it out in rows that follow that hierarchy.

Top row: global health. Four stat panels showing global success rate (last hour), total events delivered (last 24h), active endpoint count, and current retry queue depth. These should be glanceable, turning green when healthy and red when not. A single time-series graph showing global success rate over the last 7 days gives you trend context.

Middle row: per-endpoint breakdown. A table sorted by health score (ascending) with columns for endpoint name, success rate, P95 latency, retry depth, and last successful delivery timestamp. Color-code rows: green above 90%, yellow 70-90%, red below 70%. This is where you go when the top row shows a problem.

Bottom row: recent failures. A log panel showing the last 50 failed deliveries with timestamp, endpoint, HTTP status, error message, and attempt number. This is your drill-down for active debugging.

If you are using Prometheus and Grafana, these queries form the backbone:

# Global success rate (last 15 minutes)
sum(rate(webhook_deliveries_total{status="success"}[15m]))
/
sum(rate(webhook_deliveries_total[15m]))

# Per-endpoint P95 latency
histogram_quantile(0.95,
  sum by (endpoint_id, le) (
    rate(webhook_delivery_duration_seconds_bucket[15m])
  )
)

# Retry rate as percentage of total traffic
sum(rate(webhook_retries_total[10m]))
/
sum(rate(webhook_deliveries_total[10m])) * 100

# Zero-traffic detection (endpoints active in the last 24h but silent
# for 30 min). A plain "== 0" check never fires here: once a series has
# no samples in the lookback window, it disappears from the result entirely.
sum by (endpoint_id) (increase(webhook_deliveries_total[24h])) > 0
unless
sum by (endpoint_id) (increase(webhook_deliveries_total[30m])) > 0

Incident Response: When an Alert Fires

An alert without a runbook is just noise with urgency. When your webhook health alert fires at 2 AM, the on-call engineer should not have to reverse-engineer the system. They need a checklist.

Step 1: Assess scope

Is this a single endpoint or multiple? Check the dashboard. A single failing endpoint is likely a consumer-side problem. Multiple endpoints degrading simultaneously suggests a systemic issue: your delivery infrastructure, a shared network path, or a cloud provider incident.

Step 2: Check recent changes

Did a deployment happen in the last hour? Pull up your deployment log. Webhook delivery failures that correlate tightly with a deploy timestamp are almost always caused by that deploy, whether through a configuration change, a broken dependency, or a changed payload format.

Step 3: Inspect failure details

Look at the actual error responses. Are you getting 502s? The consumer's reverse proxy cannot reach the application server. 503s? The consumer is explicitly rejecting traffic. Timeouts? The consumer is processing too slowly. Connection refused? The consumer process is not running. Each pattern points to a different root cause.
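That failure-pattern mapping can be encoded as a small lookup for the runbook. The `DeliveryError` shape here is a simplified assumption for illustration, not a real HTTP client type.

```typescript
// Simplified error shape: either an HTTP status or a transport-level failure.
type DeliveryError =
  | { kind: 'status'; code: number }
  | { kind: 'timeout' }
  | { kind: 'connection_refused' }

// Translate an observed delivery error into a likely root cause.
function likelyCause(err: DeliveryError): string {
  if (err.kind === 'timeout') return 'consumer processing too slowly'
  if (err.kind === 'connection_refused') return 'consumer process not running'
  if (err.code === 502) return "consumer's reverse proxy cannot reach its app server"
  if (err.code === 503) return 'consumer explicitly rejecting traffic'
  return 'inspect the response body and headers'
}

likelyCause({ kind: 'status', code: 502 })
```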

Step 4: Contact or pause

For consumer-side failures: notify the consumer's team through whatever channel is established. For systemic issues: consider pausing delivery to affected endpoints to prevent retry storms from making things worse. Pausing does not mean losing events. They should queue and drain when delivery resumes. If events are piling up and the issue is not resolving quickly, check that your queue has enough capacity for the backlog.

Step 5: Post-incident

After resolution, verify that all queued events were delivered. Check the dead letter queue for events that exhausted retries during the incident. Replay them manually if needed. Update your runbook with anything you learned. If a new failure mode appeared that your monitoring did not catch, add a new alert rule for it.

Monitoring is not a set-and-forget activity. Every incident that surprises you is a gap in your observability. The goal is not to prevent all failures, because that is impossible in distributed systems. The goal is to detect failures faster than your customers do and resolve them before the impact compounds. The teams that do this well treat their monitoring configuration as code: versioned, reviewed, and continuously improved alongside the systems it watches.