The Delivery Guarantee Spectrum
Every webhook system makes a promise about delivery. Sometimes the promise is explicit, documented in an API guide with retry counts and timeout windows. More often it is implicit, baked into the code without anyone having thought carefully about what happens when things go wrong. Understanding the three levels of delivery guarantees will save you from building a system that silently drops events or, worse, processes the same payment twice.
Think of it like sending a letter. At-most-once is dropping a postcard in a mailbox with no return address. If it gets lost, you will never know. At-least-once is sending a registered letter with delivery confirmation. If you do not get the confirmation back, you send it again, accepting that the recipient might get two copies. Exactly-once is a bank wire transfer where both sides coordinate through a shared ledger to guarantee the money moves precisely once. The first two are straightforward to implement. The third, in the context of webhooks over HTTP, is a theoretical ideal that distributed systems cannot actually achieve.
At-Most-Once: Fire and Forget
The simplest delivery model. Send the webhook. If it fails, move on. No retries, no tracking, no delivery state. A single HTTP request, and whatever happens, happens.
async function sendWebhookFireAndForget(
  url: string,
  payload: object
): Promise<void> {
  try {
    await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(5_000),
    })
  } catch {
    // Delivery failed. No retry. Move on.
  }
}
This is fast. No retry queues, no state to manage, no background workers. The trade-off is obvious: you will lose events. Network blips, server restarts, DNS hiccups: any transient failure means the event vanishes. I have seen teams use at-most-once for analytics pings, real-time UI notifications, and cache invalidation signals where missing an occasional event is acceptable. If a user's dashboard does not update for 5 seconds because one Server-Sent Event got dropped, nobody notices. If a payment confirmation gets dropped, that is a different story.
At-Least-Once: Retry Until Confirmed
This is the delivery model that Stripe, GitHub, Shopify, and virtually every serious webhook provider uses. The logic: send the webhook, wait for a 2xx response, and if you do not get one, retry with exponential backoff until you either succeed or exhaust your retry budget.
The guarantee is that every event will be delivered at least one time, assuming the receiver eventually comes back online within your retry window. The cost is duplicate deliveries. If the receiver processes your webhook and sends back a 200, but the connection drops before that response reaches you, you have no way to tell the difference between "they never got it" and "they got it but I did not get the ack." So you retry. They get it again.
In practice, duplicate rates are low, typically under 0.1% for well-built systems. But "low" is not "zero." For payment processing, order fulfillment, or any state-changing operation, even a 0.01% duplicate rate causes real problems at scale. A system processing 1 million webhooks per day at 0.01% duplicates will double-process 100 events daily. That is why at-least-once delivery always pairs with idempotent consumers on the receiving side.
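Sketched as code, the sender side of this loop looks roughly like the following. The five-attempt budget and the 1s/2s/4s backoff schedule are illustrative values, not any particular provider's policy:

```typescript
// Backoff delay for a given attempt: 1s, 2s, 4s, ... plus up to 50% jitter
// so synchronized retries from many events do not stampede the receiver.
function backoffDelay(attempt: number): number {
  const base = 1_000 * 2 ** attempt
  return base + Math.random() * base * 0.5
}

// At-least-once sender: retry until a 2xx acknowledgment arrives or the
// attempt budget runs out. Returns whether delivery was confirmed.
async function sendWithRetry(
  url: string,
  payload: object,
  maxAttempts = 5
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload),
        signal: AbortSignal.timeout(5_000),
      })
      if (res.ok) return true // 2xx received: delivery confirmed
      // Non-2xx status: fall through to backoff and retry
    } catch {
      // Timeout or network error: the outcome is ambiguous, so retry
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)))
  }
  return false // budget exhausted: surface for alerting or a dead-letter queue
}
```

Note the catch branch: a timeout after the receiver already processed the request still triggers a retry. That ambiguity is exactly where duplicate deliveries come from.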
Exactly-Once: Why It Is Nearly Impossible
The Two Generals' Problem, formalized in 1975, proves that two parties communicating over an unreliable channel cannot reach guaranteed agreement. Applied to webhooks: the sender sends a message and waits for an acknowledgment. If the acknowledgment never arrives, the sender cannot distinguish between the receiver never getting the message and the acknowledgment getting lost on the way back. No amount of additional messages fixes this because each additional acknowledgment is itself subject to the same uncertainty.
This means exactly-once delivery over HTTP is not achievable. Full stop. But here is the distinction that matters: exactly-once processing is absolutely achievable. You accept that the message may arrive more than once, and you build the receiver to handle duplicates gracefully. The sender provides at-least-once delivery. The receiver provides idempotent processing. Together, the system behaves as if each event was processed exactly once, even though the network made no such guarantee. This is how every reliable webhook integration works in production.
The Transactional Outbox Pattern
Before talking about how receivers handle duplicates, we need to address a fundamental problem on the sending side: the dual-write problem. Most webhook senders need to do two things when a business event happens: update the database and send a webhook. These are two separate operations, and without coordination, things go wrong in predictable ways.
Scenario one: you update the database, then send the webhook. If the send fails (network error, process crash), the database reflects a change that was never communicated. The receiver misses the event. Scenario two: you send the webhook first, then update the database. If the database write fails, you have notified the receiver about something that did not actually happen. A phantom event. Both scenarios create inconsistency between your system and your consumers. I have seen this bug in production systems more times than I can count, and it is almost always discovered by a customer reporting missing or phantom events weeks after the code shipped.
The transactional outbox pattern solves this cleanly. Instead of sending the webhook directly, you write the event to an outbox table in the same database transaction as the business data change. A separate background process reads from the outbox and handles delivery. Because the business write and the outbox write are in the same transaction, they either both commit or both roll back. No lost events, no phantom events.
-- Outbox table schema
CREATE TABLE webhook_outbox (
  id          BIGSERIAL PRIMARY KEY,
  event_type  VARCHAR(100) NOT NULL,
  payload     JSONB NOT NULL,
  endpoint_id VARCHAR(64) NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  sent_at     TIMESTAMPTZ,
  attempts    INT NOT NULL DEFAULT 0,
  next_retry  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_outbox_pending
  ON webhook_outbox (next_retry)
  WHERE sent_at IS NULL;
The business logic writes to the outbox within the same transaction:
async function createOrder(
  db: Pool,
  order: OrderData,
  webhookEndpoint: string
): Promise<void> {
  const client = await db.connect()
  try {
    await client.query('BEGIN')
    // Business write
    const result = await client.query(
      'INSERT INTO orders (customer_id, total, status) VALUES ($1, $2, $3) RETURNING id',
      [order.customerId, order.total, 'confirmed']
    )
    // Outbox write (same transaction)
    await client.query(
      `INSERT INTO webhook_outbox (event_type, payload, endpoint_id)
       VALUES ($1, $2, $3)`,
      [
        'order.created',
        JSON.stringify({
          id: result.rows[0].id,
          customerId: order.customerId,
          total: order.total,
          status: 'confirmed',
        }),
        webhookEndpoint,
      ]
    )
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}
A separate publisher process polls the outbox and sends pending webhooks. On success, it marks them as sent. On failure, it increments the attempt count and schedules a retry using exponential backoff. This decouples the business transaction from webhook delivery, so your API response times are unaffected by slow or failing webhook endpoints.
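A minimal publisher pass might look like the sketch below. The structural `OutboxPool`/`OutboxClient` types stand in for pg's `Pool`/`PoolClient` so the example is self-contained, and the backoff schedule (one minute doubling per attempt, capped at an hour) is an assumption, not a prescription. `FOR UPDATE SKIP LOCKED` lets several publisher instances poll concurrently without claiming the same rows:

```typescript
// Minimal structural stand-ins for pg's Pool and PoolClient so the sketch
// is self-contained; real code would use the pg types directly.
interface OutboxClient {
  query(text: string, values?: unknown[]): Promise<{ rows: any[] }>
  release(): void
}
interface OutboxPool {
  connect(): Promise<OutboxClient>
}

// Next retry delay after a failed attempt: one minute doubling per attempt,
// capped at one hour. An assumed schedule, not a prescription.
function nextRetryMs(attempts: number): number {
  return Math.min(60_000 * 2 ** attempts, 3_600_000)
}

// One polling pass of a hypothetical outbox publisher. `send` is the actual
// HTTP delivery, assumed to throw on failure.
async function publishPending(
  db: OutboxPool,
  send: (endpointId: string, payload: object) => Promise<void>
): Promise<void> {
  const client = await db.connect()
  try {
    await client.query('BEGIN')
    const { rows } = await client.query(
      `SELECT id, event_type, payload, endpoint_id, attempts
         FROM webhook_outbox
        WHERE sent_at IS NULL AND next_retry <= NOW()
        ORDER BY id
        LIMIT 100
        FOR UPDATE SKIP LOCKED`
    )
    for (const row of rows) {
      try {
        await send(row.endpoint_id, row.payload)
        await client.query(
          'UPDATE webhook_outbox SET sent_at = NOW() WHERE id = $1',
          [row.id]
        )
      } catch {
        // Delivery failed: bump the attempt count, push the retry out
        await client.query(
          `UPDATE webhook_outbox
              SET attempts = attempts + 1,
                  next_retry = NOW() + ($2::int * INTERVAL '1 millisecond')
            WHERE id = $1`,
          [row.id, nextRetryMs(row.attempts)]
        )
      }
    }
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}
```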
Idempotent Consumers: The Receiver's Responsibility
The sender provides at-least-once delivery. The receiver must handle duplicates. The pattern is simple: store the ID of every processed event, and check incoming events against that store before doing any work.
interface DeduplicationStore {
  has(eventId: string): Promise<boolean>
  add(eventId: string, ttlSeconds: number): Promise<void>
}

// Redis-backed dedup store with automatic expiry
class RedisDeduplicationStore implements DeduplicationStore {
  constructor(private redis: RedisClient) {}

  async has(eventId: string): Promise<boolean> {
    const exists = await this.redis.exists(`webhook:seen:${eventId}`)
    return exists === 1
  }

  async add(eventId: string, ttlSeconds: number): Promise<void> {
    await this.redis.set(`webhook:seen:${eventId}`, '1', 'EX', ttlSeconds)
  }
}
async function handleWebhook(
  req: Request,
  dedup: DeduplicationStore
): Promise<Response> {
  const eventId = req.headers.get('X-Webhook-Id')
  if (!eventId) {
    return new Response('Missing event ID', { status: 400 })
  }
  if (await dedup.has(eventId)) {
    // Already processed, return 200 so the sender stops retrying
    return new Response('Duplicate', { status: 200 })
  }
  const body = await req.json()
  await processEvent(body)
  // Mark as processed with 7-day TTL (covers any retry window)
  await dedup.add(eventId, 604_800)
  return new Response('OK', { status: 200 })
}
The TTL on the dedup key matters. Set it too short and a delayed retry slips through as a duplicate. Set it too long and your storage grows without bound. Seven days covers the retry window of every major webhook provider I have worked with. Stripe retries over 3 days, GitHub over a few hours, Shopify over 48 hours. Seven days gives comfortable margin for all of them.
One subtlety: always return a 200 for duplicates, never a 4xx. The sender does not know the receiver already processed the event. If you return a 409 or 422, many senders will interpret that as a failure and keep retrying, creating an infinite retry loop for an event you already handled.
Ordering Guarantees
Delivery guarantees answer the question "will the event arrive?" Ordering guarantees answer "will events arrive in the right sequence?" These are separate problems, and conflating them leads to brittle architectures.
Global ordering, where every event arrives in the exact order it was produced, is expensive and fragile. It requires a single delivery pipeline, kills parallelism, and means one slow consumer blocks everything behind it. I have rarely seen a webhook system where global ordering is worth the cost.
Per-entity ordering is more practical. Events about the same customer, order, or resource should arrive in sequence, but events about different entities can be delivered in any order. The sender includes a sequence number scoped to the entity, and the receiver tracks the last processed sequence per entity.
async function handleOrderedWebhook(
  event: { entityId: string; sequence: number; data: unknown },
  db: Pool
): Promise<{ processed: boolean }> {
  const result = await db.query(
    'SELECT last_sequence FROM entity_sequences WHERE entity_id = $1',
    [event.entityId]
  )
  const lastSequence = result.rows[0]?.last_sequence ?? 0
  if (event.sequence <= lastSequence) {
    // Already processed or out of order, skip
    return { processed: false }
  }
  if (event.sequence > lastSequence + 1) {
    // Gap detected: a prior event is missing.
    // Queue this event for later processing or request a backfill.
    await db.query(
      'INSERT INTO pending_events (entity_id, sequence, data) VALUES ($1, $2, $3)',
      [event.entityId, event.sequence, JSON.stringify(event.data)]
    )
    return { processed: false }
  }
  // Process in order
  await processEvent(event.data)
  await db.query(
    `INSERT INTO entity_sequences (entity_id, last_sequence)
     VALUES ($1, $2)
     ON CONFLICT (entity_id) DO UPDATE SET last_sequence = $2`,
    [event.entityId, event.sequence]
  )
  return { processed: true }
}
Gap handling is the hard part. When event 5 arrives but you only have events 1 through 3, you need a strategy. Buffering and waiting is the simplest approach: hold event 5 in a pending table and process it when event 4 arrives. The alternative is requesting a backfill from the sender, but not all providers support that. In my experience, most ordering problems in webhook systems come from retry timing rather than actual production ordering, so a short buffer (30-60 seconds) resolves the vast majority of gaps without intervention.
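Buffer-and-wait also needs a drain step: once the missing event arrives and is processed, any buffered successors become processable as one consecutive run. That selection logic can be sketched as a pure helper; reading from and deleting rows in a table like pending_events is assumed to happen around it:

```typescript
// Given the last processed sequence for an entity and a buffer of
// out-of-order events keyed by sequence number, return the consecutive
// run that can now be processed, in order. Stops at the first gap.
function drainable<T>(
  lastSequence: number,
  buffered: Map<number, T>
): Array<{ sequence: number; data: T }> {
  const ready: Array<{ sequence: number; data: T }> = []
  let next = lastSequence + 1
  while (buffered.has(next)) {
    ready.push({ sequence: next, data: buffered.get(next)! })
    next++
  }
  return ready
}
```

For example, with events 4, 5, and 7 buffered and event 3 just processed, the helper returns 4 and 5 but holds 7 back until 6 arrives.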
The Dual-Write Problem in Detail
I mentioned the dual-write problem earlier, but it deserves closer attention because it is the single most common source of webhook reliability bugs. The pattern looks innocent:
// DANGEROUS: dual write without coordination
async function handlePayment(payment: Payment): Promise<void> {
  // Write 1: update database
  await db.query(
    'UPDATE payments SET status = $1 WHERE id = $2',
    ['completed', payment.id]
  )
  // Write 2: send webhook
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ event: 'payment.completed', data: payment }),
  })
}
If the process crashes between the database write and the webhook send, the payment is marked as completed but the consumer never finds out. If you reverse the order and send the webhook first, then update the database, a database failure means you have told the consumer about a payment that your system does not consider complete. Both orderings are wrong.
The outbox pattern eliminates this entirely. The webhook event is written to the outbox in the same transaction as the payment update. They succeed together or fail together. The background publisher handles delivery independently, with its own retry logic. There is no window where the database and the webhook can be out of sync.
An alternative to the outbox is Change Data Capture (CDC), using tools like Debezium to stream database changes to a message broker, which then triggers webhook delivery. CDC is more infrastructure to operate, but it has the advantage of not requiring any changes to your application's write path. For teams that already run Kafka or similar, CDC is a strong option. For most teams starting out, the outbox pattern is simpler and has fewer moving parts.
Choosing the Right Guarantee
Not every webhook needs the same level of reliability. Over-engineering delivery for low-value events wastes resources. Under-engineering for high-value events causes incidents. Here is how I think about the decision.
At-most-once works for events where the cost of missing one is minimal and the volume is high. Real-time analytics, activity feeds, typing indicators, cache busts. If you are sending 50,000 analytics pings per minute, adding retry infrastructure for each one is wasteful. Accept the ~0.5% loss rate and move on.
At-least-once with idempotent consumers is the right choice for the vast majority of webhook use cases. Payment notifications, order updates, subscription changes, deployment triggers, anything where a missed event would cause a user-visible problem or a support ticket. The outbox pattern on the sending side plus idempotency keys on the receiving side gives you reliable, consistent event delivery with manageable complexity.
At-least-once with ordered delivery adds sequence tracking on top. Reserve this for cases where processing events out of order would cause incorrect state, such as state machine transitions, incremental balance updates, or audit logs. The added complexity of gap detection and buffering is only justified when ordering actually matters for correctness, not just for aesthetics.
A pattern I have seen work well at scale: default to at-least-once for all webhook events, then layer ordering only for specific event types that need it. Do not try to build one universal pipeline that handles every guarantee level. Let the event type determine the processing strategy, and keep the common path simple.
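One way to keep that routing explicit is a small lookup from event type to processing strategy, with at-least-once as the default and ordering opted into per type. The event type names here are hypothetical:

```typescript
// Let the event type select the processing strategy instead of forcing
// one universal pipeline to satisfy every guarantee level.
type Strategy = 'at-most-once' | 'at-least-once' | 'at-least-once-ordered'

// Hypothetical event types; only the ones that need ordering pay for it.
const strategyByEventType: Record<string, Strategy> = {
  'analytics.ping': 'at-most-once',
  'payment.completed': 'at-least-once',
  'order.status_changed': 'at-least-once-ordered',
}

function strategyFor(eventType: string): Strategy {
  // Default to at-least-once: the safe choice for anything unlisted.
  return strategyByEventType[eventType] ?? 'at-least-once'
}
```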
The bottom line: exactly-once delivery is a myth over HTTP. Accept it. Build senders that deliver at-least-once using the outbox pattern, and receivers that process idempotently using dedup stores. This combination gives you the reliability of exactly-once semantics without fighting the fundamental constraints of distributed systems.