It Works in Staging
Your webhook handler passes every test. Load testing with 100 requests per second? No problem. Deploy to production, and within an hour you start seeing intermittent failures. 2% of webhooks time out. Connection refused errors spike during traffic bursts. Restart the pods and it clears up for 20 minutes, then returns.
These are not code bugs. The logic is fine. This is infrastructure failing under real-world conditions that staging never replicates. Database connection pools saturate. TCP sockets leak. Load balancers drop requests during deployments. Container orchestration restarts your pod mid-request. None of this shows up when you test with synthetic traffic.
I have debugged enough of these to recognize the pattern: if the failure rate is below 5%, it disappears in local testing, and it correlates with traffic spikes, you are looking at an infrastructure bottleneck. Stop inspecting your application code and start looking at what sits between the internet and your handler.
Connection Pool Exhaustion
Most Node.js webhook handlers use a database connection pool. Postgres, MySQL, Redis. The default pool size is usually 10 connections. Works great in development. In production, under burst traffic from webhook retries, you exhaust those 10 connections in seconds. New requests wait for a connection to free up, the webhook sender times out after 10 seconds, and you log a "connection timeout" error that points at nothing useful.
The symptom: webhook failures cluster in bursts. You see 50 successful deliveries, then 8 timeouts in a row, then success again. Your application logs show "waiting for available connection" right before the timeout. If you check your database metrics, you will see connection count maxed out during those windows.
The knee-jerk fix is to increase the pool size to 50 or 100. Bad idea. Your database has a hard limit on total connections (Postgres defaults to 100), and if you are running 5 application instances with 50 connections each, you are fighting for 250 connections when the database can only handle 100. The right answer is to tune pool size per instance and add connection timeout visibility.
import { Pool } from 'pg'

const pool = new Pool({
  host: process.env.DB_HOST,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 15,                       // per-instance cap; keep instances × max under the DB limit
  idleTimeoutMillis: 30000,      // close connections idle longer than 30s
  connectionTimeoutMillis: 3000, // fail fast instead of queueing forever
})

pool.on('error', (err) => {
  console.error('Unexpected pool error', {
    message: err.message,
    stack: err.stack,
    poolSize: pool.totalCount,
    idleSize: pool.idleCount,
    waitingSize: pool.waitingCount,
  })
})

pool.on('connect', () => {
  console.log('Pool connection created', {
    totalCount: pool.totalCount,
    idleCount: pool.idleCount,
    waitingCount: pool.waitingCount,
  })
})

// Expose pool metrics for your monitoring
export function getPoolMetrics() {
  return {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  }
}
Set max so that (instances × max) stays safely under your database's connection limit. If your DB supports 100 connections and you run 5 instances, use 18 per instance (5 × 18 = 90, leaving 10 for admin sessions and migrations). Set connectionTimeoutMillis low enough that your webhook sender does not time out first. If Stripe waits 20 seconds and your connection timeout is 30 seconds, Stripe gives up before you even know there is a problem.
Monitor waitingCount. If it stays above 0 for more than a few seconds, you are undersized or leaking connections. Check for code paths that never release their client back to the pool (missing finally blocks, unhandled promise rejections that skip cleanup).
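A release-always helper closes the leak path mentioned above. This is a sketch under my own naming (`withClient`, `PoolLike`); pg's `Pool` satisfies the interface shape, but the fake-pool-friendly generics are an assumption for illustration:

```typescript
// Any pool exposing connect(), handing out clients exposing release().
// node-postgres's Pool/PoolClient match this shape.
interface ClientLike { release(): void }
interface PoolLike<C extends ClientLike> { connect(): Promise<C> }

// Checks out a client and guarantees it is returned to the pool,
// even when fn throws — so waitingCount cannot grow from leaks.
export async function withClient<C extends ClientLike, T>(
  pool: PoolLike<C>,
  fn: (client: C) => Promise<T>
): Promise<T> {
  const client = await pool.connect()
  try {
    return await fn(client)
  } finally {
    client.release() // runs on success and on throw
  }
}
```

Route every query through a helper like this and the "forgot to release" class of leaks disappears by construction.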
TCP Socket Leaks
Node.js HTTP agents pool TCP sockets for outbound requests. If your webhook handler makes HTTP calls to external APIs (lookup customer data, send a Slack notification, hit a third-party service), each request potentially creates a socket. Those sockets should be reused or closed after the response completes, but if you create agents per request, leave the pool uncapped, or never consume response bodies, they linger.
After a few thousand webhook deliveries, you hit the OS file descriptor limit. New outbound requests fail with EMFILE or ENFILE errors. The process cannot open new files, including database connections and log files. Everything grinds to a halt. Restart the pod and it clears up for an hour, then comes back.
The fix is to configure HTTP agents with explicit limits and aggressive timeouts:
import http from 'http'
import https from 'https'
import axios from 'axios'

const httpAgent = new http.Agent({
  keepAlive: true,       // reuse sockets instead of opening one per request
  keepAliveMsecs: 10000,
  maxSockets: 50,        // cap concurrent sockets per host
  maxFreeSockets: 10,    // cap idle sockets kept around for reuse
  timeout: 5000,         // destroy sockets idle longer than 5s
})

const httpsAgent = new https.Agent({
  keepAlive: true,
  keepAliveMsecs: 10000,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 5000,
})

const client = axios.create({
  timeout: 8000, // keep below your webhook sender's timeout
  httpAgent,
  httpsAgent,
})

// Cleanup on shutdown
process.on('SIGTERM', () => {
  httpAgent.destroy()
  httpsAgent.destroy()
})
maxSockets caps concurrent sockets per host. maxFreeSockets limits how many idle sockets stick around. Set timeout lower than your webhook sender's timeout so slow external APIs do not cascade into webhook delivery failures. If an external lookup takes 12 seconds and Stripe gives up at 10, Stripe records a failure even though your handler is still running.
To diagnose socket leaks in production, check lsof or /proc/{pid}/fd. If you see thousands of sockets in CLOSE_WAIT or FIN_WAIT, you are not cleaning up properly. Add monitoring for open file descriptors and alert when it crosses 50% of your OS limit.
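That file-descriptor monitoring can be a few lines of Node. A sketch, Linux-only (it reads /proc/self/fd), with an illustrative soft limit of my choosing:

```typescript
import fs from 'fs'

// Counts this process's currently open file descriptors.
// /proc/self/fd holds one entry per open fd (Linux only).
export function openFdCount(): number {
  return fs.readdirSync('/proc/self/fd').length
}

// Example alert check: softLimit might be half of `ulimit -n`,
// e.g. 512 when the OS limit is 1024.
export function fdPressure(softLimit = 512): boolean {
  return openFdCount() > softLimit
}
```

Export the count through your metrics endpoint and alert on it the same way you alert on the pool's waitingCount.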
Load Balancer Idle Timeouts
Most cloud load balancers (AWS ALB, GCP Cloud Load Balancer, Azure Application Gateway) have an idle timeout between 30 and 60 seconds. If your webhook handler takes longer than that to respond, the load balancer closes the connection to the client even though your application is still processing. The webhook sender sees a connection reset. Your handler finishes successfully and logs a 200 response, but the sender never receives it.
This creates a particularly nasty failure mode: your application logs show success, your database shows the webhook was processed, but the sender retries because they never got a response. You end up with duplicate processing unless you have robust idempotency checks.
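An idempotency check here means keying on the sender's event ID and refusing to process it twice. A sketch using Postgres ON CONFLICT; the table and column names (`processed_events`, `provider_event_id`) and the injectable query function are my own for illustration:

```typescript
// Minimal query signature; pg's pool.query(sql, params) satisfies it.
type QueryFn = (sql: string, params: unknown[]) => Promise<{ rowCount: number }>

// Returns true only the first time an event ID is seen.
// A unique constraint on provider_event_id makes the check atomic,
// even across concurrent deliveries of the same event.
export async function claimEvent(
  query: QueryFn,
  providerEventId: string
): Promise<boolean> {
  const result = await query(
    `INSERT INTO processed_events (provider_event_id)
     VALUES ($1)
     ON CONFLICT (provider_event_id) DO NOTHING`,
    [providerEventId]
  )
  return result.rowCount === 1 // 0 means a duplicate: acknowledge and skip
}
```

On a duplicate you still return 200 to the sender; you just skip the side effects.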
The pattern I mentioned earlier applies here too: acknowledge the webhook immediately, queue for async processing. Your HTTP response completes in under 100ms, well below any load balancer timeout, and the actual work happens in a background job.
import type { Request, Response } from 'express'
import { v4 as uuid } from 'uuid'

app.post('/webhooks', async (req: Request, res: Response) => {
  const eventId = uuid()
  // rawBody must be captured by your body parser (e.g. a verify hook)
  const rawBody = (req as any).rawBody as Buffer

  // Persist first: the row survives even if the enqueue below fails
  await db.query(
    'INSERT INTO webhook_events (id, payload, status) VALUES ($1, $2, $3)',
    [eventId, rawBody.toString('utf8'), 'pending']
  )

  res.status(200).json({ eventId })
  await queue.enqueue('webhook.process', { eventId })
})
If you absolutely must process synchronously (rare, but some providers expect an immediate response computed by your handler), configure your load balancer's idle timeout to match your maximum expected processing time plus a buffer. AWS ALB supports idle timeouts up to 4000 seconds; GCP up to 3600. But honestly, if your webhook handler takes more than 30 seconds, you are doing too much in the request path.
Container Restarts Mid-Request
Kubernetes, ECS, Docker Swarm. They all do rolling deployments. Your pod receives a SIGTERM, your orchestrator waits a grace period (default 30 seconds in Kubernetes), then SIGKILL if the process has not exited. During that grace period, new connections are routed to other pods, but in-flight requests are supposed to complete. In practice, many applications do not handle SIGTERM gracefully.
Without proper shutdown handling, your webhook handler gets killed mid-request. The sender sees a connection reset or timeout. You log nothing because the process was terminated before it could flush logs. The event never gets marked as processed, so when the sender retries, you process it again on a different pod.
The fix is to implement graceful shutdown: stop accepting new requests, wait for in-flight requests to complete, then exit cleanly.
import express from 'express'
import http from 'http'

const app = express()
const server = http.createServer(app)

let isShuttingDown = false

// Reject new requests once shutdown starts so the LB drains this pod
app.use((req, res, next) => {
  if (isShuttingDown) {
    res.set('Connection', 'close')
    return res.status(503).json({ error: 'Server shutting down' })
  }
  next()
})

app.post('/webhooks', async (req, res) => {
  // Your handler logic
  res.status(200).json({ received: true })
})

const PORT = process.env.PORT ?? 3000
server.listen(PORT, () => {
  console.log('Server running on port', PORT)
})

function gracefulShutdown(signal: string) {
  console.log('Received signal', signal)
  isShuttingDown = true
  // close() stops accepting connections and waits for in-flight requests
  server.close((err) => {
    if (err) {
      console.error('Error during shutdown', err)
      process.exit(1)
    }
    console.log('Server closed cleanly')
    process.exit(0)
  })
  setTimeout(() => {
    console.error('Forced shutdown after timeout')
    process.exit(1)
  }, 25000) // Exit before Kubernetes SIGKILL at 30s
}

process.on('SIGTERM', () => gracefulShutdown('SIGTERM'))
process.on('SIGINT', () => gracefulShutdown('SIGINT'))
The middleware checks isShuttingDown and rejects new requests with 503. Existing requests complete normally. server.close() waits for in-flight requests to finish, but the timeout ensures you exit before Kubernetes kills you forcefully. Set the timeout to a few seconds less than your orchestrator's grace period.
One gotcha: if you have long-running background jobs triggered by webhooks, shutting down the HTTP server does not stop those jobs. You need to track active jobs separately and wait for them to complete or move them to a dedicated worker process that can be scaled independently of your HTTP handlers.
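That job tracking can be as simple as a counter plus a drain wait during shutdown. A sketch; the function names (`trackJob`, `drainJobs`) and the 100ms poll interval are my own:

```typescript
let activeJobs = 0

// Wrap any background job so shutdown can account for it.
export async function trackJob<T>(job: () => Promise<T>): Promise<T> {
  activeJobs++
  try {
    return await job()
  } finally {
    activeJobs-- // decrements on success and on failure
  }
}

// Polls until all tracked jobs finish; returns false if the
// deadline passes first (so you can log what was abandoned).
export async function drainJobs(deadlineMs: number): Promise<boolean> {
  const start = Date.now()
  while (activeJobs > 0) {
    if (Date.now() - start > deadlineMs) return false
    await new Promise((resolve) => setTimeout(resolve, 100))
  }
  return true
}
```

Call drainJobs from your shutdown handler after server.close(), with a deadline that still fits inside the orchestrator's grace period.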
Database Lock Contention
Webhooks often update the same rows concurrently. Multiple subscription.updated events for the same subscription arriving within milliseconds. If your handler does SELECT ... FOR UPDATE or uses database-level locking, the second webhook blocks waiting for the first to commit. If the first takes 5 seconds to process and the second times out after 3 seconds, you log a timeout error even though the logic is correct.
The failure mode is sneaky: it only appears when you get multiple webhooks for the same resource in quick succession, which happens during high-activity periods (Black Friday, product launches, mass imports). In staging, your test data is spread across different resources so you never hit the same lock twice.
The fix depends on your consistency requirements. If you can tolerate eventual consistency, use optimistic locking with a version column:
interface Subscription { id: string; status: string; version: number }

async function updateSubscription(id: string, updates: Partial<Subscription>) {
  const current = await db.query(
    'SELECT version FROM subscriptions WHERE id = $1',
    [id]
  )
  if (current.rowCount === 0) {
    throw new Error(`Subscription ${id} not found`)
  }
  const currentVersion = current.rows[0].version

  // The WHERE clause only matches if nobody updated the row since we read it
  const result = await db.query(
    `UPDATE subscriptions
        SET status = $1, version = $2
      WHERE id = $3 AND version = $4
      RETURNING *`,
    [updates.status, currentVersion + 1, id, currentVersion]
  )
  if (result.rowCount === 0) {
    throw new Error('Concurrent update detected, retry')
  }
  return result.rows[0]
}
If a concurrent webhook already updated the row, your update returns zero rows. You can retry with the new version or let the sender retry the webhook. This avoids blocking but requires handling the retry logic.
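The retry side can be a small bounded loop around the version-checked update. A sketch; `retryOnConflict` is my name, and it assumes the conflict surfaces as a thrown error and that the wrapped function re-reads current state on each attempt:

```typescript
// Retries fn up to maxAttempts times; rethrows the last error if all fail.
export async function retryOnConflict<T>(
  fn: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      // fn re-reads the row (and its new version) on the next attempt
    }
  }
  throw lastError
}
```

Keep maxAttempts small: if you lose the race three times in a row, letting the sender's own retry schedule take over is usually cheaper than spinning.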
If you need strict consistency, use advisory locks instead of row-level locks. Advisory locks do not block reads, only other advisory locks on the same key, and the try variant fails fast instead of blocking:
import { PoolClient } from 'pg'

async function withLock<T>(
  lockKey: string,
  fn: () => Promise<T>,
  timeoutMs = 5000
): Promise<T> {
  // Lock and unlock must run on the SAME session: check out one client
  // rather than letting pool.query route them to different connections.
  const client: PoolClient = await pool.connect()
  try {
    // hashtext() maps the string key into the integer advisory-lock keyspace
    const acquired = await client.query(
      'SELECT pg_try_advisory_lock(hashtext($1)) AS locked',
      [lockKey]
    )
    if (!acquired.rows[0].locked) {
      throw new Error('Could not acquire lock')
    }
    try {
      // Note: on timeout fn() keeps running; the race only bounds our wait
      return await Promise.race([
        fn(),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error('Lock timeout')), timeoutMs)
        ),
      ])
    } finally {
      await client.query('SELECT pg_advisory_unlock(hashtext($1))', [lockKey])
    }
  } finally {
    client.release()
  }
}

// Usage
await withLock(subscriptionId, async () => {
  await updateSubscription(subscriptionId, { status: 'active' })
})
Advisory locks are session-scoped in Postgres, so they release automatically if your connection drops. They are also much faster than row locks because they do not involve transaction isolation overhead. Use them when you need to serialize access to a logical resource (a subscription, a user account) without blocking reads.
Memory Leaks That Only Appear Over Days
Your webhook handler runs fine for 24 hours, then memory usage climbs from 200MB to 2GB over the next 12 hours, and the container gets OOM-killed. Restart it and the pattern repeats. This is not a burst traffic issue; it is a slow accumulation of unreleased references.
Common causes: event listeners that never get removed, caches that grow unbounded, closures holding onto large objects. In Node.js, the event emitter pattern is notorious for leaks if you attach listeners in a request handler without cleaning them up.
// Bad: a new listener attaches on every request and is never removed
app.post('/webhooks', (req, res) => {
  const handler = () => {
    console.log('Something happened')
  }
  eventBus.on('data.updated', handler)
  res.status(200).json({ ok: true })
})

// Good: once() removes the listener automatically after it fires.
// Listeners still accumulate if the event never fires, so prefer a
// single long-lived listener registered at startup where possible.
app.post('/webhooks', (req, res) => {
  const handler = () => {
    console.log('Something happened')
  }
  eventBus.once('data.updated', handler)
  res.status(200).json({ ok: true })
})
To diagnose memory leaks in production, take heap snapshots at intervals and compare. Node.js has built-in support via v8.writeHeapSnapshot(). Expose a debug endpoint in your application that triggers a snapshot on demand:
import v8 from 'v8'
import fs from 'fs'
import path from 'path'

// Protect this route (auth middleware, internal network only):
// heap snapshots contain whatever secrets are in memory.
app.get('/debug/heap-snapshot', (req, res) => {
  const filename = `heap-${Date.now()}.heapsnapshot`
  const filepath = path.join('/tmp', filename)
  v8.writeHeapSnapshot(filepath)
  res.download(filepath, filename, (err) => {
    if (err) console.error('Download failed', err)
    fs.unlinkSync(filepath) // always remove the temp file
  })
})
Load the snapshot in Chrome DevTools. Look for objects that keep growing across snapshots. Constructors with tens of thousands of instances that should only have hundreds. Detached DOM nodes if you are using something like jsdom. Strings that are never garbage collected. The comparison view in DevTools makes this obvious.
DNS Resolution Failures Under Load
Your webhook handler calls an external API to look up customer data. Works fine in staging, but in production during peak traffic you get intermittent ENOTFOUND or EAI_AGAIN errors. DNS resolution is failing. The reason: Node.js's default dns.lookup() calls the blocking OS resolver (getaddrinfo) on libuv's threadpool, which has only 4 threads by default. Under high request rates, lookups queue behind each other and behind other threadpool work, and the OS resolver cache can be overwhelmed.
The fix is to cache DNS results with a library like dnscache, raise UV_THREADPOOL_SIZE, or run a local caching DNS server (dnsmasq, systemd-resolved):
import dnscache from 'dnscache'

// Patches dns.lookup() so repeated lookups hit an in-process cache
dnscache({
  enable: true,
  ttl: 300,        // seconds to cache each resolution
  cachesize: 1000, // max cached hostnames
})

// Now all dns.lookup() calls use the cache
Even better: if you control the domain you are calling, use connection pooling with keepAlive so you reuse TCP connections and skip DNS lookups entirely after the first request.
Debugging Across Pods
The hardest infrastructure bugs are the ones that only affect one pod out of five. 80% of webhooks succeed, 20% timeout, and the timeouts all route to the same pod. That pod has a subtle difference: an environment variable is set incorrectly, or it mounted a stale config, or the disk is full because log rotation is not working.
You cannot reproduce it locally because your local environment does not match that specific pod's config. The failure is intermittent from the sender's perspective but consistent from that pod's perspective. The pattern to look for: failures that cluster by pod ID or node IP.
Add pod/node metadata to your logs so you can correlate failures:
console.log(JSON.stringify({
  type: 'webhook_request',
  podName: process.env.POD_NAME,   // injected via the Kubernetes Downward API
  nodeName: process.env.NODE_NAME,
  requestId: req.headers['x-request-id'],
  statusCode: res.statusCode,
}))
Then query your logs: statusCode:500 | stats count by podName. If one pod has 10x more errors than others, exec into it (kubectl exec) and check disk space, memory pressure, and environment variables. Compare against a healthy pod.
When to Give Up on Fixing It
Not all infrastructure issues are worth fixing. If 0.01% of your webhooks fail due to a transient network blip between availability zones, and retries handle it cleanly, do not spend a week trying to eliminate that last 0.01%. The cost of the engineering time exceeds the cost of occasional retries.
The threshold I use: if the failure rate is below 1%, retries succeed on the second attempt, and no customer has complained, it is not worth chasing. Focus on the 5% failures that repeat across retries and block critical workflows. Those are the ones that cost you money and customer trust.