
Your Webhook Tests Pass. Your Production Still Breaks. Time to Inject Some Chaos.

Happy-path webhook tests are table stakes. Chaos testing and failure injection reveal how your system actually behaves when webhooks arrive late, malformed, or not at all.

WebhookVault Team · 8 min read

Your Green Test Suite Is Lying to You

Every webhook integration I've seen has tests. They parse a sample payload, verify the signature, update the database, return 200. All green. Ship it.

Then production happens. Stripe sends a payment_intent.succeeded event 47 seconds after the customer already closed the tab and opened a support ticket. GitHub delivers the same push event three times because your server returned 200 but took 11 seconds to respond, so their retry logic kicked in. Shopify sends an order webhook with a line item format that changed six weeks ago and nobody noticed because the test fixtures are pinned to the old schema.

Happy-path tests are table stakes. They prove your code can handle the textbook scenario. But webhooks don't live in textbooks. They live in distributed systems where networks partition, clocks drift, and providers ship breaking changes on a Friday afternoon.

What Chaos Testing Actually Means for Webhooks

Chaos engineering isn't about randomly destroying things. That's just negligence with a fancy name. Real chaos testing is about forming hypotheses about how your system behaves under specific failure conditions, then verifying those hypotheses by injecting controlled faults.

For webhook systems, the failure modes are surprisingly specific:

  • Delayed delivery (event arrives minutes or hours late)
  • Duplicate delivery (same event, same payload, multiple times)
  • Out-of-order delivery (update event arrives before create event)
  • Partial payload (fields missing or null that your code assumes exist)
  • Malformed signatures (valid payload, invalid HMAC)
  • Provider schema changes (new fields, removed fields, type changes)
  • Slow consumer (your handler takes too long, provider times out and retries)

Most teams test zero of these deliberately. They discover them in production incident reports.

Building a Webhook Fault Injector

You don't need Netflix-grade tooling for this. A proxy that sits between the webhook provider and your handler is enough. In staging, point your webhook URL at the proxy instead of your app directly. The proxy forwards requests to your real handler but can mutate them on the way through.

const express = require('express');
const axios = require('axios');
const crypto = require('crypto');

const app = express();
const TARGET = process.env.WEBHOOK_TARGET || 'http://localhost:3000/webhooks';

// fault configuration, flip these on and off
const faults = {
  delayMs: 0,           // artificial delay before forwarding
  duplicateRate: 0,     // 0-1, chance of sending the same request twice
  dropRate: 0,          // 0-1, chance of silently dropping the request
  corruptSignature: false,
  shuffleOrder: false,  // reordering not implemented in this sketch
  stripFields: [],      // top-level field names to remove from payload
};

app.use(express.raw({ type: '*/*' }));

// reconfigure faults at runtime (handy for scripted test runs)
app.put('/config', (req, res) => {
  Object.assign(faults, JSON.parse(req.body.toString()));
  res.json(faults);
});

app.post('/inject', async (req, res) => {
  // should we drop it entirely?
  if (Math.random() < faults.dropRate) {
    console.log('[FAULT] Dropped webhook');
    return res.status(200).send('dropped');
  }

  let body = req.body;
  let headers = { ...req.headers };

  // corrupt the signature if configured
  if (faults.corruptSignature && headers['x-hub-signature-256']) {
    headers['x-hub-signature-256'] = 'sha256=' + crypto.randomBytes(32).toString('hex');
    console.log('[FAULT] Corrupted signature');
  }

  // strip fields from JSON payloads
  if (faults.stripFields.length > 0) {
    try {
      const parsed = JSON.parse(body.toString());
      faults.stripFields.forEach(field => delete parsed[field]);
      body = Buffer.from(JSON.stringify(parsed));
      console.log('[FAULT] Stripped fields:', faults.stripFields);
    } catch (e) {
      // not JSON, skip
    }
  }

  const forward = async () => {
    // forward the mutated headers, not req.headers, so the corrupted
    // signature actually reaches the handler
    const fwdHeaders = { 'content-type': headers['content-type'] };
    if (headers['x-hub-signature-256']) {
      fwdHeaders['x-hub-signature-256'] = headers['x-hub-signature-256'];
    }
    try {
      const resp = await axios.post(TARGET, body, {
        headers: fwdHeaders,
        timeout: 30000,
      });
      return resp.status;
    } catch (err) {
      return err.response?.status || 'network_error';
    }
  };

  // delay if configured
  if (faults.delayMs > 0) {
    console.log('[FAULT] Delaying ' + faults.delayMs + 'ms');
    await new Promise(r => setTimeout(r, faults.delayMs));
  }

  const status = await forward();

  // duplicate delivery
  if (Math.random() < faults.duplicateRate) {
    console.log('[FAULT] Sending duplicate');
    await forward();
  }

  res.status(200).json({ forwarded: true, targetStatus: status });
});

app.listen(4000);

Crude? Sure. But I've caught more bugs with a 60-line proxy like this than with elaborate integration test suites. The point isn't polish. It's making failure visible before your customers do.

The Duplicate Delivery Test That Breaks Everything

Set duplicateRate: 1.0 and watch what happens. Every single webhook gets delivered twice. If your system handles this gracefully, congratulations, you're ahead of roughly 80% of webhook consumers I've audited.

Most don't. The classic failure: a payment webhook creates a transaction record, and the duplicate creates another one. Customer gets charged once, your database says twice, accounting is off, and the bug sits there for weeks because the amounts are small enough that nobody notices until reconciliation.

The fix is idempotency, and everyone knows that in theory. But the implementation details matter enormously. Do you deduplicate on the event ID from the provider? On a hash of the payload? On some combination of event type and resource ID? Each choice has tradeoffs.

// simple but effective: deduplicate on provider event ID
async function handleWebhook(event) {
  const eventId = event.headers['x-webhook-id'] || event.body.id;

  // atomic check-and-insert
  const result = await db.query(
    'INSERT INTO processed_events (event_id, received_at) VALUES ($1, NOW()) ON CONFLICT (event_id) DO NOTHING RETURNING event_id',
    [eventId]
  );

  if (result.rowCount === 0) {
    // already processed, bail out
    console.log('Duplicate event, skipping:', eventId);
    return { status: 200, body: 'duplicate' };
  }

  // process normally
  return processEvent(event.body);
}

That ON CONFLICT DO NOTHING is doing heavy lifting. Without it, you're racing against yourself. And yes, you need a unique index on event_id, not just a regular index. Seen that mistake more than once.
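Worth pinning down what that table looks like. A minimal schema sketch (names are illustrative; the PRIMARY KEY is what gives you the unique index that ON CONFLICT depends on):

```sql
-- processed_events: one row per webhook event ever accepted.
-- PRIMARY KEY implies the unique index ON CONFLICT (event_id) needs.
CREATE TABLE processed_events (
  event_id    text        PRIMARY KEY,
  received_at timestamptz NOT NULL DEFAULT now()
);
```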

Delayed Delivery: When "Real-Time" Takes 45 Minutes

Set your fault injector to delayMs: 300000. Five minutes. That's not even an extreme scenario; Stripe's documentation openly states that webhooks may be delayed during high-volume periods. I've personally seen GitHub webhook deliveries arrive over 40 minutes late during their incident in late 2024.

What does your system do when a subscription.canceled event arrives 30 minutes after the customer already re-subscribed? If your handler blindly processes events in arrival order, you just canceled an active subscription. The customer is furious and your support team has no idea what happened because the logs show the cancellation was "processed successfully."

The pattern that saves you: always check current state before applying an event.

async function handleSubscriptionCanceled(event) {
  const sub = await db.subscriptions.findOne({ providerId: event.data.id });

  if (!sub) {
    // subscription doesn't exist locally yet, queue for retry
    await retryQueue.add(event, { delay: 5000 });
    return;
  }

  // compare timestamps, not arrival order
  const eventTimestamp = new Date(event.data.canceled_at);
  const lastUpdate = new Date(sub.lastProviderUpdate);

  if (eventTimestamp <= lastUpdate) {
    console.log('Stale event, current state is newer');
    return { status: 200, body: 'stale' };
  }

  await db.subscriptions.update(
    { providerId: event.data.id },
    { status: 'canceled', lastProviderUpdate: eventTimestamp }
  );
}

Timestamp comparison. Simple concept. But you'd be amazed how many webhook handlers just blindly apply whatever event shows up, trusting that arrival order matches reality. It doesn't.

The Field That Disappeared

Configure your fault injector to strip a field. Start with something your code definitely uses, like amount on a payment event or email on a customer event. Does your handler crash? Return 500? Silently insert null into your database?

All three are bad, but silently inserting null is the worst because you won't know until someone queries that data weeks later.

Webhook providers change their payloads. Sometimes they document it. Sometimes they add a field and it's fine. Sometimes they rename shipping_address to shipping and suddenly your destructuring assignment pulls undefined and your downstream code quietly produces garbage.

Validate at the boundary. Not with a full JSON Schema validator necessarily; even basic checks help:

function validatePaymentEvent(payload) {
  const required = ['id', 'amount', 'currency', 'customer_id'];
  const missing = required.filter(f => payload[f] === undefined || payload[f] === null);

  if (missing.length > 0) {
    // log the full payload for debugging, reject the event
    logger.error('Payment webhook missing required fields', {
      missing,
      eventId: payload.id,
      receivedFields: Object.keys(payload),
    });

    // return 400 so the provider retries
    // maybe the payload will be complete next time (unlikely, but worth trying)
    throw new WebhookValidationError('Missing fields: ' + missing.join(', '));
  }

  return payload;
}

Return 400, not 200, when validation fails. A 200 tells the provider "got it, all good, stop retrying." A 400 or 500 triggers their retry logic, which buys you time to investigate and deploy a fix if the schema actually changed.
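The status mapping itself deserves to be explicit rather than buried in a route handler's catch block. A self-contained sketch, with a hypothetical error class mirroring the WebhookValidationError thrown above:

```javascript
// Hypothetical error type distinguishing validation failures from bugs.
class WebhookValidationError extends Error {}

function validateRequired(payload, required) {
  const missing = required.filter(f => payload[f] === undefined || payload[f] === null);
  if (missing.length > 0) {
    throw new WebhookValidationError('Missing fields: ' + missing.join(', '));
  }
  return payload;
}

// Decide what the provider sees: 400 for validation failures and 500
// for unexpected errors, both of which trigger retries. Never a
// blanket 200 that tells the provider to stop trying.
function webhookResponse(payload) {
  try {
    validateRequired(payload, ['id', 'amount', 'currency', 'customer_id']);
    return { status: 200, body: 'ok' };
  } catch (err) {
    if (err instanceof WebhookValidationError) {
      return { status: 400, body: err.message };
    }
    return { status: 500, body: 'internal error' };
  }
}
```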

Slow Consumer: The Timeout You Forgot About

This one is sneaky. Your handler works. It processes the event, updates the database, fires off some downstream API calls, sends a notification. Takes about 8 seconds total. Perfectly functional.

Except Stripe times out webhook deliveries at 20 seconds. So your 8-second handler is fine. Until you add one more downstream call and it goes to 12 seconds. Still fine. Then one of those downstream APIs has a slow day and your handler takes 25 seconds. Stripe marks it as failed. Retries. Your handler processes it again from scratch, takes another 25 seconds because the downstream API is still slow. Stripe retries again.

Now you've got three instances of the same event being processed simultaneously. None of them are completing within the timeout. Your database has partial writes from all three attempts.

The fix is old but reliable: acknowledge fast, process async.

app.post('/webhooks/stripe', async (req, res) => {
  // validate signature first, this should be sub-millisecond
  const isValid = verifyStripeSignature(req);
  if (!isValid) return res.status(401).send('invalid signature');

  // persist the raw event immediately
  await db.webhookEvents.insert({
    provider: 'stripe',
    eventId: req.body.id,
    payload: req.body,
    status: 'pending',
    receivedAt: new Date(),
  });

  // tell Stripe we got it
  res.status(200).send('queued');

  // a background worker picks up pending events
  // and processes them with proper retry logic, timeouts, etc.
});

Respond in under a second. Always. The actual processing happens in a worker that can take as long as it needs, retry on its own schedule, and handle failures without the provider's timeout clock ticking.
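What that worker looks like depends heavily on your stack, but the core loop is simple. A minimal in-memory sketch, where the queue stands in for the webhookEvents table and the retry cap is illustrative:

```javascript
// In-memory stand-in for the pending-events table.
function makeQueue() {
  const events = [];
  return {
    add(eventId, payload) {
      events.push({ eventId, payload, status: 'pending', attempts: 0 });
    },
    claim() {
      return events.find(e => e.status === 'pending');
    },
    events,
  };
}

// Process at most one pending event; returns false when nothing is left.
async function processOnce(queue, handler) {
  const event = queue.claim();
  if (!event) return false;
  event.attempts += 1;
  try {
    await handler(event.payload);
    event.status = 'done';
  } catch (err) {
    // retry on our own schedule, not the provider's timeout clock
    event.status = event.attempts < 3 ? 'pending' : 'failed';
  }
  return true;
}
```

In a real database-backed worker you'd claim rows atomically (for example with SELECT ... FOR UPDATE SKIP LOCKED, or a proper job queue) so concurrent workers don't grab the same event.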

Running Chaos Tests in CI

You don't need to run fault injection against production. Please don't, actually, unless you have a very mature chaos engineering practice. But you can absolutely run these scenarios in CI.

Spin up your app, spin up the fault injector proxy, point simulated webhook deliveries through the proxy with various fault configurations enabled. Assert that your system handles each scenario correctly. Duplicates don't create duplicate records. Delayed events don't overwrite newer state. Missing fields produce clear error logs, not silent data corruption.

# in your CI pipeline
services:
  app:
    build: .
    ports: ["3000:3000"]
  fault-proxy:
    build: ./test/fault-proxy
    environment:
      WEBHOOK_TARGET: http://app:3000/webhooks
    ports: ["4000:4000"]

# test script
echo "Testing duplicate delivery..."
FAULTS='{"duplicateRate":1}'
curl -X PUT http://localhost:4000/config -d "$FAULTS"
curl -X POST http://localhost:4000/inject \
  -H 'content-type: application/json' \
  -d @test/fixtures/payment.json
sleep 2
RECORDS=$(curl -s "http://localhost:3000/api/payments?eventId=evt_test123" | jq length)
if [ "$RECORDS" -ne 1 ]; then
  echo "FAIL: duplicate delivery created $RECORDS records"
  exit 1
fi

Not glamorous. Not a fancy testing framework. But it catches real bugs that unit tests never will.

What You Should Actually Do Tomorrow

You probably won't build a full chaos testing pipeline this week. That's fine. But you can do three things that take less than an hour each and will save you from the most common webhook failures:

First, add idempotency checks to every webhook handler. Deduplicate on the provider's event ID. Use an atomic database operation, not check-then-insert.

Second, add timestamp comparison before applying state changes. Never trust arrival order. Compare the event's timestamp against your last known state.

Third, validate required fields at the boundary and return a non-200 status code when validation fails. Log the full payload so you can debug later.

These three changes handle the vast majority of webhook failures that chaos testing would reveal. The fault injector is great for proving they work, but the defensive code is what actually saves you at 2 AM.