You Launched v2. Now What?
Shipping a new webhook version feels great. You fixed the weird naming conventions from 2019, you cleaned up the nested objects, maybe you even added proper timestamps instead of that epoch-milliseconds-as-a-string thing someone committed on a Friday afternoon three years ago. v2 is clean. v2 is good.
And then v1 just... sits there. Forever.
Because nobody wants to be the person who breaks a paying customer's integration by flipping the switch. So v1 stays running. Then v3 ships. v1 is still running. You're now maintaining three separate payload formats, three sets of serialization logic, three codepaths that all need to stay in sync whenever you add a new event type. I've watched teams burn 30% of their webhook engineering time just keeping old versions alive.
Why Most Deprecation Timelines Are Fiction
The standard playbook says: announce deprecation, give consumers 6 months, then shut it down. Simple.
Except it never works that way. Six months pass. You check the logs. 40% of your consumers are still on v1. Some of them are enterprise accounts paying you serious money. Their integration was built by a contractor who left two years ago and nobody internally knows how it works. They're not migrating in the next two weeks because you sent a deprecation header.
Stripe got this right early on. They don't just announce deprecation dates. They actively reach out to high-volume consumers months before cutoff. They provide migration guides specific to the changes that affect each consumer. And they still extend deadlines when the numbers don't look good enough. If Stripe, with their resources, can't just flip the switch on a date, you probably can't either.
The Usage Cliff Strategy
Forget arbitrary dates. Track the actual traffic.
Build a dashboard (or even just a weekly cron job that dumps numbers into Slack) showing version distribution across your consumer base. You want two numbers per version: unique consumers, and total request volume. These tell different stories. A version might serve only 3 consumers but those 3 might account for 60% of your total webhook traffic.
// quick and dirty version tracking middleware
function trackWebhookVersion(req, res, next) {
  const version = req.headers['x-webhook-version'] || 'v1';
  const consumerId = req.params.consumerId;
  // fire and forget, don't block the delivery
  metrics.increment('webhook.delivery', {
    version,
    consumer: consumerId,
    event_type: req.body.type
  });
  next();
}

// then weekly, something like:
// SELECT version, COUNT(DISTINCT consumer_id), COUNT(*)
// FROM webhook_deliveries
// WHERE delivered_at > NOW() - INTERVAL '7 days'
// GROUP BY version
When a version drops below 5% of unique consumers, that's your signal to start active outreach. Not a blog post announcement. Actual emails to the remaining consumers, with specifics about what changes affect them.
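The 5% check itself is easy to automate from that weekly rollup. A minimal sketch, assuming each rollup row has the shape `{ version, uniqueConsumers, deliveries }` (names are illustrative, not a real schema):

```javascript
// Flag versions whose share of unique consumers has fallen below the threshold.
// Rows are assumed to come from the weekly GROUP BY query above.
function sunsetCandidates(rows, threshold = 0.05) {
  const totalConsumers = rows.reduce((sum, r) => sum + r.uniqueConsumers, 0);
  return rows
    .filter((r) => r.uniqueConsumers / totalConsumers < threshold)
    .map((r) => r.version);
}
```

Pipe the result into the same Slack job and the outreach list writes itself.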
Parallel Versions Are More Expensive Than You Think
The hidden cost isn't the compute. Running two serializers in parallel is cheap.
The real cost is cognitive. Every new event type needs payload definitions for every active version. Every bug fix needs testing across versions. Every new team member needs to understand why user.created has three different shapes depending on which version a consumer registered with. Documentation multiplies. Test suites multiply. The surface area for subtle inconsistencies grows with every version you keep alive.
A fintech startup I worked with ran four concurrent webhook versions for almost a year. They had exactly two backend engineers on the integrations team. Those two spent more time on version compatibility than on actual feature work. When they finally committed to an aggressive sunset plan, feature velocity doubled within a quarter.
Two versions in parallel is manageable. Three is painful. Four is a staffing problem disguised as a technical one.
Graduated Degradation: The Gentler Kill Switch
Hard cutoffs scare everyone, including your own team. So don't do hard cutoffs.
Instead, degrade the old version gradually. Start by reducing the retry count. Consumers on the current version get 5 retries with exponential backoff. v1 consumers get 3. Then 2. Then 1. You're not breaking anyone's integration, you're just making the old version less reliable. Consumers who care about delivery guarantees will migrate. The ones who don't care, well, they probably aren't checking their webhook endpoints anyway.
function getRetryPolicy(version) {
  const policies = {
    'v3': { maxRetries: 5, backoffBase: 2, maxDelay: 3600 },
    'v2': { maxRetries: 3, backoffBase: 2, maxDelay: 1800 },
    // sunset candidate: reduced reliability
    'v1': { maxRetries: 1, backoffBase: 1, maxDelay: 300 },
  };
  return policies[version] || policies['v1'];
}
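The policy table leaves the actual delay math implicit. One way to turn `backoffBase` and `maxDelay` into a wait time, a sketch assuming delays are in seconds and grow exponentially per attempt:

```javascript
// Exponential backoff: base^attempt seconds, capped at the policy's maxDelay.
// The policy object shape matches getRetryPolicy above.
function retryDelaySeconds(policy, attempt) {
  const delay = Math.pow(policy.backoffBase, attempt);
  return Math.min(delay, policy.maxDelay);
}
```

Note that v1's `backoffBase` of 1 means it retries on a flat 1-second interval, and with `maxRetries: 1` it barely retries at all, which is the point.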
Next step: add latency. v1 deliveries get queued with lower priority. They still arrive, but maybe 30 seconds later instead of sub-second. This is surprisingly effective because the consumers who need real-time delivery are exactly the ones with active engineering teams who can handle a migration.
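The latency step can be as simple as a per-version delay at enqueue time. A sketch, assuming your job queue exposes some form of delayed delivery (the `enqueue(job, { delaySeconds })` interface here is hypothetical, but most queues, BullMQ, SQS, and the like, support an equivalent option):

```javascript
// Per-version delivery delay. Deprecated versions get queued with a
// deliberate delay; unknown versions are treated like the oldest one.
const DELIVERY_DELAY_SECONDS = {
  v3: 0,
  v2: 0,
  v1: 30, // sunset candidate: still delivered, just slower
};

function enqueueDelivery(queue, event, consumer) {
  const delaySeconds = DELIVERY_DELAY_SECONDS[consumer.version] ?? 30;
  return queue.enqueue({ event, consumer }, { delaySeconds });
}
```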
Some teams add an X-Sunset-Date header and an X-Deprecated: true header to every v1 delivery. Most consumers never look at headers, but the ones running automated monitoring will flag it.
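Stamping those headers onto outgoing deliveries is a few lines. A sketch, the header names follow the convention above, and the date is purely illustrative:

```javascript
// Dates per sunset candidate; the actual date is whatever you announced.
const SUNSET_DATES = { v1: '2025-06-30' };

// Mutates and returns the outgoing headers object so it can be spread
// straight into whatever HTTP client you deliver with.
function withSunsetHeaders(version, headers = {}) {
  if (SUNSET_DATES[version]) {
    headers['X-Deprecated'] = 'true';
    headers['X-Sunset-Date'] = SUNSET_DATES[version];
  }
  return headers;
}
```

There is also a standardized Sunset header (RFC 8594) if you'd rather not mint custom X- names.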
The "Shadow Dual-Send" Migration Path
Best pattern I've seen for high-stakes migrations: dual-send during the transition window.
When a consumer opts into v2, don't immediately stop v1. Send both versions to their endpoint for a configurable overlap period. The v2 payload includes a _version field. The consumer can validate that their v2 handler works correctly while still falling back to v1 parsing if something breaks.
async function deliverWebhook(event, consumer) {
  const targetVersion = consumer.preferredVersion;
  if (consumer.migrationMode === 'dual') {
    // send both, let consumer validate
    await Promise.allSettled([
      deliver(event, consumer, 'v1'),
      deliver(event, consumer, targetVersion),
    ]);
    return;
  }
  // normal single delivery
  await deliver(event, consumer, targetVersion);
}
// consumer side:
// try parsing v2 first
// if v2 parsing fails, fall back to v1
// log the failure so you know what broke
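The consumer-side fallback sketched in those comments looks something like this, where `parseV2`, `parseV1`, and `log` are stand-ins for whatever parsers and logger the consumer actually has:

```javascript
// Try the new format first; on failure, log and fall back to the old one.
// The log line is the important part: it tells you what broke before
// the v1 fallback goes away.
function handleDelivery(payload, { parseV2, parseV1, log }) {
  try {
    return parseV2(payload);
  } catch (err) {
    log(`v2 parse failed: ${err.message}`);
    return parseV1(payload);
  }
}
```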
GitHub does something similar with their webhook version transitions. They let you configure a test endpoint that receives the new format while your production endpoint keeps getting the old one. Smart, because it means zero risk for the consumer during evaluation.
Don't Version What You Can Extend
Quick aside that saves a lot of pain: most "breaking changes" don't actually need a version bump if you design your payloads right from the start. Adding new fields isn't breaking. Adding new event types isn't breaking. Adding new values to an enum, if your consumers are written defensively, isn't breaking.
What's breaking: removing fields, renaming fields, changing types (string to int), restructuring nested objects, changing the meaning of existing values.
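The "written defensively" caveat on enums matters. A consumer that switches on a status field should treat unrecognized values as a catch-all instead of crashing, so the producer can add values without a version bump. A sketch, the status values are illustrative:

```javascript
// Known statuses at the time the consumer was written.
const KNOWN_STATUSES = new Set(['pending', 'completed', 'failed']);

// Map anything unrecognized to a catch-all instead of throwing,
// so a new producer-side enum value never breaks this consumer.
function normalizeStatus(status) {
  return KNOWN_STATUSES.has(status) ? status : 'unknown';
}
```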
If you find yourself versioning more than once every 18-24 months, your initial payload design probably needs work. Shopify has run their webhook format for years with additive changes only. They version rarely because they got the envelope structure right early.
The Actual Shutdown Checklist
When you've done the graduated degradation, sent the emails, run the dual-send period, and your dashboard shows single-digit consumers remaining on the old version, here's the shutdown sequence that actually works:
First, flip the old version to "log only" mode. Payloads still get serialized and logged, but they go to a dead letter queue instead of the consumer's endpoint. Run this for a week. If nobody complains, nobody is actually processing the deliveries anymore.
Second, if someone does complain, you have the payloads in the dead letter queue. You can replay them to the consumer's endpoint while you sort out why they didn't migrate. No data loss.
Third, after the quiet period, remove the serialization code but keep the version's schema definitions for another 90 days. Someone will come back with "we were processing v1 payloads from a backup queue and now they look different." Having the schema lets you help them without guessing.
// version lifecycle states
const VERSION_STATES = {
  ACTIVE: 'active',         // full support, all retries
  DEPRECATED: 'deprecated', // reduced retries, sunset headers
  DRAINING: 'draining',     // log-only, dead letter queue
  ARCHIVED: 'archived',     // schema kept, no serialization
  REMOVED: 'removed',       // gone for good
};

// minimum time in each state before transitioning
const MIN_STATE_DURATION_DAYS = {
  DEPRECATED: 90,
  DRAINING: 14,
  ARCHIVED: 90,
};
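Wiring those states into the delivery path is the step that makes "log only" real. A sketch, where `deliver` and `deadLetterQueue` are stand-ins for your delivery and DLQ layers:

```javascript
// Route a delivery based on the version's lifecycle state.
// Active and deprecated versions still hit the endpoint (deprecated
// just gets fewer retries upstream); draining parks the payload in
// the DLQ for possible replay; anything later refuses outright.
async function routeDelivery(event, consumer, versionState, { deliver, deadLetterQueue }) {
  switch (versionState) {
    case 'active':
    case 'deprecated':
      return deliver(event, consumer);
    case 'draining':
      return deadLetterQueue.push({ event, consumer });
    default:
      throw new Error(`version state ${versionState} no longer accepts deliveries`);
  }
}
```

The DLQ entries are what make step two work: if a straggler complains during the quiet period, you replay from there instead of losing data.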
Stop Postponing the Hard Conversation
Most teams I've talked to know they need to sunset old versions. They've known for months. The blocker is never technical. The code changes are straightforward. The blocker is that someone needs to have an uncomfortable conversation with a big customer about migrating their integration.
Have it sooner. The longer a deprecated version runs, the harder it gets to kill. New consumers accidentally register on it (because the docs still reference it, or because their SDK is outdated). Internal services start depending on it. What was supposed to be temporary becomes load-bearing.
Set a version limit. "We support the current version and one previous version. That's it." Put it in your API terms. Make it a policy, not a negotiation. Twilio does this well, and their ecosystem is massive. If they can hold the line, so can you.