On-Call Culture Is Broken. Here Is How to Fix It.
You shipped a feature on Friday at 4pm. At 3am Saturday, your phone buzzes. The payment service is throwing 500 errors. You spend 90 minutes debugging a database connection pool exhaustion, push a fix through an emergency deployment process that skips code review, and go back to sleep knowing the fix is fragile but functional.
On Monday, nobody asks about it. You mention the incident in standup and get a polite acknowledgment. There’s no post-incident review. No action items to prevent recurrence. The same class of bug — connection pool exhaustion under unexpected load — has happened three times this quarter.
This is on-call at most companies. And it’s fundamentally broken.
Why On-Call Fails
The core failure of most on-call programs isn’t that engineers get paged. Pages are inevitable in any system with meaningful complexity and real traffic. The core failure is that nobody learns from the page.
The Treadmill Problem
The traditional model treats on-call as a tax — an unpleasant obligation that comes with the job, like filing expense reports or attending all-hands meetings. Senior engineers delegate it to junior engineers as quickly as possible. Junior engineers endure it until they gain enough seniority to avoid it.
But on-call pages are signal. They’re your system telling you where it’s fragile, where it’s under-provisioned, where the assumptions you made during design don’t hold under production conditions. Ignoring that signal doesn’t make the fragility disappear; it makes it accumulate until something catastrophic fails during business hours in front of customers.
Every unexamined page is a missed opportunity to make the system more reliable. Every recurring page is evidence of an organizational failure to learn.
The Fairness Problem
In most organizations, on-call burden falls disproportionately on a small group of 3-5 senior engineers who are the only ones who understand the critical-path systems well enough to diagnose and fix production issues at 3 AM.
These engineers are simultaneously the most valuable contributors to new feature development and the most burdened by operational work. They burn out at rates far exceeding the team average. When they leave — and they always do, because chronic sleep disruption and unsustainable workload create inevitable attrition — they take institutional knowledge with them. The remaining on-call burden concentrates on an even smaller group, accelerating the cycle.
The Sleep Deprivation Problem
The research on cognitive performance under sleep deprivation is conclusive and alarming. After 17-19 hours of sustained wakefulness (the state of an engineer who woke at 8 AM and is debugging a page at 1-3 AM the following night), cognitive performance degrades to levels comparable to a blood alcohol content of 0.05%. After 24 hours without sleep, performance degrades to the equivalent of 0.10%: legally drunk.
An engineer who fixed a production incident at 3 AM and comes to work at 9 AM is making decisions with impaired cognition. They’re writing code, reviewing pull requests, and making architectural choices at the equivalent of intoxicated-level judgment. And nobody acknowledges this, because the cultural expectation is “you handle the page, you come to work the next day, you’re fine.”
The Better Model
1. Follow-the-Sun When Possible
If your organization has engineers in multiple time zones, nobody should ever be paged at 3 AM. Structure your on-call rotation so that the primary responder is always in their local working hours.
This isn't a luxury reserved for large organizations; it's a straightforward cost calculation that favors any organization with geographic distribution. An engineer solving a production incident at 3 AM takes roughly three times as long on average, makes worse diagnostic decisions, implements more fragile fixes, and is less productive the following day than the same engineer solving the same incident at 11 AM. The productivity hit of 3 AM incident response typically exceeds the coordination overhead of follow-the-sun handoffs.
For organizations without geographic distribution, consider on-call hours that limit pages to 7 AM-11 PM, with automated mitigation (auto-scaling, circuit breakers, failover) handling the overnight hours. The alerts still fire, but they’re queued for morning investigation rather than waking someone up.
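The queue-overnight idea can be expressed as a small routing rule: customer-impacting alerts always page, everything else waits for morning if it fires inside the quiet window. The severity labels, quiet hours, and team time zone below are assumptions for illustration, not the configuration of any particular alerting product.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

QUIET_START, QUIET_END = time(23, 0), time(7, 0)  # 11 PM - 7 AM local
TEAM_TZ = ZoneInfo("America/New_York")            # assumed team time zone

def route_alert(severity: str, fired_at: datetime) -> str:
    """Decide whether an alert pages immediately or queues for morning triage."""
    local = fired_at.astimezone(TEAM_TZ).time()
    overnight = local >= QUIET_START or local < QUIET_END
    if severity == "critical":
        return "page"   # customer-impacting: always wake someone
    if overnight:
        return "queue"  # non-critical: hold for morning investigation
    return "page"
```

The alert still fires and is recorded either way; the only thing the quiet window changes is whether a human is woken up for it.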
2. You Build It, You Run It (With Guardrails)
The team that ships the code should be the team that responds when it breaks. This creates a direct feedback loop between development practices and operational consequences.
When deploying on Friday at 4 PM causes a 3 AM weekend page, the team naturally adjusts: they deploy earlier in the week, ship smaller changes, improve test coverage for the specific failure mode that paged them, and add monitoring that would detect the issue before customers are affected.
This feedback loop is the single most effective mechanism for improving system reliability — more effective than any monitoring tool, any process document, or any management directive.
But guardrails are essential to prevent the model from becoming exploitative:
- No engineer on-call for more than one week per month. The cognitive burden of being “ready to respond” — even without actual pages — is measurable. On-call weeks should be the exception, not the default state.
- No single point of failure. No system should have a bus factor of 1 for on-call response. If only one person can debug the payment service, that’s an organizational risk that must be addressed through knowledge sharing, documentation, and cross-training before it becomes a crisis.
- Escalation paths that work. The primary on-call engineer should always have a clear, tested escalation path to senior engineers or specialized teams for issues beyond their experience level.
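One way to make an escalation path concrete and testable is to encode it as data: an ordered list of targets with acknowledgment timeouts. The policy below is a hypothetical example for the payment service mentioned earlier; the target names and timeout values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str       # person or team to notify
    timeout_min: int  # minutes without acknowledgment before escalating

# Hypothetical policy: primary -> secondary -> team leads -> manager.
PAYMENTS_POLICY = [
    EscalationStep("primary-oncall", 10),
    EscalationStep("secondary-oncall", 10),
    EscalationStep("payments-team-leads", 15),
    EscalationStep("engineering-manager", 0),  # last resort, no further hop
]

def next_target(policy: list[EscalationStep], minutes_unacked: int) -> str:
    """Who should be paged, given how long the alert has gone unacknowledged."""
    elapsed = 0
    for step in policy[:-1]:
        elapsed += step.timeout_min
        if minutes_unacked < elapsed:
            return step.target
    return policy[-1].target
```

Because the policy is plain data, "tested escalation path" can mean an actual unit test: assert that an alert unacknowledged for 40 minutes reaches the manager, rather than hoping the chain works during a real incident.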
3. The Postmortem Is Not Optional
Every page that wakes someone up — every incident that required human intervention outside of working hours — should produce a postmortem answering three questions:
- What happened? Not “the service crashed” — the specific chain of events from trigger to detection to response to resolution.
- Why didn’t we catch it before it paged? This is the most important question. If monitoring didn’t detect it, why not? If it was detected but not actionable, why not? If the failure mode was known but not addressed, what prevented the fix?
- What are we changing so this specific failure can’t happen again? Not “we’ll be more careful” — a specific, assignable, trackable action item. A new alert. A configuration change. A code fix. A capacity increase.
Not every postmortem needs to be a formal, hour-long meeting with stakeholders. Many can be a 15-minute written document. But the practice of examining every incident and committing to at least one concrete preventive action is what separates organizations with improving reliability from organizations with recurring nightmares.
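The three questions above map naturally onto a structured record, which turns "at least one concrete preventive action" into an enforceable check rather than a convention. A minimal sketch, with field names of my own invention rather than any standard schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str  # a specific change, not "we'll be more careful"
    owner: str        # assignable
    due: date         # trackable

@dataclass
class Postmortem:
    incident_id: str
    what_happened: str           # trigger -> detection -> response -> resolution
    why_not_caught_earlier: str  # the most important question
    actions: list[ActionItem] = field(default_factory=list)

    def is_complete(self) -> bool:
        # All three questions answered, and at least one concrete action committed.
        return all([self.what_happened, self.why_not_caught_earlier, self.actions])
```

A CI check or review bot could refuse to close an incident whose postmortem fails `is_complete()`, which is one way to make the practice stick.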
4. Compensate On-Call Explicitly
On-call is work. It constrains your personal life — you can’t travel freely, you can’t drink alcohol, you can’t be in locations without cell service. It causes sleep disruption that has measurable health consequences. It requires continuous technical readiness.
If your organization treats on-call as an uncompensated expectation of the engineering role, you’re telling engineers that their off-hours availability has zero financial value. This message is received clearly, even if never stated explicitly.
Compensation models that work effectively:
- Per-shift pay: $500-$1,500 per week of primary on-call responsibility, adjusted for historical page frequency and severity
- Comp time: Guaranteed day off after any night or weekend page, automatically scheduled and non-negotiable
- Incident bonuses: Meaningful bonus ($200-$500) for handling high-severity incidents, regardless of outcome
- Page-free incentives: Small bonus for weeks where the on-call engineer isn’t paged, incentivizing the team to build reliable systems
5. Track and Improve Systematically
Treat on-call quality as a measurable engineering metric:
- Page volume per engineer per week — trending this over time reveals whether reliability is improving
- Mean time to acknowledgment — how quickly on-call engineers respond
- Mean time to resolution — how quickly incidents are resolved
- Repeat page rate — what percentage of this month’s pages are for previously-seen problems
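Given a month's page log, all four metrics reduce to a few lines of aggregation. The sketch below uses a hypothetical log of (failure-mode fingerprint, minutes to acknowledge, minutes to resolve) tuples, and approximates the repeat page rate as the share of pages beyond the first occurrence of each fingerprint within the window.

```python
from collections import Counter

# Hypothetical one-month page log: (fingerprint, min_to_ack, min_to_resolve).
pages = [
    ("db-pool-exhausted", 4, 92),
    ("disk-full-logs",    2, 15),
    ("db-pool-exhausted", 6, 45),
    ("cert-expiry",       3, 30),
    ("db-pool-exhausted", 5, 60),
]

counts = Counter(fp for fp, _, _ in pages)
repeats = sum(n - 1 for n in counts.values())      # pages beyond each first occurrence
repeat_rate = repeats / len(pages)                 # 2 of 5 pages are repeats: 40%
mtta = sum(ack for _, ack, _ in pages) / len(pages)
mttr = sum(res for _, _, res in pages) / len(pages)
```

In this toy log the repeat rate is 40%, well above the 30% threshold discussed below: three pages for the same connection-pool fingerprint is exactly the recurring-nightmare pattern the metric is designed to surface.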
The Leading Indicator
Want to know if your on-call program is healthy? Track one metric above all others: repeat page rate. What percentage of pages this month are for problems that have paged before?
If that number is above 30%, your organization isn’t learning from incidents. The same failures are happening repeatedly, the same engineers are being woken up for the same reasons, and the postmortem process (if it exists at all) isn’t producing effective remediation.
A healthy on-call program drives the repeat page rate below 10% — meaning that 90%+ of pages are truly novel issues, not recurring failures that were identified but never fixed.
The Garnet Grid perspective: Operational reliability isn’t just tooling — it’s culture, process, and investment. Our DevOps maturity assessment covers everything from on-call practices to deployment safety to incident learning. Explore the assessment →