Operational excellence audit

On-call had become an attrition driver: in the past six months, two senior engineers cited it as their primary reason for leaving. Deploy frequency had fallen by more than half over eighteen months, and the same incidents kept recurring. Leadership asked us to find out why and chart a credible path back.

01 / Subject & scope

Where we looked and where we did not.

The subject is a consumer marketplace running roughly twenty-five backend services, maintained by eighty engineers across nine product and platform teams. Leadership commissioned this audit after the third consecutive quarter of declining deploy frequency, with the on-call rotation already driving attrition.

We audited the on-call rotations, the incident process, the alerting stack, and the deployment pipeline. We did not audit application code quality or test coverage except where it surfaced as a direct driver of operational load.

02 / Methodology

How we gathered evidence.

Over four weeks we:

Reviewed ninety days of paging history across every rotation.
Read every incident post-mortem from the past six months — forty-seven documents.
Sat in on three incident reviews.
Shadowed two on-call shifts: one weekday, one weekend.
Interviewed eighteen engineers, four managers, two SREs, and the head of platform.
Audited the alerting configuration in Datadog and PagerDuty.
Reviewed the deploy pipeline, rollback paths, and the last quarter's deploys end-to-end.

03 / Findings

What we observed.

A. The alerting stack pages humans for problems the system should auto-recover.

62% of pages in the audit window resolved without human intervention or with nothing more than an acknowledgement from the responder. The most common page, "elevated 5xx rate" on a service with a known transient dependency, fired forty-three times in ninety days. The system treats warnings as pages because no severity classification was ever agreed on. On-call responders describe a "scroll past the page and check Slack first" pattern, which is the clearest possible signal of alert fatigue.
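The noise figure above can be recomputed from any export of paging history. The following is a minimal sketch, not the tooling we used: the `Page` record and the resolution labels (`auto_resolved`, `ack_only`, `human_action`) are hypothetical names, and the sample data is illustrative rather than drawn from the audit.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert_name: str
    # Hypothetical labels: "auto_resolved", "ack_only", "human_action"
    resolution: str


# Resolutions that required no real human work (assumed labels).
NO_ACTION = {"auto_resolved", "ack_only"}


def noise_fraction(pages: list[Page]) -> float:
    """Share of pages that resolved without human intervention."""
    if not pages:
        return 0.0
    noisy = sum(1 for p in pages if p.resolution in NO_ACTION)
    return noisy / len(pages)


def noisiest_alerts(pages: list[Page], top: int = 5) -> list[tuple[str, int]]:
    """Alerts ranked by how often they paged without needing action."""
    counts = Counter(p.alert_name for p in pages if p.resolution in NO_ACTION)
    return counts.most_common(top)


# Illustrative data only, not figures from the audit.
sample = [
    Page("elevated 5xx rate", "auto_resolved"),
    Page("elevated 5xx rate", "ack_only"),
    Page("queue depth high", "human_action"),
    Page("elevated 5xx rate", "auto_resolved"),
    Page("disk usage warning", "ack_only"),
]

print(f"noise fraction: {noise_fraction(sample):.0%}")
print(noisiest_alerts(sample, top=2))
```

Ranking alerts by no-action pages, rather than by total pages, points tuning effort at the alerts that cost attention without needing it.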

B. The incident process produces post-mortems but no learning.

Post-mortems exist for roughly 85% of SEV1s and SEV2s, which is better than most organizations we see. But the same three categories of failure recur across the forty-seven documents: deploys with insufficient rollback, queue backpressure, and downstream-dependency timeouts. Only about 30% of post-mortem action items are completed; the rest are listed and quietly abandoned. No one owns the cumulative pattern across post-mortems.

C. Deploys are big, infrequent, and feared.

Median deploy size is thirty-eight PRs. Deploys happen 2.5 times per week on average, down from six per week eighteen months ago, a drop of nearly 60%. 22% of deploys require a manual rollback or a hotfix within twelve hours. Engineers describe deploy days as "the day everything breaks." Review time scales with deploy size in a perverse way: bigger deploys make reviewers more nervous, which slows the process, which lets more changes pile up and forces the next deploy to grow further.

D. On-call work is invisible to performance management.

Engineers spend a median of four to six hours of their on-call week on operational work. This time is not formally accounted for in delivery commitments. Teams that absorb more operational load consistently miss product targets — and the same engineers are then rated "delivery-challenged" in performance reviews. The cost is borne by the people doing the work, and they notice.

04 / Diagnosis

What is actually happening.

The operational tax is hidden, unmeasured, and unevenly distributed. Three of the nine teams handle 70% of incidents because they own the most coupled services. The alerting stack converts noise into pages because no one is empowered to decide what is and isn't worth waking someone for. Deploys grow because deploys feel risky, which makes them riskier.

The incidents are not the problem. The system's inability to learn from them is.

05 / Recommendations

What we would change, in priority order.

1. Establish a paging severity bar, owned by the platform team.

Define what is worth paging a human for. Route everything below the bar to a Slack channel. Expect a 40–60% reduction in pages within a quarter.
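One way to make the bar concrete is to write it down as a routing rule that can be reviewed and tested. The sketch below is an assumption about what such a rule might look like, not a prescription: the `Alert` fields, the `Route` values, and the ten-minute threshold are all hypothetical and would be set by the group that drafts the bar.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    SLACK = "slack"


@dataclass(frozen=True)
class Alert:
    name: str
    customer_impacting: bool  # users see errors or degraded service
    self_heals: bool          # known to recover without intervention
    sustained_minutes: int    # how long the condition has held


# Hypothetical threshold; the drafting group would choose the real value.
MIN_SUSTAINED_MINUTES = 10


def route(alert: Alert) -> Route:
    """Page for customer-impacting problems, unless the alert is known to
    self-heal and has not yet outlasted the threshold."""
    if not alert.customer_impacting:
        return Route.SLACK
    if alert.self_heals and alert.sustained_minutes < MIN_SUSTAINED_MINUTES:
        return Route.SLACK
    return Route.PAGE


# A transient 5xx blip goes to Slack; the same alert pages once it persists.
blip = Alert("elevated 5xx rate", True, True, sustained_minutes=3)
stuck = Alert("elevated 5xx rate", True, True, sustained_minutes=15)
print(route(blip).value, route(stuck).value)
```

Keeping the rule in one small function gives the platform team a single place to argue about the bar, rather than tuning monitors one at a time.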

2. Run an incident retrospective on the retrospectives.

Quarterly review of all post-mortems in aggregate. Find the recurring patterns. Fund the structural fixes. Assign an owner for the patterns themselves, not just for individual incidents.
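The aggregate review needs only two numbers to start: how often each failure category recurs, and how many action items were actually completed. The sketch below assumes each post-mortem carries a category tag and a list of action items with a done flag. The `PostMortem` and `ActionItem` records, the incident IDs, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    done: bool


@dataclass
class PostMortem:
    incident_id: str
    category: str  # hypothetical tag applied during review
    action_items: list[ActionItem] = field(default_factory=list)


def recurring_categories(docs: list[PostMortem], min_count: int = 2) -> list[tuple[str, int]]:
    """Failure categories that appear in at least `min_count` post-mortems."""
    counts = Counter(d.category for d in docs)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


def completion_rate(docs: list[PostMortem]) -> float:
    """Fraction of all action items, across all post-mortems, marked done."""
    items = [a for d in docs for a in d.action_items]
    if not items:
        return 0.0
    return sum(a.done for a in items) / len(items)


# Illustrative data only, not drawn from the audited post-mortems.
docs = [
    PostMortem("INC-1", "insufficient rollback",
               [ActionItem("add rollback runbook", True),
                ActionItem("automate rollback", False)]),
    PostMortem("INC-2", "queue backpressure",
               [ActionItem("add consumer autoscaling", False)]),
    PostMortem("INC-3", "insufficient rollback",
               [ActionItem("test rollback in staging", False),
                ActionItem("add deploy checklist", True)]),
    PostMortem("INC-4", "dependency timeout"),
]

print(recurring_categories(docs))
print(f"action items completed: {completion_rate(docs):.0%}")
```

The owner of a pattern would then be accountable for the count in the first list going down, not for closing any single incident.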

3. Move toward smaller, more frequent deploys.

Target a median deploy size of eight PRs within a quarter, down from thirty-eight today. This requires deploy automation work and trunk discipline; we recommend dedicating one platform engineer to it for two quarters.

4. Account for operational load in delivery planning.

Each team budgets explicit operational capacity (we suggest 20% baseline, more for high-load teams). Teams that consistently exceed budget surface a structural problem to leadership rather than absorbing it silently.
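A budget only surfaces structural problems if "consistently exceed" has an agreed definition. The sketch below is one possible reading, under stated assumptions: load is recorded per planning period, the budget is the 20% baseline suggested above, and "consistently" means three consecutive periods over budget. The `PeriodLoad` record, the three-period streak, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodLoad:
    team: str
    capacity_hours: float  # total planned hours; assumed to be > 0
    ops_hours: float       # hours spent on operational work


def ops_share(load: PeriodLoad) -> float:
    """Operational work as a fraction of the team's capacity."""
    return load.ops_hours / load.capacity_hours


def consistently_over_budget(
    history: list[PeriodLoad], budget: float = 0.20, streak: int = 3
) -> list[str]:
    """Teams whose most recent `streak` periods all exceeded the budget.

    `history` is assumed to be in chronological order.
    """
    by_team: dict[str, list[PeriodLoad]] = {}
    for load in history:
        by_team.setdefault(load.team, []).append(load)

    flagged = []
    for team, loads in by_team.items():
        recent = loads[-streak:]
        if len(recent) == streak and all(ops_share(l) > budget for l in recent):
            flagged.append(team)
    return sorted(flagged)


# Illustrative data only: team names and hours are invented.
history = [
    PeriodLoad("checkout", 80, 24), PeriodLoad("search", 80, 8),
    PeriodLoad("checkout", 80, 20), PeriodLoad("search", 80, 20),
    PeriodLoad("checkout", 80, 28), PeriodLoad("search", 80, 12),
]

print(consistently_over_budget(history))
```

Requiring a streak rather than a single bad period keeps one rough week from triggering an escalation, while still making sustained overload visible to leadership.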

5. Rotate operational ownership of shared services.

The three teams currently absorbing operational tax should not own this load permanently. Build a rotation that distributes operational burden across the broader engineering organization over twelve months.

06 / The first ninety days

A sequenced plan.

Weeks 1–2.

Convene a small group — head of platform, two senior engineers, SRE lead — to draft the paging severity bar. Begin tuning the worst-offending alerts immediately.

Weeks 3–4.

Run the first quarterly retrospective-of-retrospectives. Identify the top three recurring failure patterns. Name an owner for each.

Weeks 5–8.

Begin deploy automation work. Pilot smaller deploys with two volunteer teams. Introduce explicit operational capacity into one team's planning.

Weeks 9–12.

Measure paging rate, mean time to acknowledge, deploy frequency, deploy size, and rollback rate. Re-evaluate the alerting bar with real data.
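The deploy-side metrics in this list can be computed from a simple deploy log. The sketch below covers deploy frequency, median deploy size, and rollback rate only; paging rate and time to acknowledge would come from the paging tool's own reports. The `Deploy` record and the sample data are hypothetical, and `rolled_back` is assumed to mean a manual rollback or hotfix within twelve hours, matching the definition used in the findings.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Deploy:
    day: date
    pr_count: int
    rolled_back: bool  # manual rollback or hotfix within twelve hours


def deploy_metrics(deploys: list[Deploy], weeks: float) -> dict[str, float]:
    """Deploy frequency, median size, and rollback rate over a window."""
    if not deploys or weeks <= 0:
        raise ValueError("need at least one deploy and a positive window")
    return {
        "deploys_per_week": len(deploys) / weeks,
        "median_size_prs": float(median(d.pr_count for d in deploys)),
        "rollback_rate": sum(d.rolled_back for d in deploys) / len(deploys),
    }


# Illustrative two-week window, not data from the audit.
window = [
    Deploy(date(2024, 1, 2), 10, False),
    Deploy(date(2024, 1, 4), 38, True),
    Deploy(date(2024, 1, 9), 8, False),
    Deploy(date(2024, 1, 11), 40, False),
]

metrics = deploy_metrics(window, weeks=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Running the same function over the baseline quarter and over weeks 9 to 12 gives a like-for-like comparison when the alerting bar and deploy targets are re-evaluated.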

Expected outcome: 40–60% reduction in pages, 30–50% reduction in unplanned operational work, and stabilization of deploy frequency within ninety days. We would rather promise less and observe more.