Notes on DevOps, incident response, and building sustainable engineering cultures.
The 3-Pager That Fixed On-Call
Clarified scope, severity levels, and paging rules.
Problem
- Ambiguous scope of what on-call “owned.”
- Severity levels meant different things to different teams.
- Excessive noise → false pages → responder fatigue.
- Runbooks existed but weren’t authoritative or consistently used.
The Solution (3 pages, one owner)
- Scope: systems and services explicitly in scope; an escalation path for everything else.
- Severity: SEV-1 (customer/business critical), SEV-2 (degraded but functional), SEV-3 (nuisance/ops toil), each with concrete examples.
- Paging rules: Pages only for SEV-1 and high-confidence SEV-2 signals. Everything else → ticket + business hours.
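The paging rule is simple enough to write down as code. Here is a minimal sketch, not our production router: the `Alert` type, the `confidence` score, and the 0.9 cutoff for "high-confidence" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # customer/business critical
    SEV2 = 2  # degraded
    SEV3 = 3  # nuisance/ops toil


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    confidence: float  # hypothetical 0.0-1.0 score for how trustworthy the signal is


# Assumed cutoff for "high-confidence"; pick a value that fits your own signals.
HIGH_CONFIDENCE = 0.9


def route(alert: Alert) -> str:
    """Return "page" or "ticket" for an alert, following the paging rules."""
    if alert.severity is Severity.SEV1:
        return "page"
    if alert.severity is Severity.SEV2 and alert.confidence >= HIGH_CONFIDENCE:
        return "page"
    return "ticket"  # everything else waits for business hours
```

Keeping the decision in one small function makes the rule easy to review and hard to quietly widen, because every new paging condition has to show up as a visible change.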
Guardrails we added
- Golden signals: latency, traffic, errors, saturation — per service.
- Runbook links inline for each common alert.
- Comms script: who says what, where, and when (Slack/Status/email).
- Single owner for updates to prevent drift.
Results
- ~40% fewer pages (noise removed), better sleep, better focus.
- Meaningful pages → faster time-to-mitigation and clearer handoffs.
- Post-incident reviews improved because severity was consistent org-wide.
Template (steal this)
1. Scope & Out-of-Scope
2. Severity table + examples
3. Paging rules + escalation flow
- Appendix: runbooks, dashboards, owners
Short, living, and owned. That’s what made it work.
From IC to Leader: A Lightweight Mentoring Path
How I help strong ICs become calm, trusted incident leaders.
The path
- Shadow — join incident channels as a quiet observer; review post-mortems together.
- Co-pilot — run a small portion (notes, timeline, or comms) with a senior lead present.
- Lead a drill — table-top exercises with clear injects and measurable outcomes.
- Own an incident — senior leader backstops; feedback within 24 hours.
What we practice
- Clarity over certainty: call the severity with available info, then adjust.
- Small batches: one change at a time, explicit rollback.
- Comms cadence: external/internal updates on a timer, not a feeling.
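"On a timer, not a feeling" can be made literal. The sketch below is one way to do it, assuming a fixed update interval per severity; the intervals and function names are illustrative and do not come from a specific tool.

```python
from datetime import datetime, timedelta

# Hypothetical cadence per severity; tune these to your own comms policy.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(minutes=60),
}


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return the time by which the next status update should go out."""
    return last_update + UPDATE_INTERVAL[severity]


def is_update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    """True once the cadence timer for this severity has elapsed."""
    return now >= next_update_due(severity, last_update)
```

During a drill, the incident lead can call `is_update_overdue` from a reminder loop or a bot, so the prompt to post an update comes from the clock rather than from someone's sense of how long it has been.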
Artifacts
- Incident commander checklist (roles, comms, handoffs).
- Runbook skeleton (preconditions, steps, expected results, rollback).
- After-action template: facts → findings → fixes → follow-through (owners + dates).
Mentoring is a system: reps, feedback, and a safe runway to try leading for real.
Blameless ≠ Consequence-Free: Making Post-Mortems Stick
Turn incidents into durable improvements without witch hunts or wheel-spinning.
Principles
- Blame the system, not the person — design makes errors likely or unlikely.
- Bias to facts — timeline first, opinions later.
- Right-sized fixes — priority is preventing recurrence, not boiling the ocean.
The template
- Timeline: facts with timestamps and sources (dashboards, logs, comms).
- Customer impact: who, how long, severity.
- Contributing factors: technical + organizational.
- Actions: fix now (days), fix next (weeks), invest (quarter) — owners + dates.
- Follow-through: review action status weekly until done.
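The actions and follow-through rows work best when they are tracked as data, not prose. This is a minimal sketch of the weekly review, assuming a simple in-memory list; the `Action` fields and the `weekly_review` helper are illustrative names, not part of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Action:
    title: str
    owner: str
    due: date
    horizon: str  # "fix now" (days), "fix next" (weeks), or "invest" (quarter)
    done: bool = False


def weekly_review(actions: list[Action], today: date) -> list[Action]:
    """Return open actions with overdue items first, then by due date."""
    open_actions = [a for a in actions if not a.done]
    # False sorts before True, so overdue actions (due < today) come first.
    return sorted(open_actions, key=lambda a: (a.due >= today, a.due))
```

Running this each week gives the review a fixed agenda: overdue items at the top, each with a named owner, until the list is empty.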
What changed when we did this
- Repeat issues dropped because actions had owners and deadlines.
- Engineers participated more; psychological safety increased.
- Leaders got better signal on where to invest (people, tooling, or process).
Post-mortems pay off when they drive change. That means owners, dates, and visible follow-up.