How to Run a Blameless Postmortem That Actually Improves Things
At 3:47 AM on a Sunday, I pushed a configuration change that took BirJob offline for 2 hours and 14 minutes. The change was supposed to update our scraper concurrency limit. Instead, it set the database connection pool size to zero. No connections, no queries, no website.
I fixed it in 8 minutes once I woke up to the alerts. But the outage lasted over two hours because my alerts were misconfigured — they were sent to a Slack channel I had muted. The root cause was not the bad config value. It was not the muted Slack channel. It was the absence of a validation step that would have caught an obviously wrong value before it reached production.
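A single guard clause would have caught this before deployment. Here is a minimal sketch of that missing validation step — the parameter names mirror my incident, but the ranges and the function itself are illustrative, not BirJob's actual code:

```python
# Minimal config sanity check, run before a config change is applied.
# Parameter names (pool_size, scraper_concurrency) mirror the incident;
# the valid ranges are illustrative assumptions, not real limits.

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means safe."""
    errors = []
    pool_size = config.get("pool_size")
    if not isinstance(pool_size, int) or pool_size < 1:
        errors.append(f"pool_size must be a positive integer, got {pool_size!r}")
    concurrency = config.get("scraper_concurrency")
    if not isinstance(concurrency, int) or not 1 <= concurrency <= 64:
        errors.append(f"scraper_concurrency must be in [1, 64], got {concurrency!r}")
    return errors

# The bad change from the incident: pool_size accidentally set to 0.
bad = {"pool_size": 0, "scraper_concurrency": 8}
print(validate_config(bad))  # non-empty -> deployment should be blocked
```

Ten lines of validation, wired into the deployment path, would have turned a two-hour outage into a rejected commit.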
I ran a postmortem on this incident — with myself. Yes, it felt absurd. But it produced three concrete improvements that have prevented similar issues since. This article is about how to run postmortems that produce those kinds of improvements, whether you are a solo developer or part of a hundred-person team.
1. Why Most Postmortems Fail
Most organizations do postmortems wrong. They either skip them entirely ("we fixed it, let us move on"), turn them into blame sessions ("whose fault was this?"), or produce documents that no one reads, with action items that no one completes.
Google's SRE book defines a postmortem as "a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring." The emphasis on "follow-up actions" is crucial — a postmortem without action items is just a story.
According to PagerDuty's 2024 State of Digital Operations report, 73% of organizations conduct postmortems after major incidents, but only 34% consistently complete the action items identified. That gap — between knowing what to fix and actually fixing it — is where most postmortem processes break down.
2. What "Blameless" Actually Means
Blameless does not mean "no one is responsible." It means the postmortem focuses on system failures rather than individual failures. The distinction is important:
| Blame-Oriented | Blameless |
|---|---|
| "John pushed a bad config change." | "A config change with an invalid value reached production without validation." |
| "QA should have caught this." | "Our test suite does not cover this failure mode." |
| "The on-call engineer took too long to respond." | "Alert routing did not reach the on-call engineer through the expected channel." |
| "This would not have happened if they followed the deployment checklist." | "The deployment process does not enforce the checklist steps automatically." |
The blameless framing is not about being nice. It is about effectiveness. When people fear punishment, they hide information. When they hide information, you cannot find the real root causes. When you cannot find root causes, the same incidents keep happening. Sidney Dekker's "Just Culture" research shows that organizations with blameless cultures have 40% fewer repeat incidents than those with punitive approaches.
3. The Postmortem Process: Step by Step
Step 1: Declare the Postmortem (Within 24 Hours)
Not every incident needs a postmortem. Here are criteria for when to conduct one:
- Any user-facing downtime exceeding your SLO threshold
- Data loss of any kind
- Security breach or near-miss
- Incident requiring manual intervention to resolve
- Incident where the on-call person escalated
- Any incident that "felt close" — a near-miss that could have been worse
Step 2: Gather Timeline Data (Before the Meeting)
The postmortem facilitator should compile a timeline before the meeting. Do not rely on memory — use logs, monitoring data, chat transcripts, and git history.
## Incident Timeline
| Time (UTC) | Event | Source |
|---|---|---|
| 03:42 | Config change deployed via CI/CD | GitHub Actions |
| 03:43 | Database connection errors begin | Application logs |
| 03:44 | Health check failures, load balancer removes server | Nginx logs |
| 03:45 | First user error reports (Twitter) | Twitter |
| 03:47 | Alert sent to #alerts-muted channel | PagerDuty |
| 05:52 | Engineer wakes up, sees Twitter notification | Manual |
| 05:55 | Root cause identified (pool_size=0) | Git diff |
| 06:00 | Config reverted, service recovering | GitHub Actions |
| 06:01 | Full service restoration confirmed | Monitoring |
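A timeline like this can be assembled mechanically rather than from memory: pull timestamped events from each source and merge-sort them. A small sketch — the event tuples and sample data are illustrative, and in practice each source would be parsed from its own log format:

```python
from datetime import datetime

# Merge events from several sources (CI, app logs, alerting) into one
# sorted timeline. Each source is a list of (timestamp, event, source)
# tuples; the sample data below mirrors the incident above.

def build_timeline(*sources):
    events = [e for src in sources for e in src]
    return sorted(events, key=lambda e: e[0])

ci_events = [(datetime(2026, 3, 15, 3, 42), "Config change deployed", "GitHub Actions")]
app_logs  = [(datetime(2026, 3, 15, 3, 43), "DB connection errors begin", "Application logs")]
alerts    = [(datetime(2026, 3, 15, 3, 47), "Alert sent to muted channel", "PagerDuty")]

for ts, event, source in build_timeline(alerts, ci_events, app_logs):
    print(f"{ts:%H:%M} | {event} | {source}")
```

The point is not the code — it is that the facilitator arrives at the meeting with an evidence-backed, chronologically ordered record, so the meeting corrects the timeline instead of inventing it.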
Step 3: Run the Meeting (30-60 Minutes)
The meeting should follow this structure:
- Set the tone (2 min): "This is a blameless postmortem. We are here to understand what happened and prevent it from happening again. We are not here to assign blame."
- Review the timeline (10 min): Walk through the timeline. Correct any inaccuracies. Fill in gaps.
- Ask "why" repeatedly (15-20 min): Use the "5 Whys" technique or a more nuanced contributing factors analysis.
- Identify what went well (5 min): This is often skipped but crucial. What systems, processes, or actions helped mitigate the impact?
- Define action items (10-15 min): Specific, assignable, measurable improvements.
- Assign owners and deadlines (5 min): Every action item needs an owner and a deadline. Items without owners never get done.
4. Root Cause Analysis: Beyond "5 Whys"
The "5 Whys" technique is popular but often oversimplifies complex incidents. Real incidents usually have multiple contributing factors, not a single root cause. I prefer the contributing factors model:
| Factor Category | Example from Our Incident | Improvement |
|---|---|---|
| Technical | No validation on config values | Add schema validation for all config files |
| Process | Config changes bypass review for "minor" changes | All production config changes require PR review |
| Alerting | Alert routed to muted channel | Critical alerts must use PagerDuty phone escalation |
| Testing | No integration test for config loading | Add test that verifies config loads with valid values |
| Documentation | Config parameter constraints not documented | Document valid ranges for all config parameters |
Notice how this gives us five improvements instead of one "root cause." Each improvement independently reduces the likelihood of a similar incident. Together, they create defense in depth. John Allspaw's work on incident analysis emphasizes that real incidents are caused by multiple factors that are "each necessary but only jointly sufficient" — fixing any one of them would have prevented the incident.
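Defense in depth can be made concrete with back-of-the-envelope arithmetic: if each layer independently misses a bad change with some probability, the chance a bad change slips past every layer is the product of those probabilities. The per-layer miss rates below are made-up illustrative numbers, and the independence assumption is rough, but the compounding effect is the point:

```python
# Rough defense-in-depth arithmetic. The per-layer miss rates are
# illustrative assumptions, not measurements.
miss_rates = {
    "schema validation": 0.05,   # catches ~95% of invalid values
    "PR review":         0.30,
    "integration test":  0.20,
    "paging escalation": 0.10,   # limits duration rather than preventing
}

p_all_miss = 1.0
for layer, p in miss_rates.items():
    p_all_miss *= p

print(f"P(bad change slips past every layer) = {p_all_miss:.4f}")  # 0.0003
```

No single layer is close to perfect, yet together they take the odds of a repeat incident from "when" to "rarely" — which is exactly why five modest improvements beat one heroic "root cause" fix.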
5. Writing the Postmortem Document
The postmortem document is not just a record — it is a teaching tool. Future team members will read it to understand what can go wrong and how to prevent it. Here is a template that works:
## Postmortem: [Incident Title]
**Date:** 2026-03-15
**Duration:** 2 hours 14 minutes
**Severity:** SEV-2 (service degradation, > 50% users affected)
**Author:** Ismat Asamadov
**Status:** Action items in progress
### Summary
A configuration change set the database connection pool size to 0,
causing a complete service outage for 2 hours and 14 minutes.
Approximately 12,000 users were affected.
### Impact
- 100% of web requests returned 502 errors for 2h14m
- ~200 job application redirects were lost
- No data loss (scraper runs are idempotent)
- Estimated revenue impact: 0 AZN (no paid features affected)
### Timeline
[Detailed timeline as shown above]
### Root Cause / Contributing Factors
[Table of contributing factors as shown above]
### What Went Well
- Service recovered quickly once the engineer was engaged (8 minutes)
- Monitoring correctly detected the outage
- Deployment pipeline made rollback fast (< 3 minutes)
### What Went Wrong
- Alert did not reach the on-call engineer for 2+ hours
- No validation prevented an invalid config value
- No automated rollback on health check failure
### Action Items
| # | Action | Owner | Priority | Deadline | Status |
|---|--------|-------|----------|----------|--------|
| 1 | Add JSON schema validation for all config files | Ismat | P1 | 2026-03-22 | Done |
| 2 | Route critical alerts through PagerDuty phone | Ismat | P0 | 2026-03-16 | Done |
| 3 | Add auto-rollback on sustained health check failure | Ismat | P1 | 2026-03-29 | In Progress |
| 4 | Require PR review for ALL production config changes | Ismat | P2 | 2026-03-22 | Done |
| 5 | Document valid ranges for config parameters | Ismat | P3 | 2026-04-05 | Open |
### Lessons Learned
Configuration changes are code changes and should be treated with
the same rigor: reviewed, tested, and validated before deployment.
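Action item 3 in the template above (auto-rollback on sustained health check failure) reduces to a simple counter: revert only after N consecutive failures, so a single blip does not trigger a rollback. A sketch — the health check and rollback hooks are injected callables, since the real ones (an HTTP probe and a CI/CD revert job) are deployment-specific:

```python
# Auto-rollback sketch: revert after `threshold` consecutive health check
# failures. `check` and `rollback` are injected; in production they would
# be an HTTP probe and a pipeline revert job (deployment-specific).

def watch_and_rollback(check, rollback, threshold: int = 3, max_checks: int = 10) -> bool:
    """Return True if a rollback was triggered."""
    consecutive_failures = 0
    for _ in range(max_checks):
        if check():
            consecutive_failures = 0       # healthy: reset the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                rollback()
                return True
    return False

# Simulated outage: healthy twice, then failing (as pool_size=0 would cause).
results = iter([True, True, False, False, False, True])
print(watch_and_rollback(lambda: next(results), lambda: None))  # True
```

With this in place, the 2-hour outage above would have lasted roughly as long as three failed health checks, regardless of which Slack channel was muted.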
6. Action Items That Actually Get Done
The most common failure mode of postmortems is that action items are identified but never completed. Here is how to fix that:
Make Them Specific
| Bad Action Item | Good Action Item |
|---|---|
| "Improve monitoring" | "Add a PagerDuty phone escalation for all P0/P1 alerts by March 22" |
| "Better testing" | "Add integration test that verifies database connection pool initializes with > 0 connections" |
| "Review deployment process" | "Add a CI step that validates config JSON against schema before deployment" |
| "Update documentation" | "Add valid ranges (min/max) to config.example.json comments for pool_size, timeout, and retry_count" |
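The third good action item above — a CI step that validates config against a schema — can be a short script that fails the pipeline on any invalid value. A stdlib-only sketch (the field names and ranges are illustrative; a real project might reach for the `jsonschema` package instead):

```python
# Config schema check intended to run as a CI step before deployment.
# Field names and ranges are illustrative assumptions; real projects
# might use the `jsonschema` package for richer validation.
# CI usage sketch: load config.json, call check_config, exit non-zero
# on errors so the pipeline blocks the deploy.

SCHEMA = {
    # field: (expected type, min, max)
    "pool_size":   (int, 1, 100),
    "timeout":     (int, 1, 300),
    "retry_count": (int, 0, 10),
}

def check_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    errors = []
    for field, (ftype, lo, hi) in SCHEMA.items():
        value = config.get(field)
        if not isinstance(value, ftype) or isinstance(value, bool):
            errors.append(f"{field}: expected {ftype.__name__}, got {value!r}")
        elif not lo <= value <= hi:
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

print(check_config({"pool_size": 0, "timeout": 30, "retry_count": 3}))
```

Note how this one script satisfies two of the good action items at once: it is the CI validation step, and the `SCHEMA` table doubles as documentation of valid ranges.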
Track Them Religiously
Create the action items as tickets in your issue tracker (Jira, Linear, GitHub Issues) immediately after the postmortem meeting. Do not leave them in a Google Doc. Review completion status in your weekly team meeting. According to Atlassian's incident management guide, teams that track postmortem action items in their standard issue tracker complete 78% of them, compared to 23% for teams that track them in standalone documents.
Prioritize Ruthlessly
Not every action item needs to be done immediately. Use this priority framework:
- P0 (Do today): The same incident could happen again right now
- P1 (Do this sprint): High-impact prevention with clear implementation
- P2 (Do this quarter): Important but not urgent, or requires significant effort
- P3 (Backlog): Nice to have, will prevent edge cases
7. Common Anti-Patterns
The Blame Game
"John caused the outage." Even if John's action triggered the incident, the system should have prevented it. Why was it possible for one person's action to cause a production outage? That is the system failure to investigate.
The Single Root Cause Fallacy
"The root cause was a typo." No. The root cause is that your system allows typos to reach production. Typos are inevitable; production outages from typos are preventable.
The Action Item Graveyard
Fifty action items from the last ten postmortems, none completed. This is worse than having no postmortem process at all, because it teaches the team that postmortems are performative — they exist to check a box, not to improve things.
The Missing Postmortem
"It was a small incident, we do not need a postmortem." Small incidents are where the biggest learning opportunities hide. They are the canaries in the coal mine. A 5-minute outage that you fixed quickly might reveal the same systemic weakness as a 5-hour outage that you catch before it happens.
The Recency Bias
Only analyzing the trigger (what changed right before the incident) while ignoring latent conditions (what was already broken). The trigger is usually the least interesting part. The interesting question is: what pre-existing conditions made the trigger dangerous?
8. Building a Postmortem Culture
A postmortem process is only as good as the culture that supports it. Here is how to build that culture:
- Leaders go first. When a leader publicly runs a blameless postmortem on their own mistake, it signals that postmortems are about learning, not punishment.
- Celebrate postmortems. Share them broadly. At Google, postmortems are available to the entire company. They are treated as valuable learning resources, not embarrassing confessions.
- Never punish the messenger. If someone discovers they caused an incident and reports it, they should be thanked, not punished. The alternative — people hiding incidents — is far more dangerous.
- Review postmortems quarterly. Look for patterns across incidents. Are the same types of incidents recurring? Are action items being completed? Are the right types of improvements being made?
- Invest in prevention. If your team is constantly fighting fires, they will not have time for postmortems or prevention. Create space for reliability work.
9. My Opinionated Take
Every incident is a gift. I know this sounds like toxic positivity, but I genuinely believe it. Every incident reveals something about your system that you did not know. Without the incident, that weakness would remain hidden until it caused something worse. The postmortem is how you extract value from the incident.
Three action items are better than fifteen. After a major incident, it is tempting to list every possible improvement. Resist this. Pick the three highest-impact items that you will actually complete. Three completed improvements are worth infinitely more than fifteen items in a backlog.
Blamelessness is not optional. If you cannot run blameless postmortems, you cannot improve. People will hide information, cover up mistakes, and avoid reporting near-misses. Your postmortem process will produce fiction, not insights. This is non-negotiable.
Automate the human out of the loop. Every time a postmortem reveals that a human made an error, the correct action item is not "tell humans to be more careful." It is "make it impossible for this error to reach production." Humans are unreliable; systems should compensate for that.
10. Action Plan: Implementing Postmortems
Week 1: Set Up the Process
- Create a postmortem template in your team's wiki/docs
- Define incident severity levels and which require postmortems
- Designate a postmortem facilitator role (can rotate)
- Create a shared folder/channel for postmortem documents
Week 2: Run Your First Postmortem
- Pick a recent incident (even a minor one)
- Compile the timeline from logs and chat history
- Run the meeting following the structure above
- Write up the document and share it with the team
- Create action item tickets in your issue tracker
Ongoing: Build the Habit
- Run a postmortem within 48 hours of every qualifying incident
- Review action item completion weekly
- Review postmortem patterns quarterly
- Celebrate completed action items — they represent real improvement
Sources
- Google SRE Book — Postmortem Culture
- PagerDuty — 2024 State of Digital Operations
- John Allspaw — Each Necessary But Only Jointly Sufficient
- Atlassian — Incident Management and Postmortems
- Sidney Dekker — Just Culture
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
