How to Run a Blameless Postmortem That Actually Improves Things
At 3:47 AM on a Sunday, I pushed a configuration change that took BirJob offline for 2 hours and 14 minutes. The change was supposed to update our scraper concurrency limit. Instead, it set the database connection pool size to zero. No connections, no queries, no website.
I fixed it in 8 minutes once I woke up to the alerts. But the outage lasted over two hours because my alerts were misconfigured — they were sent to a Slack channel I had muted. The root cause was not the bad config value. It was not the muted Slack channel. It was the absence of a validation step that would have caught an obviously wrong value before it reached production.
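A single guard clause would have caught this before deployment. Here is a minimal sketch of that missing validation step — the parameter names mirror my incident, but the ranges and the function itself are illustrative, not BirJob's actual code:

```python
# Minimal config sanity check, run before a config change is applied.
# Parameter names (pool_size, scraper_concurrency) mirror the incident;
# the valid ranges are illustrative assumptions, not real limits.

def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means safe."""
    errors = []
    pool_size = config.get("pool_size")
    if not isinstance(pool_size, int) or pool_size < 1:
        errors.append(f"pool_size must be a positive integer, got {pool_size!r}")
    concurrency = config.get("scraper_concurrency")
    if not isinstance(concurrency, int) or not 1 <= concurrency <= 64:
        errors.append(f"scraper_concurrency must be in [1, 64], got {concurrency!r}")
    return errors

# The bad change from the incident: pool_size accidentally set to 0.
bad = {"pool_size": 0, "scraper_concurrency": 8}
print(validate_config(bad))  # non-empty -> deployment should be blocked
```

Ten lines of validation, wired into the deployment path, would have turned a two-hour outage into a rejected commit.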
I ran a postmortem on this incident — with myself. Yes, it felt absurd. But it produced three concrete improvements that have prevented similar issues since. This article is about how to run postmortems that produce those kinds of improvements, whether you are a solo developer or part of a hundred-person team.
1. Why Most Postmortems Fail
Most organizations do postmortems wrong. They either skip them entirely ("we fixed it, let us move on"), turn them into blame sessions ("whose fault was this?"), or produce documents that no one reads, with action items that no one completes.
Google's SRE book defines a postmortem as "a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring." The emphasis on "follow-up actions" is crucial — a postmortem without action items is just a story.
According to PagerDuty's 2024 State of Digital Operations report, 73% of organizations conduct postmortems after major incidents, but only 34% consistently complete the action items identified. That gap — between knowing what to fix and actually fixing it — is where most postmortem processes break down.
2. What "Blameless" Actually Means
Blameless does not mean "no one is responsible." It means the postmortem focuses on system failures rather than individual failures. The distinction is important:
| Blame-Oriented | Blameless |
|---|---|
| "John pushed a bad config change." | "A config change with an invalid value reached production without validation." |
| "QA should have caught this." | "Our test suite does not cover this failure mode." |
| "The on-call engineer took too long to respond." | "Alert routing did not reach the on-call engineer through the expected channel." |
| "This would not have happened if they followed the deployment checklist." | "The deployment process does not enforce the checklist steps automatically." |
The blameless framing is not about being nice. It is about effectiveness. When people fear punishment, they hide information. When they hide information, you cannot find the real root causes. When you cannot find root causes, the same incidents keep happening. Sidney Dekker's "Just Culture" research shows that organizations with blameless cultures have 40% fewer repeat incidents than those with punitive approaches.
3. The Postmortem Process: Step by Step
Step 1: Declare the Postmortem (Within 24 Hours)
Not every incident needs a postmortem. Here are criteria for when to conduct one:
- Any user-facing downtime exceeding your SLO threshold
- Data loss of any kind
- Security breach or near-miss
- Incident requiring manual intervention to resolve
- Incident where the on-call person escalated
- Any incident that "felt close" — a near-miss that could have been worse
Step 2: Gather Timeline Data (Before the Meeting)
The postmortem facilitator should compile a timeline before the meeting. Do not rely on memory — use logs, monitoring data, chat transcripts, and git history.
## Incident Timeline
| Time (UTC) | Event | Source |
|---|---|---|
| 03:42 | Config change deployed via CI/CD | GitHub Actions |
| 03:43 | Database connection errors begin | Application logs |
| 03:44 | Health check failures, load balancer removes server | Nginx logs |
| 03:45 | First user error reports (Twitter) | Twitter |
| 03:47 | Alert sent to #alerts-muted channel | PagerDuty |
| 05:52 | Engineer wakes up, sees Twitter notification | Manual |
| 05:55 | Root cause identified (pool_size=0) | Git diff |
| 06:00 | Config reverted, service recovering | GitHub Actions |
| 06:01 | Full service restoration confirmed | Monitoring |
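A timeline like this can be assembled mechanically rather than from memory: pull timestamped events from each source and merge-sort them. A small sketch — the event tuples and sample data are illustrative, and in practice each source would be parsed from its own log format:

```python
from datetime import datetime

# Merge events from several sources (CI, app logs, alerting) into one
# sorted timeline. Each source is a list of (timestamp, event, source)
# tuples; the sample data below mirrors the incident above.

def build_timeline(*sources):
    events = [e for src in sources for e in src]
    return sorted(events, key=lambda e: e[0])

ci_events = [(datetime(2026, 3, 15, 3, 42), "Config change deployed", "GitHub Actions")]
app_logs  = [(datetime(2026, 3, 15, 3, 43), "DB connection errors begin", "Application logs")]
alerts    = [(datetime(2026, 3, 15, 3, 47), "Alert sent to muted channel", "PagerDuty")]

for ts, event, source in build_timeline(alerts, ci_events, app_logs):
    print(f"{ts:%H:%M} | {event} | {source}")
```

The point is not the code — it is that the facilitator arrives at the meeting with an evidence-backed, chronologically ordered record, so the meeting corrects the timeline instead of inventing it.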
Step 3: Run the Meeting (30-60 Minutes)
The meeting should follow this structure:
- Set the tone (2 min): "This is a blameless postmortem. We are here to understand what happened and prevent it from happening again. We are not here to assign blame."
- Review the timeline (10 min): Walk through the timeline. Correct any inaccuracies. Fill in gaps.
- Ask "why" repeatedly (15-20 min): Use the "5 Whys" technique or a more nuanced contributing factors analysis.
- Identify what went well (5 min): This is often skipped but crucial. What systems, processes, or actions helped mitigate the impact?
- Define action items (10-15 min): Specific, assignable, measurable improvements.
- Assign owners and deadlines (5 min): Every action item needs an owner and a deadline. Items without owners never get done.
4. Root Cause Analysis: Beyond "5 Whys"
The "5 Whys" technique is popular but often oversimplifies complex incidents. Real incidents usually have multiple contributing factors, not a single root cause. I prefer the contributing factors model:
| Factor Category | Example from Our Incident | Improvement |
|---|---|---|
| Technical | No validation on config values | Add schema validation for all config files |
| Process | Config changes bypass review for "minor" changes | All production config changes require PR review |
| Alerting | Alert routed to muted channel | Critical alerts must use PagerDuty phone escalation |
| Testing | No integration test for config loading | Add test that verifies config loads with valid values |
| Documentation | Config parameter constraints not documented | Document valid ranges for all config parameters |
Notice how this gives us five improvements instead of one "root cause." Each improvement independently reduces the likelihood of a similar incident. Together, they create defense in depth. John Allspaw's work on incident analysis emphasizes that real incidents are caused by multiple factors that are "each necessary but only jointly sufficient" — fixing any one of them would have prevented the incident.
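Defense in depth can be made concrete with back-of-the-envelope arithmetic: if each layer independently misses a bad change with some probability, the chance a bad change slips past every layer is the product of those probabilities. The per-layer miss rates below are made-up illustrative numbers, and the independence assumption is rough, but the compounding effect is the point:

```python
# Rough defense-in-depth arithmetic. The per-layer miss rates are
# illustrative assumptions, not measurements.
miss_rates = {
    "schema validation": 0.05,   # catches ~95% of invalid values
    "PR review":         0.30,
    "integration test":  0.20,
    "paging escalation": 0.10,   # limits duration rather than preventing
}

p_all_miss = 1.0
for layer, p in miss_rates.items():
    p_all_miss *= p

print(f"P(bad change slips past every layer) = {p_all_miss:.4f}")  # 0.0003
```

No single layer is close to perfect, yet together they take the odds of a repeat incident from "when" to "rarely" — which is exactly why five modest improvements beat one heroic "root cause" fix.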
5. Writing the Postmortem Document
The postmortem document is not just a record — it is a teaching tool. Future team members will read it to understand what can go wrong and how to prevent it. Here is a template that works:
## Postmortem: [Incident Title]
**Date:** 2026-03-15
**Duration:** 2 hours 14 minutes
**Severity:** SEV-2 (service degradation, > 50% users affected)
**Author:** Ismat Asamadov
**Status:** Action items in progress
### Summary
A configuration change set the database connection pool size to 0,
causing a complete service outage for 2 hours and 14 minutes.
Approximately 12,000 users were affected.
### Impact
- 100% of web requests returned 502 errors for 2h14m
- ~200 job application redirects were lost
- No data loss (scraper runs are idempotent)
- Estimated revenue impact: 0 AZN (no paid features affected)
### Timeline
[Detailed timeline as shown above]
### Root Cause / Contributing Factors
[Table of contributing factors as shown above]
### What Went Well
- Service recovered quickly once the engineer was engaged (8 minutes)
- Monitoring correctly detected the outage
- Deployment pipeline made rollback fast (< 3 minutes)
### What Went Wrong
- Alert did not reach the on-call engineer for 2+ hours
- No validation prevented an invalid config value
- No automated rollback on health check failure
### Action Items
| # | Action | Owner | Priority | Deadline | Status |
|---|--------|-------|----------|----------|--------|
| 1 | Add JSON schema validation for all config files | Ismat | P1 | 2026-03-22 | Done |
| 2 | Route critical alerts through PagerDuty phone | Ismat | P0 | 2026-03-16 | Done |
| 3 | Add auto-rollback on sustained health check failure | Ismat | P1 | 2026-03-29 | In Progress |
| 4 | Require PR review for ALL production config changes | Ismat | P2 | 2026-03-22 | Done |
| 5 | Document valid ranges for config parameters | Ismat | P3 | 2026-04-05 | Open |
### Lessons Learned
Configuration changes are code changes and should be treated with
the same rigor: reviewed, tested, and validated before deployment.
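Action item 3 in the template above (auto-rollback on sustained health check failure) reduces to a simple counter: revert only after N consecutive failures, so a single blip does not trigger a rollback. A sketch — the health check and rollback hooks are injected callables, since the real ones (an HTTP probe and a CI/CD revert job) are deployment-specific:

```python
# Auto-rollback sketch: revert after `threshold` consecutive health check
# failures. `check` and `rollback` are injected; in production they would
# be an HTTP probe and a pipeline revert job (deployment-specific).

def watch_and_rollback(check, rollback, threshold: int = 3, max_checks: int = 10) -> bool:
    """Return True if a rollback was triggered."""
    consecutive_failures = 0
    for _ in range(max_checks):
        if check():
            consecutive_failures = 0       # healthy: reset the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                rollback()
                return True
    return False

# Simulated outage: healthy twice, then failing (as pool_size=0 would cause).
results = iter([True, True, False, False, False, True])
print(watch_and_rollback(lambda: next(results), lambda: None))  # True
```

With this in place, the 2-hour outage above would have lasted roughly as long as three failed health checks, regardless of which Slack channel was muted.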
6. Action Items That Actually Get Done
The most common failure mode of postmortems is that action items are identified but never completed. Here is how to fix that:
Make Them Specific
| Bad Action Item | Good Action Item |
|---|---|
| "Improve monitoring" | "Add a PagerDuty phone escalation for all P0/P1 alerts by March 22" |
| "Better testing" | "Add integration test that verifies database connection pool initializes with > 0 connections" |
| "Review deployment process" | "Add a CI step that validates config JSON against schema before deployment" |
| "Update documentation" | "Add valid ranges (min/max) to config.example.json comments for pool_size, timeout, and retry_count" |
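The third good action item above — a CI step that validates config against a schema — can be a short script that fails the pipeline on any invalid value. A stdlib-only sketch (the field names and ranges are illustrative; a real project might reach for the `jsonschema` package instead):

```python
# Config schema check intended to run as a CI step before deployment.
# Field names and ranges are illustrative assumptions; real projects
# might use the `jsonschema` package for richer validation.
# CI usage sketch: load config.json, call check_config, exit non-zero
# on errors so the pipeline blocks the deploy.

SCHEMA = {
    # field: (expected type, min, max)
    "pool_size":   (int, 1, 100),
    "timeout":     (int, 1, 300),
    "retry_count": (int, 0, 10),
}

def check_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    errors = []
    for field, (ftype, lo, hi) in SCHEMA.items():
        value = config.get(field)
        if not isinstance(value, ftype) or isinstance(value, bool):
            errors.append(f"{field}: expected {ftype.__name__}, got {value!r}")
        elif not lo <= value <= hi:
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

print(check_config({"pool_size": 0, "timeout": 30, "retry_count": 3}))
```

Note how this one script satisfies two of the good action items at once: it is the CI validation step, and the `SCHEMA` table doubles as documentation of valid ranges.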
Track Them Religiously
Create the action items as tickets in your issue tracker (Jira, Linear, GitHub Issues) immediately after the postmortem meeting. Do not leave them in a Google Doc. Review completion status in your weekly team meeting. According to Atlassian's incident management guide, teams that track postmortem action items in their standard issue tracker complete 78% of them, compared to 23% for teams that track them in standalone documents.
Prioritize Ruthlessly
Not every action item needs to be done immediately. Use this priority framework:
- P0 (Do today): The same incident could happen again right now
- P1 (Do this sprint): High-impact prevention with clear implementation
- P2 (Do this quarter): Important but not urgent, or requires significant effort
- P3 (Backlog): Nice to have, will prevent edge cases
7. Common Anti-Patterns
The Blame Game
"John caused the outage." Even if John's action triggered the incident, the system should have prevented it. Why was it possible for one person's action to cause a production outage? That is the system failure to investigate.
The Single Root Cause Fallacy
"The root cause was a typo." No. The root cause is that your system allows typos to reach production. Typos are inevitable; production outages from typos are preventable.
The Action Item Graveyard
Fifty action items from the last ten postmortems, none completed. This is worse than having no postmortem process at all, because it teaches the team that postmortems are performative — they exist to check a box, not to improve things.
The Missing Postmortem
"It was a small incident, we do not need a postmortem." Small incidents are where the biggest learning opportunities hide. They are the canaries in the coal mine. A 5-minute outage that you fixed quickly might reveal the same systemic weakness as a 5-hour outage that you catch before it happens.
The Recency Bias
Only analyzing the trigger (what changed right before the incident) while ignoring latent conditions (what was already broken). The trigger is usually the least interesting part. The interesting question is: what pre-existing conditions made the trigger dangerous?
8. Building a Postmortem Culture
A postmortem process is only as good as the culture that supports it. Here is how to build that culture:
- Leaders go first. When a leader publicly runs a blameless postmortem on their own mistake, it signals that postmortems are about learning, not punishment.
- Celebrate postmortems. Share them broadly. At Google, postmortems are available to the entire company. They are treated as valuable learning resources, not embarrassing confessions.
- Never punish the messenger. If someone discovers they caused an incident and reports it, they should be thanked, not punished. The alternative — people hiding incidents — is far more dangerous.
- Review postmortems quarterly. Look for patterns across incidents. Are the same types of incidents recurring? Are action items being completed? Are the right types of improvements being made?
- Invest in prevention. If your team is constantly fighting fires, they will not have time for postmortems or prevention. Create space for reliability work.
9. My Opinionated Take
Every incident is a gift. I know this sounds like toxic positivity, but I genuinely believe it. Every incident reveals something about your system that you did not know. Without the incident, that weakness would remain hidden until it caused something worse. The postmortem is how you extract value from the incident.
Three action items are better than fifteen. After a major incident, it is tempting to list every possible improvement. Resist this. Pick the three highest-impact items that you will actually complete. Three completed improvements are worth infinitely more than fifteen items in a backlog.
Blamelessness is not optional. If you cannot run blameless postmortems, you cannot improve. People will hide information, cover up mistakes, and avoid reporting near-misses. Your postmortem process will produce fiction, not insights. This is non-negotiable.
Automate the human out of the loop. Every time a postmortem reveals that a human made an error, the correct action item is not "tell humans to be more careful." It is "make it impossible for this error to reach production." Humans are unreliable; systems should compensate for that.
10. Action Plan: Implementing Postmortems
Week 1: Set Up the Process
- Create a postmortem template in your team's wiki/docs
- Define incident severity levels and which require postmortems
- Designate a postmortem facilitator role (can rotate)
- Create a shared folder/channel for postmortem documents
Week 2: Run Your First Postmortem
- Pick a recent incident (even a minor one)
- Compile the timeline from logs and chat history
- Run the meeting following the structure above
- Write up the document and share it with the team
- Create action item tickets in your issue tracker
Ongoing: Build the Habit
- Run a postmortem within 48 hours of every qualifying incident
- Review action item completion weekly
- Review postmortem patterns quarterly
- Celebrate completed action items — they represent real improvement
Sources
- Google SRE Book — Postmortem Culture
- PagerDuty — 2024 State of Digital Operations
- John Allspaw — Each Necessary But Only Jointly Sufficient
- Atlassian — Incident Management and Postmortems
- Sidney Dekker — Just Culture
I'm Ismat, and I build BirJob — Azerbaijan's job aggregator scraping 80+ sources daily.
