Why I don’t like discussing action items during incident reviews
I’m not a fan of talking about action items during incident reviews.
My whole shtick is that I believe updating people's mental models will have a more significant positive impact on the system than discussing action items, but boy is that a tough sell.
— @norootcause@hachyderm.io, September 26, 2024
Judging from the incident review meetings I’ve attended throughout my career, this is a minority view, and I wanted to elaborate here on why I think this way. For more on this topic, I encourage readers to check out John Allspaw’s 2016 blog post entitled Etsy’s Debriefing Facilitation Guide for Blameless Postmortems, as well as the Etsy Debriefing Facilitation Guide itself. Another starting point I will shamelessly recommend is Resilience engineering: where do I start?
Incident reviews
First, let’s talk about what an incident review is. It’s a meeting that takes place not too long after an incident has occurred, to discuss the incident. In many organizations, these meetings are open to any employee interested in attending, which means that they can have potentially large and varied audiences.
I was going to write “the goal of an incident review is…” in the paragraph above, but the whole purpose of this post is to articulate how my goals differ from other people’s goals.
My claims
Nobody fully understands how the system works. Once a company reaches a certain size, the software needs to get broken up across different teams. Ideally, the division is such that the teams are able to work relatively independently of each other. These are the well-defined abstractions and the low coupling that we all prize in large-scale systems. As a consequence, there’s no single person who actually fully understands how the whole system works. It’s just too large and complex. And this actually understates the problem, given the complexity of the platforms we build on top of. Even if I’m the sole developer of a Java application, there’s a good chance that I don’t understand the details of the garbage collection behavior of the JVM I’m using.
The gaps in our understanding of how the system works contribute to incidents. Because we don’t have a full understanding of how the system works, we can’t ever fully reason about the impact of every single change that we make. I’d go so far as to say that, in every single incident, there’s something important that somebody didn’t know. That means that gaps in our understanding are dangerous in addition to being omnipresent.
The way that work is done profoundly affects incidents, both positively and negatively, but that work is mostly invisible. Software systems are socio-technical systems, and the work that the people in your organization do every day is part of how the system works. This day-to-day work enables, triggers, exacerbates, prevents, lessens, and remediates incidents. And sometimes the exact same work in one context will prevent an incident and in another context will enable an incident! However, we generally don’t see what the real work is like. I’m lucky if my teammates have any sense of what my day-to-day work looks like, including how I use the internal tools to accomplish this work. The likelihood that people on other teams know how I do this work is close to zero. Even the teams that maintain the internal tooling have few opportunities to see this work directly.
Incident reviews are an opportunity for many people to gain insight into how the system works. An incident review is an opportunity to examine an aspect of the socio-technical system in detail. It’s really the only meeting of its kind where you can potentially have such a varied cross-section of the company getting into the nitty-gritty details of how things work. Incident reviews give us a flashlight that we get to shine on a dark corner of the system.
The best way to get a better understanding of how the system behaves is to look at how the system actually behaved. This phrasing should sound obvious, but it’s the most provocative of these claims. Every minute you spend discussing action items is a minute you are not spending learning more about how the system behaved. I feel similarly about discussing counterfactuals (if there had been an alert…). These discussions take the focus away from how the system actually behaved, and enter a speculative world about how the system might behave under a different set of circumstances.
We don’t know what other people don’t know. We all have incomplete, out-of-date models of how the system works, and that includes our models of other people’s models! That means that, in general, we don’t know what other people don’t know about the system. We don’t know in advance what people are going to learn that they didn’t know before!
There are tight constraints on incident review meetings. There is a fixed amount of time in an incident review meeting, which means that every minute spent on topic X means one less minute to spend discussing topic Y. Once that meeting is over, the opportunity to bring this group of people together to update their mental models is gone.
Action item discussions are likely to be of interest to a smaller fraction of the audience. This is a very subjective observation, but my theory is that people tend to find that incident reviews don’t have a lot of value precisely because too much of the time is spent discussing action items, and the details of the proposed action items are of potential interest to only a very small subset of the audience.
Teams are already highly incentivized to implement action items that prevent recurrence. Often I’ll go to an incident review, and there will be mention of multiple action items that have already been completed. As an observer, I’ve never learned anything from hearing about these.
A learning meeting will never happen later, but an action item discussion will. There’s no harm in having an action item discussion in a future meeting. In fact, teams are likely to have to do this anyway when they do their planning work for the next quarter. However, once the incident review meeting is over, the opportunity for having a learning-style meeting is gone, because the org’s attention has moved on to the next thing.
More learning up-front will improve the quality of action items. The more you learn about the system, the better your proposed action items are likely to be. But the reverse isn’t true.
Why not do both learning and action items during an incident review?
Hopefully the claims above address the question of why not do both activities. There’s a finite amount of time in an incident review meeting, which means there’s a fundamental tradeoff between time spent learning and time spent discussing action items, and I believe that devoting the entire time to learning will maximize the return on investment of the meeting. I also believe that additional action item discussions are much more likely to happen after the incident review meeting, but that learning discussions won’t.
Why I think people emphasize action items
Here’s my mental model of why people are so keen on emphasizing action items as the outcome of the meeting.
Learning is fuzzy, actions are concrete. An incident review meeting is an expensive meeting for an organization. Action items are a legible outcome of a meeting: they are an indicator to the organization that the meeting had value. The value of learning, of updated mental models, is invisible.
Incidents make orgs uncomfortable and action items reassure them. Incidents are evidence that we are not fully in control of our system, and action items make us feel like this uncomfortable uncertainty has been addressed.


