Two thought experiments
Here’s a thought experiment that John Allspaw related to me, in paraphrased form (John tells me that he will eventually capture this in a blog post of his own, at which time I’ll put a proper link).
Consider a small-ish tech company that has four engineering teams (A, B, C, D), where an engineer from Team A was involved in an incident (in John’s telling, the incident involves the Norway problem). In the wake of this incident, a post-incident write-up is completed, and the write-up does a good job of describing what happened. Next, imagine that the write-up is made available to teams A, B, and C, but not to team D. Nobody on team D is allowed to read the write-up, and nobody from the other teams is permitted to speak to team D about the details of the incident. The question is: are the members of team D at a disadvantage compared to the other teams?
The point of this scenario is to convey the intuition that, even though team D wasn’t involved in the incident, its members can still learn something from the details of that incident that makes them better engineers.
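(If you haven’t come across the Norway problem before: it’s the classic YAML pitfall where a YAML 1.1 parser treats the unquoted country code NO as the boolean false. The snippet below is just a generic illustration of the failure mode, not the details of the incident in John’s example.)

```python
# The "Norway problem": a YAML 1.1 parser (such as PyYAML) resolves the
# unquoted country code NO to the boolean false.
import yaml  # PyYAML

doc = """
countries:
  - GB
  - IE
  - NO
"""

print(yaml.safe_load(doc))
# {'countries': ['GB', 'IE', False]}  <- Norway has silently become False
```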
Switching gears for a moment, let’s talk about the new tools that are emerging under the label “AI SRE”. We’re now starting to see more tools that leverage LLMs to try to automate incident diagnosis and remediation, such as incident.io’s AI SRE product, Datadog’s Bits AI SRE, Resolve.ai (tagline: “Your always-on AI SRE”), and Cleric (tagline: “AI SRE teammate”). These tools work by reading in signals from your organization such as alerts, metrics, Slack messages, and source code repositories.
To effectively diagnose what’s happening in your system, you don’t just want to know what’s happening right now: you also want access to historical data, since a similar problem may have happened, say, a year ago. While LLMs will have been trained on a lot of general knowledge about software systems, they won’t have been trained on the specific details of your system, and your system will fail in system-specific ways, which means that (I assume!) these AI SRE tools will work better if they have access to historical data about your system.
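To make that concrete, here’s a rough sketch of what “having access to historical data” might look like: index your past write-ups, then retrieve the ones most similar to the current alert so they can be pulled into the tool’s context. This is my guess at the general shape, not how any of these vendors actually implement it; the TF-IDF similarity here is a stand-in for whatever embedding or search they really use, and the incident data is made up.

```python
# Sketch: retrieve the historical incident write-ups most similar to a
# new alert, so an LLM-based tool could include them in its context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past_incidents = [
    "2024-03-02: checkout latency spike caused by an exhausted DB connection pool",
    "2024-07-19: config push parsed country code NO as a boolean, billing jobs failed",
    "2025-01-11: cache stampede after TTLs expired simultaneously during a deploy",
]

new_alert = "billing pipeline failing after a config change to the country list"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(past_incidents + [new_alert])

# Compare the new alert (last row) against every historical write-up.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for score, incident in sorted(zip(scores, past_incidents), reverse=True):
    print(f"{score:.2f}  {incident}")
```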
Here’s a second thought experiment, this one my own: Imagine that you’ve adopted one of these AI SRE tools, but the only historical data about your system that you can feed the tool is the collection of your company’s post-incident write-ups. What kinds of details would be useful to an AI SRE tool in helping to troubleshoot future incidents? Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company. I bet the humans who read them would learn more that way too.
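Just as an illustration of what “writing for a machine reader” might look like, here’s one guess at the kinds of fields such a write-up could capture. None of these field names come from any particular tool or standard; they’re simply the details I’d want on hand when troubleshooting something similar later.

```python
# Purely illustrative: hypothetical fields an incident write-up could
# capture so that both humans and an AI SRE tool can learn from it.
incident_writeup = {
    "summary": "Billing jobs failed after a config change to the country list",
    "detection": "Alert on billing job error rate at 14:02 UTC",
    "signals_that_helped": [
        "Spike in YAML parse warnings in the config-loader logs",
        "Diff of the config repo showed 'NO' added unquoted",
    ],
    "signals_that_misled": [
        "A coincident deploy of the payments service looked suspicious at first",
    ],
    "diagnosis": "YAML 1.1 parsed the unquoted country code NO as the boolean false",
    "remediation": "Quoted all country codes; added a schema check in CI",
    "services_affected": ["billing", "config-loader"],
}
```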


