Confusion is a hallmark of a complex incident. In the moment, we know something is wrong, but we struggle to make sense of the different signals that we’re seeing. We don’t understand the underlying failure mode.
After the incident is over and the engineers have had a chance to dig into what happened, these confusing signals make sense in retrospect. We find out about the bug or inadvertent config change or unexpected data corruption that led to the symptoms we saw during the incident.
When writing up the narrative, the incident investigator must choose whether to inform the reader in advance about the details of the failure mode, or to withhold this info until the point in time in the narrative when the engineers involved understood what was happening.
I prefer the first approach: giving the reader information about the failure mode details in the narrative before the actors involved in the incident have that information. This enables the reader to make sense of the strange, anomalous signals in a way that the engineers in the moment were not able to.
I do this because, as a reader, I don’t enjoy the feeling of being confused: I’m not looking for a mystery when I read a writeup. If I’m reading about a series of confusing signals that engineers are looking at (e.g., traffic spikes, RPC errors), and I can’t make sense of them either, I tend to get bored. It’s just a mess of confusion.
On the other hand, if I know why these signals are happening, but the characters in the story don’t know, then that is more effective in creating tension in my mind. I want to read on to resolve the tension, to figure out how the engineers ended up diagnosing the problem.
When informing the reader about the failure mode in advance, the challenge is to avoid infecting the reader with hindsight bias. If the reader thinks, “the problem was obviously X. How could they not see it?”, then I’ve failed in the writeup. What I try to do is put the reader into the head of the people involved as much as possible: to try to convey the confusion they were experiencing in the moment, and the source of that confusion.
By enabling the reader to identify with the people involved, you can communicate how confusing the situation was to them, without directly inflicting that same confusion upon the reader.