Incidents make us feel uncomfortable. They remind us that we don’t have control, that the system can behave in ways that we didn’t expect. When an incident happens, the world doesn’t make sense.
A natural reaction to an incident is an effort to identify how the incident could have been avoided. The term for this type of effort is counterfactual reasoning. It refers to thinking about how, if the people involved had taken different actions, events would have unfolded differently. Here are two examples of counterfactuals:
- If the engineer who made the code change had written a test for feature X, then the bug would never have made its way into production.
- If the team members had paid attention to the email alerts that had fired, they would have diagnosed the problem much sooner.
Counterfactual reasoning is comforting because it restores the feeling that the world makes sense. What felt like a surprise is, in fact, perfectly comprehensible. What’s more, it could even have been avoided, if only we had taken the right actions and paid attention to the right signals.
While counterfactual reasoning helps restore our feeling that the world makes sense, the problem with it is that it doesn’t help us get better at avoiding or dealing with future incidents. The reason it doesn’t help is that counterfactual reasoning gives us an excuse to avoid the messy problem of understanding how we missed those obvious-in-retrospect actions and signals in the first place.
It’s one thing to say “they should have written a test for feature X”. It’s another thing to understand the rationale behind the engineer not writing that test. For example:
- Did they believe that this functionality was already tested in the existing test suite?
- Were they not aware of the existence of the feature that failed?
- Were they under time pressure to get the code pushed into production (possibly to mitigate an ongoing issue)?
Similarly, saying “they should have paid closer to attention to the email alerts” means you might miss the fact that the email alert in question isn’t actionable 90% of the time, and so the team has conditioned themselves to ignore it.
To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions. It replaces “what were the circumstances that led to person X taking action Y” with “person X should have done Z instead of Y”.
Counterfactual reasoning is only useful if you have a time machine and can go back to prevent the incident that just happened. For the rest of us who don’t have time machines, counterfactual reasoning helps us feel better, but it doesn’t make us better at engineering and operating our systems. Instead, it actively prevents us from getting better.
Don’t ask “why didn’t they do Y instead of X?” Instead, ask, “how was it that doing X made sense to them at the time?” You’ll learn a lot more about the world if you ask questions about what did happen instead of focusing on what didn’t.