Labeling a root cause is predicting the future, poorly

Why do we retrospect on our incidents? Why spend the time doing those write-ups and holding review meetings? We don’t do this work as some sort of intellectual exercise for amusement. Rather, we believe that if we spend the time to understand how the incident happened, we can use that insight to improve the system in general, and availability in particular. We improve availability by preventing incidents as well as reducing the impact of incidents that we are unable to prevent. This post-incident work should help us do both.

The typical approach to post-incident work is to do a root cause analysis (RCA). The idea of an RCA is to go beyond the surface-level symptoms to identify and address the underlying problems revealed by the incident. After all, it's only by getting at the root of the problem that we will be able to permanently address it. When we do an RCA and attach the label root cause to something, we're making a specific claim. That claim is: we should focus our attention on the issues that we've labeled "root cause", because spending our time addressing these root causes will yield the largest improvements to future availability. Sure, it may be that there were a number of different factors involved in the incident, but we should focus on the root cause (or, sometimes, a small number of root causes), because those are the ones that really matter. Sure, the fact that Joe happened to be on PTO that day, and he's normally the one who spots these sorts of problems early, that's interesting, but it isn't the real root cause.

Remember that an RCA, like all post-incident work, is supposed to be about improving future outcomes. As a consequence, a claim about root cause is really a prediction about future incidents. It says that of all of the contributing factors to an incident, we are able to predict which factor is most likely to lead to an incident in the future. That’s quite a claim to make!

Here’s the thing, though. As our history of incidents teaches us over and over again, we aren’t able to predict how future incidents will happen. Sure, we can always tell a compelling story of why an incident happened, through the benefit of hindsight. But that somehow never translates into predictive power: we’re never able to tell a story about the next incident the way we can about the last one. After all, if we were as good at prediction as we are at hindsight, we wouldn’t have had that incident in the first place!

A good incident retrospective can reveal a surprisingly large number of different factors that contributed to the incident, providing signals for many different kinds of risks. So here’s my claim: there’s no way to know which of those factors is going to bite you next. You simply don’t possess a priori knowledge about which factors you should pay more attention to at the time of the incident retrospective, no matter what the vibes tell you. Zeroing in on a small number of factors will blind you to the role that the other factors might play in future incidents. Today’s “X wasn’t the root cause of incident A” could easily be tomorrow’s “X was the root cause of incident B”. Since you can’t predict which factors will play the most significant roles in future incidents, it’s best to cast as wide a net as possible. The more you identify, the more context you’ll have about the possible risks. Heck, maybe something that only played a minor role in this incident will be the trigger in the next one! There’s no way to know.

Even if you're convinced that you can identify the real root cause of the last incident, it doesn't actually matter. The last incident already happened; there's no way to prevent it now. What's important is not the last incident but the next one: we're looking at the past only as a guide to help us improve in the future. And while I think incidents are inherently unpredictable, here's a prediction I'm comfortable making: your next incident is going to be a surprise, just like your last one was, and the one before that. Don't fool yourself into thinking otherwise.

6 thoughts on "Labeling a root cause is predicting the future, poorly"

  1. I’m conflicted reading this post, as it seems that the main takeaway is “RCAs are useless because predicting the future is doomed to fail”.

    I mean, I understand you can be overconfident with an RCA, but that basically applies to anything remotely error-prone or human-made.

    More importantly, if you try predicting often enough, you may be wrong most of the time, but you might be right some of the time, and that can lead to durable progress. But progress can only happen if you do try sometimes.

    And so I think it is with RCAs, or any reaction to feedback in general: you try to learn from events and to improve things in the process.

    Building things is very much trying to predict the future: it fails often enough, but clearly not always. And maybe there’s a ceiling to progress, but we won’t reach it if we don’t even try.

    1. My belief is that we’ll do a better job at improving if we assume that all future incidents will be surprises, and therefore we focus on getting better at generally dealing with surprise.

  2. There is a profound difference between planning and preparation. A solid RCA with contributing factors can lead to useful corrective and improvement actions. Solid investigations are historical research that can inform the present and the future. We still study ancient military history to inform how we fight and win cyber and space wars. Use the information to plan safe work and to prepare for the next incident.

    1. Of course, we want to study our incidents! But there are better alternatives to the RCA approach. To extend the military analogy, you don’t want to keep fighting the last war.

  3. I think that rather than striving to fix all root causes, you also need to look at probability and impact (classic risk assessment), and focus your resources appropriately. But also be aware that these may change.

    Example: We detected an issue which occurs when a customer places an order during a daylight savings transition. Given these only happen twice per year, this impacts a tiny number of orders.

    However, we've since expanded to cover multiple timezones and are seeing a steady increase in orders. Hence what was a low-probability, low-impact issue is now medium-probability, medium-impact. (A sketch of the underlying edge case follows this comment.)

    Using your example, there's a high probability that one of the team might be on vacation or unavailable at any one time. So having a system that relies on just one person is a higher risk, and perhaps that's what needs fixing rather than the system issue.
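
To make the daylight-saving edge case in the comment above concrete, here's a minimal sketch using Python's standard-library zoneinfo. The timezone, dates, and the order-keying scenario are hypothetical illustrations, not details from the commenter's actual system:

```python
# A local wall-clock timestamp is ambiguous around a DST fall-back:
# the same local time occurs twice, an hour apart in real (UTC) time.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")  # hypothetical timezone for illustration

# On the 2024 US fall-back date, 01:30 local time happens twice:
# once in EDT (fold=0) and once, an hour later, in EST (fold=1).
first = datetime(2024, 11, 3, 1, 30, tzinfo=tz, fold=0)
second = datetime(2024, 11, 3, 1, 30, tzinfo=tz, fold=1)

print(first.astimezone(timezone.utc))   # 2024-11-03 05:30:00+00:00
print(second.astimezone(timezone.utc))  # 2024-11-03 06:30:00+00:00

# Same-zone comparisons ignore fold (PEP 495), so two orders placed an
# hour apart compare as equal; code that keys, dedupes, or sorts orders
# by local wall-clock time can silently collide or misorder them.
print(first == second)  # True
```

This also illustrates the commenter's point about probability shifting over time: each additional timezone the business covers adds, for zones that observe DST, two more transition windows per year in which this ambiguity can occur.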
