Making peace with “root cause” during anomaly response

We haven’t figured out the root cause yet.

Uttered by many an engineer while responding to an anomaly

One of the contributions of cognitive systems engineering is treating anomaly response as something worthy of study. Here’s how Woods and Hollnagel describe it:

In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity―what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

David D. Woods, Erik Hollnagel, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, Chapter 8, p71

This type of work will be instantly recognizable to anyone who has been involved in software operations work, even though the domains that cognitive systems engineering researchers initially focused on are completely different (e.g., nuclear power plants, anesthesiology, commercial aviation, space flight).

Anomaly response involves multiple people working together, coordinating on resolving a common problem. Here’s Woods and Hollnagel again, discussing an exchange between two anesthesiologists during a neurosurgery case:

The situation calls for an update to the shared model of the case and its likely trajectory in the future … The exchange is very compact and highly coded, yet it serves to update the common ground previously established at the start of the case… Interestingly, the resident and attending after the update appear to be without candidate explanations as several possibilities have been dismissed given other findings (the resident is quite explicit in this case. After describing the unexpected event, he also adds―“but no explanation”).

p93, ibid

Just like those anesthesiologists, practitioners in all domains often communicate using “compact and highly coded” jargon. I’ve seen claims that jargon is intended to obfuscate, but it’s just the opposite during anomaly response: a team that shares jargon can communicate more efficiently, because of the pre-existing shared context about the precise meaning of those terms (assuming, of course, the team members understand those terms the same way).

That brings us to the “root cause”. Let’s start with a few words about root cause analysis.

Root cause analysis (RCA) is an approach for identifying why an incident happened. It’s often linked to the Five Whys approach, which is itself associated with Toyota. Members of the resilience engineering community have been very critical of RCA. For one critical take, check out John Allspaw’s piece The Infinite Hows (or, the Dangers of The Five Whys). Allspaw makes a compelling case, and I agree with him.

What I’m arguing in this post is that the term “root cause” has a completely different connotation when used during anomaly response than when used during post-incident analysis. When, during anomaly response, an engineer says “I haven’t found the root cause”, they do not mean, “I have not yet performed a Five-Whys root cause analysis”. Instead, they mean “the signals I am observing are inconsistent with my mental model of how the system behaves”. Or, less formally, “I know something’s wrong here, but I don’t know what it is!”

When an engineer says “we don’t know the root cause yet” during anomaly response, everybody involved understands what they mean. If you were to reply “actually, there is no such thing as root cause”, the best response you could hope for is a blank stare. The engineers aren’t talking about Five-Whys in this context. Instead, they’re doing dynamic fault management. They’re trying to make sense of what’s currently happening.

Because I’m one of those folks who is critical of RCA, I used to try to encourage people to say, “we don’t understand the failure mode yet” instead of “we don’t know the root cause yet”. But I’m going to stop encouraging them, because I have come around to believing that “we don’t know the root cause” is effective jargon when coordinating during anomaly response.

The post-incident investigation context is another story entirely, and I’m still going to fight the battles against root cause during the post-incident work. But, just as I wouldn’t try to do a one-on-one interview with an engineer while they were engaged in an incident, I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Yes, it will probably make it a little harder to get an organization to shake off the notion of “root cause” if people still freely use the term during anomaly response. And I won’t use the term myself. But it’s a battle I no longer think is worth fighting.

3 thoughts on “Making peace with “root cause” during anomaly response”

  1. Well written entry, thanks. I find myself using the phrase “proximate cause” pretty often while investigating an outage. I mean it to be basically, the last straw that broke before the system changed state.
