Root cause of failure, root cause of success

Here are a couple of tweets from John Allspaw.

Thought exercise: locate an event that is deemed widely to be a *success* in your organization, and then look for the single "root cause" of that success. You will include this "root cause" in a company-wide report, so make sure it's specific enough to defend and support.
— John Allspaw (@allspaw) August 10, 2018

That’s the point of the thought exercise. 🙂 Finding a single “root cause” of a failure is the same as finding a single “root cause” of a success — subject to all pitfalls in doing so. 🙂
— John Allspaw (@allspaw) September 4, 2018

Succeeding at a project in an organization is like pushing a boulder up a hill when the boulder is too heavy for any single person to move.

A team working together to successfully move a boulder to the top of the hill

It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.

Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.

Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.

In comes the ball, down goes the boulder

This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.

In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete. Because what’s keeping the complex system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people that are constantly watching the boulder to see if it starts to slip, and applying force to keep it balanced.

A collection of people watching the boulder and pushing on it to keep it from falling

Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.

Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.

But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.

When the boulder falls, it means that the collection of processes wasn’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keeps the boulder up.
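Here’s a toy sketch of that idea in Python. It is not a model of any real system: the `Watcher` class, the `scenario` helper, and every number in it are made up for illustration. The boulder stays up as long as the combined push from the watchers, scaled by how much grip the hill offers, is at least as large as the disturbance.

```python
from dataclasses import dataclass


@dataclass
class Watcher:
    """One person (one control process) pushing back on the boulder."""
    strength: float             # force applied when fully alert (made-up units)
    attentiveness: float = 1.0  # 1.0 = fully alert, 0.5 = sluggish

    def push(self) -> float:
        return self.strength * self.attentiveness


def boulder_stays_up(watchers, disturbance, grip=1.0):
    """True if the combined push, scaled by hillside grip, offsets the disturbance."""
    total_push = sum(w.push() for w in watchers) * grip
    return total_push >= disturbance


def scenario(sick=False, rain=False, big_hit=False):
    """Hypothetical day on the hill; all thresholds are illustrative assumptions."""
    attentiveness = 0.5 if sick else 1.0
    # Two of the three watchers are the ones who might be sick.
    watchers = [
        Watcher(4.0, attentiveness),
        Watcher(4.0, attentiveness),
        Watcher(4.0),
    ]
    grip = 0.6 if rain else 1.0            # rain makes the hill slippery
    disturbance = 6.0 if big_hit else 4.5  # stray ball vs. something heavier
    return boulder_stays_up(watchers, disturbance, grip)


print(scenario())                                    # ordinary day: True
print(scenario(sick=True))                           # sluggish watchers only: True
print(scenario(rain=True))                           # slippery hill only: True
print(scenario(big_hit=True))                        # heavy strike only: True
print(scenario(sick=True, rain=True, big_hit=True))  # all at once: False
```

In this sketch, no single condition is enough to bring the boulder down. It only falls when the sickness, the rain, and the heavy strike coincide, so picking any one of them as "the" cause would be arbitrary.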

This is why “root cause of failure” doesn’t make sense in the context of complex systems failure: a collection of control processes keeps the system up and running, and a system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and its failure alone couldn’t have brought down the system.

What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.

9 thoughts on “Root cause of failure, root cause of success”

Dick says:

August 14, 2021 at 10:14 pm

It seems like you’ve rediscovered Cook’s “How Complex Systems Fail” but this time with cartoons and strange metaphors.

Tawanda Moyo says:

August 15, 2021 at 12:28 am

Interesting read. I’m reminded of this: https://how.complexsystems.fail/

1. Mamjja, J.F. says:
  
  September 6, 2021 at 7:42 pm
  
  Thanks for the link Tawanda. It was straightforward and informative.
  
marioharvey says:

August 30, 2021 at 9:27 am

I understand the whole “it takes a village” idea but you can definitely do a root cause analysis on success if you wanted. Of course, if you make a broad ask like “what’s the root cause of your organization’s success” that would be difficult to answer. However, if you narrowed it down to a feature, a release, or a specific event, you can definitely understand who the key players were and what made it successful. We do this all the time in retrospectives to understand what went well and what didn’t. The same applies to failure: you can definitely understand what caused a certain failure and how to mitigate it. However, you can usually only address the symptoms, and if the failure stems from some larger organizational or cultural issue, those can be harder to identify or even see due to bias blindness. But overall, I think both success and failure can be analyzed in meaningful and helpful ways.

Eric Brown says:

September 6, 2021 at 3:13 pm

I’d also add Charles Perrow’s _Normal Accidents_.

David L Ambrose says:

September 7, 2021 at 11:03 am

I’m reminded of the analysis of major failures, like an oil refinery explosion. You get into them, and you find the failure is caused by several things that shouldn’t have happened. If only one of the factors had taken place, there wouldn’t have been an adverse event.

It’s difficult to identify these things in advance, mostly because we can’t adequately model the system at a reasonable cost. Your best defense is to build a culture where everyone is supportive when things go wrong. This includes management, who should understand that their job is to be a firewall for their people and the designated @sshole when it comes to bad news. It’s rare to find both these traits in a manager, and exceedingly rare to find them in an entire organization.
