When an incident happens, the temptation is strong to identify a single cause. It’s as if the system is a chain, and we’re looking for the weak link that was responsible for the chain breaking. But, in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system, how many opportunities there are for problems (bugs, typo in config, bad data, …). If any one of these was enough to take down the system, then it would be like a house of cards, falling down all of the time.
What happens in successful organizations as that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.
Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “what was the root cause for the success of this feature?”
So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.
For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.