Imagine a colleague comes to you and says, “I’m doing the writeup for a recent incident. I have to specify causes, but I’m not sure which ones to put. Which of these do you think I should go with?”
- Engineer entered incorrect configuration value. The engineer put in the wrong config value, which led the critical foo system to return error responses.
- Non-actionable alerts. The engineer who specified the config had just come off an on-call shift where they had to deal with multiple alerts that fired the night before. All of those alerts turned out to be non-actionable. The engineer was tired the day they put in the configuration change. Had the engineer not had to deal with these alerts, they would have been sharper the next day, and likely would have spotted the config problem.
- Work prioritization. An accidentally incorrect configuration value was a known risk to the team, and they had been planning to build in some additional verification to guard against these sorts of configuration errors. But this work was de-prioritized in favor of work that supported a high-priority feature to the business, which involved coordination across multiple teams. Had the work not been de-prioritized, there would have been guardrails in place that would have prevented the config change from taking down the system.
- Power dynamics. The manager of the foo team had asked leadership for additional headcount, to enable the team to do both the work that was high priority to the business and to work on addressing known risks. However, the request was denied, and the available headcount was allocated to other teams, based on the perceived priorities of the business. If the team manager had had more power in the org, they would have been able to acquire the additional resources and address the known risks.
There’s a sense in which all of these can count as causes. If any of them weren’t present, the incident wouldn’t have happened. But we don’t see them the same way. I can guarantee that you’re never going to see power dynamics listed as a cause in an incident writeup, public or internal.
The reason is not that “incorrect configuration value” is somehow objectively more causal than power dynamics. Rather, the sorts of things that are allowed to be labelled as causes depend on the cultural norms of an organization. This is what people mean when they say that causes are socially constructed.
And who gets to determine what’s allowed to be labelled as a cause and what isn’t is itself a property of power dynamics. That’s because the things that are allowed to be called causes are things that an organization is willing to label as problems, which means they can receive organizational attention and resources in order to be addressed.
Remember this the next time you identify a contributing factor in an incident and somebody responds with, “that’s not why the incident happened.” That isn’t an objective statement of fact. It’s a value judgment about what’s permitted to be identified as a cause.