April 2023 – Surfing Complexity

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

claim of root cause is ex. of component substitution fallacy. All incidents that threaten failure reveal component weaknesses due to finite resources & tradeoffs -> easy to miss the critical systemic/emergent factors see min 25 https://t.co/OsYy2U8fsA
— David Woods (@ddwoods2) January 16, 2023

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.
Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt from his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so, we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change to be, the more effort we’ll spend on validating it and rolling it out.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components, then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.

Missing the forest for the trees: the component substitution fallacy