Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA

I think the value of the LFI approach rests on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software (and what it does) and the humans who do the work of developing and operating it.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).
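
To make that last example concrete, here’s a minimal sketch of what “explicitly set a reasonable timeout” can look like, using the JDK’s built-in java.net.http.HttpClient (the class name, endpoint URL, and timeout values are all made up for illustration). With this client, the request has no response timeout at all unless you set one, and the client sets no connect timeout of its own either.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class ExplicitTimeouts {
        public static void main(String[] args) throws Exception {
            // Without an explicit connectTimeout the client sets no connection
            // timeout of its own, and without a request timeout it will wait
            // for a response indefinitely.
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2)) // fail fast if the host is unreachable
                    .build();

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://inventory.internal.example/items")) // hypothetical endpoint
                    .timeout(Duration.ofSeconds(5)) // bound how long we wait for a response
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }

The particular values are a judgment call; the point is that the decision gets made deliberately rather than inherited from a library default.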

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

If you can’t tell a story about it, it isn’t real

We use stories to make sense of the world. What that means is that when events occur that don’t fit neatly into a narrative, we can’t make sense of them. As a consequence, these sorts of events are less salient, which means they’re less real.

In The Invisible Victims of American Anti-Semitism, Yair Rosenberg wrote in The Atlantic about the kinds of attacks targeting Jews that don’t get much attention in the larger media. His claim is that this happens when these attacks don’t fit into existing narratives about anti-Semitism (emphasis mine):

What you’ll also notice is that all of the very real instances of anti-Semitism discussed above don’t fall into either of these baskets. Well-off neighborhoods passing bespoke ordinances to keep out Jews is neither white supremacy nor anti-Israel advocacy gone awry. Nor can Jews being shot and beaten up in the streets of their Brooklyn or Los Angeles neighborhoods by largely nonwhite assailants be blamed on the usual partisan bogeymen.

That’s why you might not have heard about these anti-Semitic acts. It’s not that politicians or journalists haven’t addressed them; in some cases, they have. It’s that these anti-Jewish incidents don’t fit into the usual stories we tell about anti-Semitism, so they don’t register, and are quickly forgotten if they are acknowledged at all.

In The 1918 Flu Faded in Our Collective Memory: We Might ‘Forget’ the Coronavirus, Too, Scott Hershberger speculated in Scientific American along similar lines about why historians paid little attention to the Spanish Flu epidemic, even though it killed more people than World War I (emphasis mine):

For the countries engaged in World War I, the global conflict provided a clear narrative arc, replete with heroes and villains, victories and defeats. From this standpoint, an invisible enemy such as the 1918 flu made little narrative sense. It had no clear origin, killed otherwise healthy people in multiple waves and slinked away without being understood. Scientists at the time did not even know that a virus, not a bacterium, caused the flu. “The doctors had shame,” Beiner says. “It was a huge failure of modern medicine.” Without a narrative schema to anchor it, the pandemic all but vanished from public discourse soon after it ended.

I’m a big believer in the role of interactions, partial information, uncertainty, workarounds, tradeoffs, and goal conflicts as contributors to systems failures. I think the way to convince other people to treat these entities as first-class is to weave them into the stories we tell about how incidents happen. If we want people to see these things as real, we have to integrate them into narrative descriptions of incidents.

Because, if we can’t tell a story about something, it’s as if it didn’t happen.

When there’s no plan for this scenario, you’ve got to improvise

An incident is happening. Your distributed system has somehow managed to get itself stuck in a weird state. There’s a runbook, but because the authors didn’t foresee this failure mode ever happening, the runbook isn’t actually helpful here. To get the system back into a healthy state, you’re going to have to invent a solution on the spot.

In other words, you’re going to have to improvise.

“We gotta find a way to make this fit into the hole for this using nothing but that.” – scene from Apollo 13

Like uncertainty, improvisation is an aspect of incident response that we typically treat as a one-off, rather than as a first-class skill that we should recognize and cultivate. Not every incident requires improvisation to resolve, but the hairiest ones will. And it’s these most complex incidents that we need to worry about the most, because they’re the ones that are costliest to the business.

One of the criticisms of resilience engineering as a field is that it isn’t prescriptive. A response I often hear when I talk about resilience engineering research is “OK, Lorin, that’s interesting, but what should I actually do?” I think resilience engineering is genuinely helpful, and in this case it teaches us that improvisation requires local expertise, autonomy, and effective coordination.

To improvise a solution, you have to be able to effectively use the tools and technologies that are directly available to you in the situation, what Claude Lévi-Strauss referred to as bricolage. That means you have to know what those tools are, and you have to be skilled in their use. That’s the local expertise part. You’ll often need to leverage what David Woods calls a generic capability in order to solve the problem at hand: some element of technology that wasn’t explicitly designed to do what you need, but is generic enough that you can use it.

Improvisation also requires that the people with the expertise have the authority to take required actions. They’re going to need the ability to do risky things, which could potentially end up making things worse. That’s the autonomy part.

Finally, because of the complex nature of incidents, you will typically need to work with multiple people to resolve things. It may be that you don’t have the requisite expertise or autonomy, but somebody else does. Or it may be that the improvised strategy requires coordination across a group of people. I remember one incident where I was the incident commander: a problem was affecting a large number of services, and the only remediation strategy was to restart or re-deploy the affected services. We had to effectively “reboot the fleet”. The deployment tooling at the time didn’t support that sort of bulk activity, so we had to do it manually. A group of us, sitting in the war room (this was in pre-COVID days), divvied up the work of reaching out to all of the relevant service owners. We coordinated using Google Sheets. (In general, I’m opposed to writing automation scripts during an incident if doing the task manually is just as quick, because the blast radius of that sort of script is huge, and those scripts generally don’t get tested well before use because of the urgency.)

While we don’t know exactly what we’ll be called on to do during an incident, we can prepare to improvise. For more on this topic, check out Matt Davis’s piece on Site Reliability Engineering and the Art of Improvisation.

Treating uncertainty as a first-class concern

One of the things that complex incidents have in common is the uncertainty that the responders experience while the incident is happening. Something is clearly wrong; that’s why there’s an incident. But it’s hard to make sense of the failure mode, of what precisely the problem is, based on the signals that we can observe directly.

Eventually, we figure out what’s going on, and how to fix it. By the time the incident review rolls around, while we might not have a perfect explanation for every symptom that we witnessed during the incident, we understand what happened well enough that the in-the-moment uncertainty is long gone.

Cooperative Advocacy: An Approach for Integrating Diverse Perspectives in Anomaly Response is a paper by Jennifer Watts-Englert and David Woods that compares two incidents involving NASA space shuttle missions: one successful and the other tragic. The paper discusses strategies for dealing with what the authors call anomaly response, when something unexpected has happened. The authors describe a process they observed, which they call Cooperative Advocacy, for effectively dealing with uncertainty during an incident. They document how cooperative advocacy was applied in the successful NASA mission, and how it was not applied in the failed case.

It’s a good paper (it’s on my list!), and SREs and anyone else who deals with incidents will find it highly relevant to their work. For example, here’s a quote from the paper that I immediately connected with:

For anomaly response to be robust given all of the difficulties and complexities that can arise, all discrepancies must be treated as if they are anomalies to be wrestled with until their implications are understood (including the implications of being uncertain or placed in a difficult trade-off position). This stance is a kind of readiness to re-frame and is a basic building block for other aspects of good process in anomaly response. Maintaining this as a group norm is very difficult because following up on discrepancies consumes resources of time, workload, and expertise. Inevitably, following up on a discrepancy will be seen as a low priority for these resources when a group or organization operates under severe workload constraints and under increasing pressure to be “faster, better, cheaper”.

(See one of my earlier blog posts, chasing down the blipperdoodles).

But the point of this blog post isn’t to summarize this specific paper. Rather, it’s to call attention to the fact that anomaly response is a problem that we will face over and over again. Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

This paper treats anomaly response as a first-class entity, as a thing we need to worry about on an ongoing basis, as something we need to be able to get better at. We should do the same.

Missing the forest for the trees: the component substitution fallacy

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.

Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt of his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so, we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change to be, the more effort we’ll spend on validating and rolling it out.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components, then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.

Making peace with the imperfect nature of mental models

We all carry with us in our heads models about how the world works, which we colloquially refer to as mental models. These models are always incomplete, often stale, and sometimes they’re just plain wrong.

For those of us doing operations work, our mental models include our understanding of how the different parts of the system work. Incorrect mental models are always a factor in incidents: incidents are always surprises, and surprises are always discrepancies between our mental models and reality.

There are two things that are important to remember. First, our mental models are usually good enough for us to do our operations work effectively. Our human brains are actually surprisingly good at enabling us to do this stuff. Second, while a stale mental model is a serious risk, none of us have the time to constantly verify that all of our mental models are up to date. This is the equivalent of popping up an “are you sure?” modal dialog box before taking any action. (“Are you sure that pipeline that always deploys to the test environment still deploys to test first?”)

Instead, because our time and attention are limited, we have to get good at identifying cues that indicate our models have gotten stale or are incorrect. But since we won’t always get these cues, our mental models will inevitably go out of date. That’s just part of the job when you work in a dynamic environment. And we all work in dynamic environments.

Good category, bad category (or: tag, don’t bucket)

The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion; and to the very end of life, our location of all things in one space is due to the fact that the original extents or bignesses of all the sensations which came to our notice at once, coalesced together into one and the same space.

William James, The Principles of Psychology (1890)

I recently gave a talk at the Learning from Incidents in Software conference. On the one hand, I mentioned the importance of finding patterns in incidents.

But I also had some… unkind words about categorizing incidents.

We humans need categories to make sense of the world, to slice it up into chunks that we can handle cognitively. Otherwise, the world would just be, as James put it in the quote above, one great blooming, buzzing confusion. So, categorization is essential to humans functioning in the world. In particular, if we want to identify meaningful patterns in incidents, we need to do categorization work.

But there are different techniques we can use to categorize incidents, and some techniques are more productive than others.

The buckets approach

An incident must be placed in exactly one bucket

One technique is what I’ll call the buckets approach to categorization. That’s when you define a set of categories up front, and then you assign each incident that happens to exactly one bucket. For example, you might have categories such as:

  • bug (new)
  • bug (latent)
  • deployment problem
  • third party

I have seen two issues with the bucketing approach. The first is that I’ve never actually seen it yield any additional insight. It can’t provide insights into new patterns, because the patterns have already been predefined as the buckets. The best it can do is give you a filter for drilling down and looking at a subset of incidents in more detail. There’s some genuine value in having a subset of related incidents to look more closely at, but in practice, I’ve rarely seen anybody actually do the harder work of examining these subsets.

The second issue is that incidents, being messy things, often don’t fall cleanly into exactly one bucket. Sometimes they fall into multiple buckets, sometimes they don’t fall into any, and sometimes it’s just really unclear. For example, an issue may involve both a new bug and a deployment problem (as anyone who has accidentally deployed a bug to production and then gotten into even more trouble when trying to roll things back can tell you). By requiring you to put the incident into exactly one bucket, the bucket approach forces you to discard information that is potentially useful in identifying patterns. This inevitably leads to arguments about whether an incident should be classified into bucket A or bucket B. That sort of argument is a symptom that the approach is throwing away useful information, and that the incident really shouldn’t go into a single bucket at all.

The tags approach

You may hang multiple tags on an incident

Another technique is what I’ll call the tags method of categorization. With the tags method, you annotate an incident with zero, one, or multiple tags. The idea behind tagging is that you want to let the details of the incident help you come up with meaningful categories. As incidents happen, you may come up with entirely new categories, coalesce existing ones, or split them apart. Tags also let you examine incidents across multiple dimensions. Perhaps you’re interested in attributes of the people who are responding (maybe there’s a “hero” tag if there’s a frequent hero who jumps into many incidents), or maybe there’s production pressure related to some new feature being developed (in which case you may want to label with both production-pressure and feature-name), or maybe it’s related to migration work (migration). There are many possible dimensions. Here are some examples of potential tags:

  • query-scope-accidentally-too-broad
  • people-with-relevant-context-out-of-office
  • unforeseen-performance-impact

Those example tags may seem weirdly specific, but that’s OK! The tags might be very high level (e.g., production-pressure) or very low level (e.g., pipeline-stopped-in-the-middle), or anywhere in between.
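
To make the contrast between the two approaches concrete, here’s a minimal sketch of the two data models in Java (assuming Java 16+ for records; the incident ID, bucket names, and tag strings are all hypothetical). The bucket approach stores a single value drawn from a fixed enum, while the tags approach stores an open-ended set of labels.

    import java.util.Set;

    public class CategorizationSketch {

        // Bucket approach: the categories are fixed up front, and every
        // incident must be assigned exactly one of them.
        enum Bucket { BUG_NEW, BUG_LATENT, DEPLOYMENT_PROBLEM, THIRD_PARTY }

        record BucketedIncident(String id, Bucket bucket) {}

        // Tags approach: an incident carries zero or more labels, and the
        // label vocabulary can grow, merge, or split as new patterns emerge.
        record TaggedIncident(String id, Set<String> tags) {}

        public static void main(String[] args) {
            // Made-up incident ID and labels, purely for illustration.
            // A deploy that shipped a new bug doesn't fit cleanly into one
            // bucket: we're forced to throw away one of the two facts we know.
            var bucketed = new BucketedIncident("INC-1234", Bucket.DEPLOYMENT_PROBLEM);

            // With tags we can keep both, plus anything else worth remembering.
            var tagged = new TaggedIncident("INC-1234",
                    Set.of("deployment-problem", "bug-new", "rollback-made-it-worse"));

            System.out.println(bucketed);
            System.out.println(tagged);
        }
    }

The particular representation doesn’t matter; the point is that the single-valued field is what forces the “bucket A or bucket B?” arguments, while the set-valued field lets the incident keep all of the potentially useful information.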

Top-down vs circular

The bucket approach is strictly top-down: you enforce a categorization on incidents from the top. The tags approach is a mix of top-down and bottom-up. When you start tagging, you’ll always start with some prior model of the types of tags that you think are useful. As you go through the details of incidents, new ideas for tags will emerge, and you’ll end up revising your set of tags over time. Someone might revisit the writeup for an incident that happened years ago and add a new tag to it. This process of tagging incidents and identifying potential new tag categories will help you identify interesting patterns.

The tag-based approach is messier than the bucket-based one, because your collection of tags may be very heterogeneous, and you’ll still encounter situations where it’s not clear whether a tag applies or not. But it will yield a lot more insight.

Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk.

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context is a description of the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

A small mistake does not a complex systems failure make

I’ve been trying to take a break from Twitter lately, but today I popped back in, only to be trolled by a colleague of mine.

Here’s a quote from the story:

The source of the problem was reportedly a single engineer who made a small mistake with a file transfer.

Here’s what I’d like you to ponder, dear reader. Think about all of the small mistakes that happen every single day, at every workplace, on the face of the planet. If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time. And, yet, that clearly isn’t the case. For example, before this event, when was the last time the FAA suffered a catastrophic outage?

Now, it might be the case that no FAA employees have ever made a small mistake until now. Or, more likely, the FAA system works in such a way that small mistakes are not typically sufficient for taking down the entire system.

To understand this failure mode, you need to understand how it is that the FAA system is able to stay up and running on a day-to-day basis, despite the system being populated by fallible human beings who are capable of making small mistakes. You need to understand how the system actually works in order to make sense of how a large-scale failure can happen.

Now, I’ve never worked in the aviation industry, and consequently I don’t have domain knowledge about the FAA system. But I can tell you one thing: a small mistake with a file transfer is a hopelessly incomplete explanation for how the FAA system actually failed.

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident have to process a large number of signals that are coming in. One example of such signals is the large number of Slack messages appearing in the incident channel as incident responders and others provide updates and ask questions. Another example would be additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to mediate through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication would be officially frowned upon (“all incident-related communication should happen in the incident response channel!”), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.