The courage to imagine other failures

All other things being equal, what’s more expensive for your business: a fifteen-minute outage or an eight-hour outage? If you had to pick one, which would you pick? Hold that thought.

Imagine that you work for a company that provides a software service over the internet. A few days ago, your company experienced an incident where the service went down for about four hours. Executives at the company are pretty upset about what happened: “we want to make certain this never happens again” is a phrase you’ve heard several times.

The company held a post-incident review, and the review process identified a number of action items to prevent a recurrence of the incident. Some of this follow-up work has already been completed, but there are other items that are going to take your team a significant amount of time and effort. You already had a decent backlog of reliability work that you had been planning to knock out this quarter, but this incident has pushed that work onto the back burner.

One night, the Oracle of Delphi appears to you in a dream.

Priestess of Delphi (1891) by John Collier

The Oracle tells you that if you prioritize the incident follow-up work, then in a month your system is going to suffer an even worse outage, one that lasts eight hours. The failure mode for this outage will be very different from the last one. Ironically, one of the contributors to this outage will be an unintended change in system behavior triggered by the follow-up work. Another contributor will be a known risk to the system that you had been working on addressing, but that you put off after the incident changed your priorities.

She goes on to tell you that if you instead do the reliability work that was on your backlog, you will avoid this outage. However, your system will instead experience a fifteen-minute outage, with a failure mode very similar to the one you recently experienced. The impact will be much smaller because of the follow-up work that has already been completed, and because the engineers are now more experienced with this type of failure.

Which path do you choose: the novel eight-hour outage, or the “it happened again!” fifteen-minute outage?

By prioritizing preventative work from recent incidents, we are implicitly assuming that a recent incident is the one most likely to bite us again in the future. It’s important to remember that this is an illusion: we feel like the follow-up work is the most important thing we can do for reliability because we have a visceral sense of the incident we just went through. It’s much more real to us than a hypothetical, never-happened-before future incident. Unfortunately, we only have a finite amount of resources to spend on reliability work, and our memory of the recent incident does not mean that the follow-up work is the reliability work that will provide the highest return on investment.

In real life, we are never granted perfect information about the future consequences of our decisions. We have only our own judgment to guide us on how we should prioritize our work based on the known risks. Always prioritizing the action items from the last big incident is the easy path. The harder one is imagining the other types of incidents that might happen in the future, and recognizing that those might actually be worse than a recurrence. After all, you were surprised before. You’re going to be surprised again. That’s the real generalizable lesson of that last big incident.

Any change can break us, but we can’t treat every change the same

Here are some excerpts from an incident story told by John Allspaw about his time at Etsy (circa 2012), titled Learning Effectively From Incidents: The Messy Details.

In this story, the site goes down:

September 2012 afternoon, this is a tweet from the Etsy status account saying that there’s an issue on the site… People said, oh, the site’s down. People started noticing that the site is down.

Possibly the referenced issue?

This is a tough outage: the web servers are down so hard that they aren’t even reachable:

And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. And people said, well, we can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights out for hard reboots. And people even said, well, because we’re talking about hundreds of web servers. Could it be faster, we could even just power cycle these. This is a big deal here. So whatever it wasn’t in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

One of the contributors? A CSS change to remove support for old browsers!

And one of the tasks was with the performance team and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So, let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff. …
And in this case, we had this template-based template used as far as we knew everything, and this little header-ie.css, was the actual workaround. And so the idea was, let’s remove all the references to this CSS file in this base template and we’ll remove the CSS file.

How does a CSS change contribute to a major outage?

The request would come in for something that wasn’t there, 404 would happen all the time. The server would say, well, I don’t have that. So I’m going to give you a 404 page and so then I got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers hung, nothing could be done.

I love this story because a CSS change feels innocuous. CSS just controls presentation, right? How could that impact availability? From the story (emphasis mine):

And this had been tested and reviewed by multiple people. It’s not all that big of a deal of a change, which is why it was a task that was sort of slated for the next person who comes through boot camp in the performance team.

The reason a CSS change can cascade into an outage is that in a complex system there are all of these couplings that we don’t even know are there until we get stung by them.
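
To make that coupling concrete, here’s a minimal, hypothetical sketch of the feedback loop described above. This is not Etsy’s actual stack or code; it only models the shape of the loop: the 404 template still references a stylesheet that has been removed, so constructing the error page triggers another 404, which constructs another error page, and so on.

    // Hypothetical sketch of the 404 feedback loop; not Etsy's actual code.
    import java.util.Set;

    public class NotFoundLoop {
        // header-ie.css has been deleted, but the 404 template still references it
        static final Set<String> ASSETS = Set.of("/index.html");

        static String handleRequest(String path) {
            if (ASSETS.contains(path)) {
                return "200 OK: " + path;
            }
            return build404Page(path);
        }

        static String build404Page(String missingPath) {
            // Building the error page fetches the (missing) stylesheet,
            // which produces another 404 page, which fetches it again...
            String styles = handleRequest("/css/header-ie.css");
            return "404: " + missingPath + " [" + styles + "]";
        }

        public static void main(String[] args) {
            // Never returns normally: a StackOverflowError in this toy version,
            // hung Apache processes across hundreds of servers in the real incident.
            handleRequest("/some/old/page");
        }
    }

The point of the sketch is that nothing in the CSS change itself looks like it touches request handling; the loop only becomes visible once the asset actually goes missing.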

One lesson you might take away from this story is “you should treat every proposed change like it could bring down the entire system”. But I think that’s the wrong lesson, because of another constraint we all face: finite resources. Perhaps in a world where we always had an unlimited amount of time to make any change, we could take this approach. But we don’t live in that world. We only have a fixed number of hours in a week, which means we need to budget our time. And so we make judgment calls on how much time we’re going to spend on manually validating a change based on how risky we perceive that change to be. When I review someone else’s pull request, for example, the amount of effort I spend on it varies with the nature of the change: I’m going to look more closely at changes to database schemas than at changes to log messages.

But that means that we’re ultimately going to miss some of these CSS-change-breaks-the-site kinds of changes. It’s fundamentally inevitable that this is going to happen: it’s simply in the nature of complex systems. You can try to add process to force people to scrutinize every change with the same level of effort, but unless you remove schedule pressure, that’s not going to have the desired effect. People are going to make efficiency-thoroughness tradeoffs because they are held accountable for hitting their OKRs, and they can’t achieve those OKRs if they put in the same amount of effort to evaluate every single production change.

Given that we can’t avoid such failures, the best we can do is to be ready to respond to them.

“Human error” means they don’t understand how the system worked

One of the services that the Amazon cloud provides is called S3, which is a data storage service. Imagine a hypothetical scenario where S3 had a major outage, and Amazon’s explanation of the outage was “a hard drive failed”.

Engineers wouldn’t believe this explanation. It’s not that they would doubt that a hard drive failed; we know that hard drives fail all of the time. In fact, it’s precisely because hard drives are prone to failure, and S3 stays up, that they wouldn’t accept this as an explanation. S3 has been architected to function correctly even in the face of individual hard drives failing. While a failed hard drive could certainly be a contributor to an outage, it can’t be the whole story. Otherwise, S3 would constantly be going down. To say “S3 went down because a hard drive failed” is to admit “I don’t know how S3 normally works when it experiences hard drive failures”.
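
To see why, here’s a deliberately simplified sketch of a replicated read path. It’s a generic illustration of redundancy, not S3’s actual design: a read fails only if every replica is unavailable, so a single failed drive cannot, by itself, explain an outage.

    // Simplified illustration of redundancy; hypothetical, not S3's design.
    import java.util.List;
    import java.util.Optional;

    public class ReplicatedRead {
        interface Drive {
            Optional<byte[]> read(String key); // empty if this drive has failed
        }

        // Each object is stored on several drives; the read succeeds as long
        // as any one replica is still healthy.
        static Optional<byte[]> get(String key, List<Drive> replicas) {
            for (Drive drive : replicas) {
                Optional<byte[]> value = drive.read(key);
                if (value.isPresent()) {
                    return value; // one healthy replica is enough
                }
            }
            return Optional.empty(); // only fails if *every* replica fails
        }
    }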

We accept “human error” as the explanation for failures of reliable systems. Now, I’m a bit of an extremist when it comes to the idea of human error: I believe it simply doesn’t exist. But let’s put that aside for now and assume that human error is a real thing, and that people make mistakes. The thing is, humans are constantly making mistakes. Every day, in every organization, many people are making many mistakes. The people who work on systems that stay up most of the time are not some sort of hyper-vigilant super-humans who make fewer mistakes than the rest of us. Rather, these people are embedded within systems that have evolved over time to be resistant to these sorts of individual mistakes.

As the late Dr. Richard Cook (no fan of the concept of “human error” himself) put it in How Complex Systems Fail: “Complex systems are heavily and successfully defended against failure”. As a consequence of this, “Catastrophe requires multiple failures – single point failures are not enough.”

Reliable systems are error-tolerant. There are mechanisms within such systems to guard against the kinds of mistakes that people make on a regular basis. Ironically, these mechanisms are not necessarily designed into the system: they can evolve organically and invisibly. But they are there, and they are the reason that these systems stay up day after day.

What this means is that when someone attributes a failure to “human error”, it means that they do not see these defenses in the system, and so they don’t actually have an understanding of how all of these defenses failed in this scenario. When you hear “human error” as an explanation for why a system failed, you should think “this person doesn’t know how the system stays up.” Because without knowing how the system stays up, it is impossible to understand the cases where it comes down.

(I believe Cook himself said something to the effect of “human error is the point where they stopped asking questions”).

Normal incidents

In 1984, the late sociologist Charles Perrow published the book Normal Accidents: Living with High-Risk Technologies. In this book, he proposed a theory that accidents are unavoidable in systems that have certain properties, and that nuclear power plants have these properties. In such systems, accidents will inevitably occur during the normal course of operations.

You don’t hear much about Perrow’s Normal Accident Theory these days, as it has been superseded by other theories in safety science, such as High Reliability Organizations and Resilience Engineering (although see Hopkins’s 2013 paper Issues in safety science for criticisms of all three theories). But even rejecting the specifics of Perrow’s theory, the idea of a normal accident or incident is a useful one.

An incident is an abnormal event. Because of this, we assume, reasonably, that an incident must have an abnormal cause: something must have gone wrong in order for the incident to have happened. And so we look to find where the abnormal work was, where it was that someone exercised poor judgment that ultimately led to the incident.

But incidents can happen as a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.

This concept, that all actions and decisions that contributed to an incident were reasonable in the moment they were made, is unintuitive. It requires a very different conceptual model of how incidents happen. But, once you adopt this conceptual model, it completely changes the way you understand incidents. You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?” And this yields very different insights into how the system actually works, how it is that incidents don’t usually happen due to normal work, and how it is that they occasionally do.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA

I think the value of the LFI approach is based on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software and what it does, and the humans who do the work to develop and operate the system.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).
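
As an illustration of that last example, here’s a minimal sketch of setting explicit timeouts using Java’s java.net.http.HttpClient, which by default will wait indefinitely both to connect and for a response. The endpoint URL and the specific timeout values are made up for the example.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class ExplicitTimeouts {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2)) // don't hang forever on an unreachable host
                    .build();

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://internal.example.com/report")) // hypothetical endpoint
                    .timeout(Duration.ofSeconds(5)) // bound how long we wait for a response
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }

The particular numbers matter less than the fact that somebody consciously chose them, and that kind of choice is exactly the sort of decision that improves once engineers have seen what unbounded waits can do to a system during an incident.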

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

If you can’t tell a story about it, it isn’t real

We use stories to make sense of the world. What that means is that when events occur that don’t fit neatly into a narrative, we can’t make sense of them. As a consequence, these sorts of events are less salient, which means they’re less real.

In The Invisible Victims of American Anti-Semitism, Yair Rosenberg wrote in The Atlantic about the kinds of attacks targeting Jews that don’t get much attention in the larger media. His claim is that this happens when these attacks don’t fit into existing narratives about anti-Semitism (emphasis mine):

What you’ll also notice is that all of the very real instances of anti-Semitism discussed above don’t fall into either of these baskets. Well-off neighborhoods passing bespoke ordinances to keep out Jews is neither white supremacy nor anti-Israel advocacy gone awry. Nor can Jews being shot and beaten up in the streets of their Brooklyn or Los Angeles neighborhoods by largely nonwhite assailants be blamed on the usual partisan bogeymen.

That’s why you might not have heard about these anti-Semitic acts. It’s not that politicians or journalists haven’t addressed them; in some cases, they have. It’s that these anti-Jewish incidents don’t fit into the usual stories we tell about anti-Semitism, so they don’t register, and are quickly forgotten if they are acknowledged at all.

In The 1918 Flu Faded in Our Collective Memory: We Might ‘Forget’ the Coronavirus, Too, Scott Hershberger speculated in Scientific American along similar lines about why historians paid little attention to the Spanish Flu epidemic, even though it killed more people than World War I (emphasis mine):

For the countries engaged in World War I, the global conflict provided a clear narrative arc, replete with heroes and villains, victories and defeats. From this standpoint, an invisible enemy such as the 1918 flu made little narrative sense. It had no clear origin, killed otherwise healthy people in multiple waves and slinked away without being understood. Scientists at the time did not even know that a virus, not a bacterium, caused the flu. “The doctors had shame,” Beiner says. “It was a huge failure of modern medicine.” Without a narrative schema to anchor it, the pandemic all but vanished from public discourse soon after it ended.

I’m a big believer in the role of interactions, partial information, uncertainty, workarounds, tradeoffs, and goal conflicts as contributors to systems failures. I think the way to convince other people to treat these entities as first-class is to weave them into the stories we tell about how incidents happen. If we want people to see these things as real, we have to integrate them into narrative descriptions of incidents.

Because if we can’t tell a story about something, it’s as if it didn’t happen.

When there’s no plan for this scenario, you’ve got to improvise

An incident is happening. Your distributed system has somehow managed to get itself stuck in a weird state. There’s a runbook, but because the authors didn’t foresee this failure mode ever happening, the runbook isn’t actually helpful here. To get the system back into a healthy state, you’re going to have to invent a solution on the spot.

In other words, you’re going to have to improvise.

“We gotta find a way to make this fit into the hole for this using nothing but that.” – scene from Apollo 13

Like uncertainty, improvisation is an aspect of incident response that we typically treat as a one-off, rather than as a first-class skill that we should recognize and cultivate. Not every incident requires improvisation to resolve, but the hairiest ones will. And those most complex incidents are the ones we need to worry about the most, because they’re the ones that are costliest to the business.

One of the criticisms of resilience engineering as a field is that it isn’t prescriptive. Often, when I talk about resilience engineering research, the response I hear is “OK, Lorin, that’s interesting, but what should I actually do?” I think resilience engineering is genuinely helpful, and in this case it teaches us that improvisation requires local expertise, autonomy, and effective coordination.

To improvise a solution, you have to be able to effectively use the tools and technologies that are directly available to you in the situation, what Claude Lévi-Strauss referred to as bricolage. That means you have to know what those tools are, and you have to be skilled in their use. That’s the local expertise part. You’ll often need to leverage what David Woods calls a generic capability in order to solve the problem at hand. That’s some element of technology that wasn’t explicitly designed to do what you need, but is generic enough that you can use it.

Improvisation also requires that the people with the expertise have the authority to take required actions. They’re going to need the ability to do risky things, which could potentially end up making things worse. That’s the autonomy part.

Finally, because of the complex nature of incidents, you will typically need to work with multiple people to resolve things. It may be that you don’t have the requisite expertise or autonomy, but somebody else does. Or it may be that the improvised strategy requires coordination across a group of people. I remember one incident where I was the incident commander: a problem was affecting a large number of services, and the only remediation strategy was to restart or re-deploy the affected services. We had to effectively “reboot the fleet”. The deployment tooling at the time didn’t support that sort of bulk activity, so we had to do it manually. A group of us, sitting in the war room (this was in pre-COVID days), divvied up the work of reaching out to all of the relevant service owners. We coordinated using Google Sheets. (In general, I’m opposed to writing automation scripts during an incident if doing the task manually is just as quick, because the blast radius of that sort of script is huge, and such scripts generally don’t get tested well before use because of the urgency.)

While we don’t know exactly what we’ll be called on to do during an incident, we can prepare to improvise. For more on this topic, check out Matt Davis’s piece on Site Reliability Engineering and the Art of Improvisation.

Treating uncertainty as a first-class concern

One of the things that complex incidents have in common is the uncertainty the responders experience while the incident is happening. Something is clearly wrong; that’s why there is an incident. But it’s hard to make sense of the failure mode, of what precisely the problem is, based on the signals we can observe directly.

Eventually, we figure out what’s going on, and how to fix it. By the time the incident review rolls around, while we might not have a perfect explanation for every symptom that we witnessed during the incident, we understand what happened well enough that the in-the-moment uncertainty is long gone.

Cooperative Advocacy: An Approach for Integrating Diverse Perspectives in Anomaly Response is a paper by Jennifer Watts-Englert and David Woods that compares two incidents involving NASA space shuttle missions: one successful and the other tragic. The paper discusses strategies for dealing with what the authors call anomaly response: the work that happens when something unexpected occurs. The authors describe a process they observed, which they call Cooperative Advocacy, for effectively dealing with uncertainty during an incident. They document how cooperative advocacy was applied in the successful NASA mission, and how it was not applied in the failed case.

It’s a good paper (it’s on my list!), and SREs and anyone else who deals with incidents will find it highly relevant to their work. For example, here’s a quote from the paper that I immediately connected with:

For anomaly response to be robust given all of the difficulties and complexities that can arise, all discrepancies must be treated as if they are anomalies to be wrestled with until their implications are understood (including the implications of being uncertain or placed in a difficult trade-off position). This stance is a kind of readiness to re-frame and is a basic building block for other aspects of good process in anomaly response. Maintaining this as a group norm is very difficult because following up on discrepancies consumes resources of time, workload, and expertise. Inevitably, following up on a discrepancy will be seen as a low priority for these resources when a group or organization operates under severe workload constraints and under increasing pressure to be “faster, better, cheaper”.

(See one of my earlier blog posts, chasing down the blipperdoodles).

But the point of this blog post isn’t to summarize this specific paper. Rather, it’s to call attention to the fact that anomaly response is a problem that we will face over and over again. Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

This paper treats anomaly response as a first-class entity, as a thing we need to worry about on an ongoing basis, as something we need to be able to get better at. We should do the same.

Missing the forest for the trees: the component substitution fallacy

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.

Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt from his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change to be, the more effort we’ll spend on validating and rolling it out.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components, then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.

Making peace with the imperfect nature of mental models

We all carry with us in our heads models about how the world works, which we colloquially refer to as mental models. These models are always incomplete, often stale, and sometimes they’re just plain wrong.

For those of us doing operations work, our mental models include our understanding of how the different parts of the system work. Incorrect mental models are always a factor in incidents: incidents are always surprises, and surprises are always discrepancies between our mental models and reality.

There are two things that are important to remember. First, our mental models are usually good enough for us to do our operations work effectively. Our human brains are actually surprisingly good at enabling us to do this stuff. Second, while a stale mental model is a serious risk, none of us have the time to constantly verify that all of our mental models are up to date. This is the equivalent of popping up an “are you sure?” modal dialog box before taking any action. (“Are you sure that pipeline that always deploys to the test environment still deploys to test first?”)

Instead, because our time and attention are limited, we have to get good at identifying cues that indicate our models have gotten stale or are incorrect. But since we won’t always get those cues, it’s inevitable that our mental models will sometimes be out of date. That’s just part of the job when you work in a dynamic environment. And we all work in dynamic environments.