Any change can break us, but we can’t treat every change the same

Here are some excerpts from an incident story told by John Allspaw about his time at Etsy (circa 2012), titled Learning Effectively From Incidents: The Messy Details.

In this story, the site goes down:

September 2012 afternoon, this is a tweet from the Etsy status account saying that there’s an issue on the site… People said, oh, the site’s down. People started noticing that the site is down.

Possibly the referenced issue?

This is a tough outage: the web servers are down so hard that they aren’t even reachable:

And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. And people said, well, we can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights out for hard reboots. And people even said, well, because we’re talking about hundreds of web servers. Could it be faster, we could even just power cycle these. This is a big deal here. So whatever it wasn’t in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

One of the contributors? A CSS change to remove support for old browsers!

And one of the tasks was with the performance team and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So, let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff. …
And in this case, we had this template-based template used as far as we knew everything, and this little header-ie.css, was the actual workaround. And so the idea was, let’s remove all the references to this CSS file in this base template and we’ll remove the CSS file.

How does a CSS change contribute to a major outage?

The request would come in for something that wasn’t there, 404 would happen all the time. The server would say, well, I don’t have that. So I’m going to give you a 404 page and so then I got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers hung, nothing could be done.

I love this story because a CSS change feels innocuous. CSS just controls presentation, right? How could that impact availability? From the story (emphasis mine):

And this had been tested and reviewed by multiple people. It’s not all that big of a deal of a change, which is why it was a task that was sort of slated for the next person who comes through boot camp in the performance team.

The reason a CSS change can cascade into an outage is that in a complex system there are all of these couplings that we don’t even know are there until we get stung by them.
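To make the shape of that coupling concrete, here is a minimal sketch of the feedback loop described above. This is not Etsy’s actual stack or configuration; the paths, the guard, and the structure are hypothetical, purely to illustrate how a missing asset on an error page can feed back into the error handler.

    # A sketch (in Python, not Apache) of the 404 feedback loop: the custom
    # error page references a stylesheet that has been removed, so serving
    # one 404 means serving another, and so on. All paths are hypothetical.

    MISSING = {"/some/old/page", "/css/header-ie.css"}

    def handle(path, depth=0):
        """Serve a request; anything missing falls through to the 404 page."""
        if path not in MISSING:
            return "200 OK"
        return render_404_page(depth)

    def render_404_page(depth):
        """Building the error page pulls in a stylesheet that no longer
        exists, which triggers another 404."""
        if depth > 10:            # guard so this sketch terminates; the real
            return "hung worker"  # servers had no such guard and simply hung
        return handle("/css/header-ie.css", depth + 1)

    print(handle("/some/old/page"))  # -> hung worker

Multiply that loop across every incoming request and every Apache process, and the worker pool fills up with processes doing nothing but generating error pages.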

One lesson you might take away from this story is “you should treat every proposed change like it could bring down the entire system”. But I think that’s the wrong lesson. The reason is another constraint we all face: finite resources. Perhaps in a world where we always had an unlimited amount of time to make any changes, we could take this approach. But we don’t live in that world. We only have a fixed number of hours in a week, which means we need to budget our time. And so we make judgment calls on how much time we’re going to spend on manually validating a change based on how risky we perceive that change to be. When I review someone else’s pull request, for example, the amount of effort I spend on it is going to vary based on the nature of the change: I’m going to look more closely at changes to database schemas than at changes to log messages.

But that means that we’re ultimately going to miss some of these CSS-change-breaks-the-site kinds of changes. It’s fundamentally inevitable that this is going to happen: it’s simply in the nature of complex systems. You can try to add process to force people to scrutinize every change with the same level of effort, but unless you remove schedule pressure, that’s not going to have the desired effect. People are going to make efficiency-thoroughness tradeoffs because they are held accountable for hitting their OKRs, and they can’t achieve those OKRs if they put in the same amount of effort to evaluate every single production change.

Given that we can’t avoid such failures, the best we can do is to be ready to respond to them.

“Human error” means they don’t understand how the system worked

One of the services that the Amazon cloud provides is called S3, which is a data storage service. Imagine a hypothetical scenario where S3 had a major outage, and Amazon’s explanation of the outage was “a hard drive failed”.

Engineers wouldn’t believe this explanation. It’s not that they would doubt that a hard drive failed; we know that hard drives fail all of the time. In fact, it’s precisely because hard drives are prone to failure, and S3 stays up, that they wouldn’t accept this as an explanation. S3 has been architected to function correctly even in the face of individual hard drives failing. While a failed hard drive could certainly be a contributor to an outage, it can’t be the whole story. Otherwise, S3 would constantly be going down. To say “S3 went down because a hard drive failed” is to admit “I don’t know how S3 normally works when it experiences hard drive failures”.

And yet, we accept “human error” as the explanation for failures of reliable systems. Now, I’m a bit of an extremist when it comes to the idea of human error: I believe it simply doesn’t exist. But let’s put that aside for now, and assume that human error is a real thing, and that people make mistakes. The thing is, humans are constantly making mistakes. Every day, in every organization, many people are making many mistakes. The people who work on systems that stay up most of the time are not some sort of hyper-vigilant super-humans who make fewer mistakes than the rest of us. Rather, these people are embedded within systems that have evolved over time to be resistant to these sorts of individual mistakes.

As the late Dr. Richard Cook (no fan of the concept of “human error” himself) put it in How Complex Systems Fail: “Complex systems are heavily and successfully defended against failure.” As a consequence of this, “Catastrophe requires multiple failures – single point failures are not enough.”

Reliable systems are error-tolerant. There are mechanisms within such systems to guard against the kinds of mistakes that people make on a regular basis. Ironically, these mechanisms are not necessarily designed into the system: they can evolve organically and invisibly. But they are there, and they are the reason that these systems stay up day after day.

When someone attributes a failure to “human error”, it means that they do not see these defenses in the system, and so they don’t actually understand how all of these defenses failed in this scenario. When you hear “human error” as an explanation for why a system failed, you should think “this person doesn’t know how the system stays up.” Because without knowing how the system stays up, it is impossible to understand the cases where it comes down.

(I believe Cook himself said something to the effect of “human error is the point where they stopped asking questions”).

Accidents manage you

Here’s a line I liked from episode 461 of Todd Conklin’s PreAccident Investigation Podcast. At around the 8:25 mark, Conklin says:

….accidents, in fact, aren’t preventable. Accidents manage you, so what you really manage is the capacity for the organization to fail safely.

The phrasing “accidents manage you” is great, because it drives home the fact that an incident is not something that we can control. When an incident happens, the system has, quite literally, gone out of control.

While there’s no action we can take that will prevent all incidents, there are things we can do in advance to limit the harm that results from these future incidents. We can build what Conklin calls capacity. This capacity to absorb risk is the thing that we have control over. But it doesn’t come for free: it requires an investment of time and resources.

The surprising power of a technical document written by experts

Good technical writing can have enormous influence. In my last blog post, I wrote about how technical reports written by management consultants can be used to support implementing a change program inside of an organization.

People underestimate how influential such technical documents can be. They have to be written by experts to be effective, and management consultants are really just mercenary “experts”, but they aren’t the only type of expert who can write influential documents.

I was recently listening to an episode of the Ezra Klein Show, where climate scientist Kate Marvel was being interviewed by (guest interviewer) David Wallace-Wells, when I heard another example of this phenomenon.

Here’s an excerpt from the transcript (emphasis added):

(Marvel) And in, I want to say 2018 because that was the release of the U.N.’s 1.5 degree Special Report — which, mea culpa, I was grouchy about.

I thought it was fan fiction. I thought, well, there’s no way we’re going to limit warming to 1.5 degrees. Why are you doing this? And oh, boy. What the world needs is another report. Great. Let’s do that again. And for reasons that I don’t understand, I was so wrong.

I was so wrong about how that was going to be received. I was so wrong about how that would land. And it started something. Now —

(Wallace-Wells) The same year that Greta started striking, the foundation of XR, the sit-in of Sunrise.

(Marvel) Sunrise. To talk about tipping points, that’s not something that I was able to anticipate. And now, I almost never get asked, is it real? I almost never get asked, well, what does climate change mean and why should I care? Instead, I get asked the really good questions about uncertainty, about what’s happening, about how we can prepare, about what we can do.

The irony here is that Marvel is a scientist, a professional whose primary output is technical documents! And yet, Marvel didn’t recognize the impact that a technical report could have on the overall system. It didn’t actually matter that it’s not possible to limit warming to 1.5°C. What mattered was how the document itself ended up changing the system.

Don’t underestimate the power of a technical document. Like any effective system intervention, it has to happen at the right place and the right time. But, if it does, it can make a real difference.

Normal incidents

In 1984, the late sociologist Charles Perrow published the book Normal Accidents: Living with High-Risk Technologies. In this book, he proposed a theory that accidents are unavoidable in systems that have certain properties (what he called interactive complexity and tight coupling), and that nuclear power plants have these properties. In such systems, accidents will inevitably occur during the normal course of operations.

You don’t hear much about Perrow’s Normal Accident Theory these days, as it has been superseded by other theories in safety science, such as High Reliability Organizations and Resilience Engineering (although see Hopkins’s 2013 paper Issues in safety science for criticisms of all three theories). But even if you reject the specifics of Perrow’s theory, the idea of a normal accident or incident is a useful one.

An incident is an abnormal event. Because of this, we assume, reasonably, that an incident must have an abnormal cause: something must have gone wrong in order for the incident to have happened. And so we look to find where the abnormal work was, where it was that someone exercised poor judgment that ultimately led to the incident.

But incidents can happen as a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.

This concept, that all actions and decisions that contributed to an incident were reasonable in the moment they were made, is unintuitive. It requires a very different conceptual model of how incidents happen. But, once you adopt this conceptual model, it completely changes the way you understand incidents. You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?” And this yields very different insights into how the system actually works, how it is that incidents don’t usually happen due to normal work, and how it is that they occasionally do.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA.

I think the value of the LFI approach is based on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes the software and what it does, as well as the humans who do the work to develop and operate it.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).
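To make the last of those decisions concrete, here is a minimal sketch of setting an explicit timeout. The post mentions Java’s defaults; the same trap exists in Python’s requests library, which waits indefinitely unless you pass a timeout. The URL and the timeout values here are hypothetical.

    import requests

    # Without an explicit timeout, requests waits indefinitely, so a hung
    # dependency can pin this caller for as long as it stays hung:
    # resp = requests.get("https://internal-service.example/health")

    # With an explicit (connect, read) timeout, the failure is bounded and
    # can be handled deliberately. The URL and values are illustrative only.
    try:
        resp = requests.get(
            "https://internal-service.example/health",
            timeout=(1.0, 2.0),  # 1 second to connect, 2 seconds to read
        )
    except requests.exceptions.Timeout:
        print("dependency is slow or down; fail fast and fall back")

The point is not the specific values: it’s that someone who has watched a hung dependency take down a service is much more likely to reach for the timeout parameter at all.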

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

Active knowledge

Existential Comics is an extremely nerdy webcomic about philosophers, written and drawn by Corey Mohler, a software engineer(!). My favorite Existential Comics strip is titled Is a Hotdog a Sandwich? A Definitive Study. The topic is… exactly what you would expect.

At the risk of explaining a joke: the punchline is that we can conclude that a hotdog isn’t a sandwich because people don’t generally refer to hotdogs as sandwiches. In Wittgenstein’s view, the meaning of a phrase isn’t determined by a set of formal criteria. Instead, language is use.

In a similar spirit, in his book Designing Engineers, Louis Bucciarelli proposed that we should understand “knowing how something works” to mean knowing how to work it. He begins with an anecdote about telephones:

A few years ago, I attended a national conference on technological literacy… One of the main speakers, a sociologist, presented data he had gathered in the form of responses to a questionnaire. After a detailed statistical analysis, he had concluded that we are a nation of technological illiterates. As an example, he noted how few of us (less than 20 percent) know how our telephone works.

This statement brought me up short. I found my mind drifting and filling with anxiety. Did I know how my telephone works?

Bucciarelli tries to get at what the speaker actually intended by “knowing how a telephone works”.

I squirmed in my seat, doodled some, then asked myself, What does it mean to know how a telephone works? Does it mean knowing how to dial a local or long-distance number? Certainly I knew that much, but this does not seem to be the issue here.

He dives down a level of abstraction into physical implementation details.

No, I suspected the question to be understood at another level, as probing the respondent’s knowledge of what we might call the “physics of the device.”

I called to mind an image of a diaphragm, excited by the pressure variations of speaking, vibrating and driving a coil back and forth within a magnetic field… If this was what the speaker meant, then he was right: Most of us don’t know how our telephone works.

But then Bucciarelli continues to elaborate this scenario:

Indeed, I wondered, does [the speaker] know how his telephone works? Does he know about the heuristics used to achieve optimum routing for long distance calls? Does he know about the intricacies of the algorithms used for echo and noise suppression? Does he know how a signal is transmitted to and retrieved from a satellite in orbit? Does he know how AT&T, MCI, and the local phone companies are able to use the same network simultaneously? Does he know how many operators are needed to keep this system working, or what those repair people actually do when they climb a telephone pole? Does he know about corporate financing, capital investment strategies, or the role of regulation in the functioning of this expansive and sophisticated communication system?

Does anyone know how their telephone works?

At this point, I couldn’t help thinking of that classic tech interview question, “What happens when you type a URL into the address bar of your web browser and hit enter?” It’s a fun question to ask precisely because there are so many different aspects of the overall system that you could potentially dig in on (Do you know how your operating system services keyboard interrupts? How your local Wi-Fi protocol works?). Can anyone really say that they understand everything that happens after hitting enter?

Because no individual possesses this type of comprehensive knowledge of engineered systems, Bucciarelli settles on a definition that relies on active knowledge: knowing-how-it-works as knowing-how-to-use-it.

No, the “knowing how it works” that has meaning and significance is knowing how to do something with the telephone—how to act on it and react to it, how to engage and appropriate the technology according to one’s needs and responsibilities.

I thought of Bucciarelli’s definition while reading Andy Clark’s book Surfing Uncertainty. In Chapter 6, Clark claims that our brain does not need to account for all of its sensory input to build a model of what’s happening in the world. Instead, it relies on simpler models that are sufficient for determining how to act (emphasis mine):

This may well result … in the use of simple models whose power resides precisely in their failing to encode every detail and nuance present in the sensory array. For knowing the world, in the only sense that can matter to an evolved organism, means being able to act in that world: being able to respond quickly and efficiently to salient environmental opportunities.

The through line that connects Wittgenstein, Bucciarelli, and Clark is the idea of knowledge as an active thing. Knowing implies using and acting. To paraphrase David Woods, knowledge is a verb.

Resilience requires helping each other out

A common failure mode in complex systems is that some part of the system hits a limit and falls over. In the software world, we call this phenomenon resource exhaustion, and a classic example of this is running out of memory.

The simplest solution to this problem is to “provision for peak”: to build out the system so that it always has enough resources to handle the theoretical maximum load. Alas, this solution isn’t practical: it’s too expensive. Even if you manage to overprovision the system, over time, it will get stretched to its limits. We need another way to mitigate the risk of overload.

Fortunately, it’s rare for every component of a system to reach its limit simultaneously: while one component might get overloaded, there are likely other components that have capacity to spare. That means that if one component is in trouble, it can borrow resources from another one.
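To make the borrowing idea concrete in software terms, here is a minimal sketch of two components that draw on their own capacity first and dip into a shared reserve only when overloaded. The class, component names, and numbers are all hypothetical; real systems do this with shared thread pools, burstable quotas, or autoscaling.

    import threading

    class BorrowablePool:
        """Capacity that belongs to one component, topped up by borrowing
        from a reserve shared with its neighbors."""

        def __init__(self, own_permits, shared_reserve):
            self.own = threading.Semaphore(own_permits)
            self.shared = shared_reserve

        def acquire(self):
            # Prefer our own capacity; borrow from the shared reserve only
            # when we are overloaded; otherwise shed load.
            if self.own.acquire(blocking=False):
                return "own"
            if self.shared.acquire(blocking=False):
                return "borrowed"
            raise RuntimeError("exhausted: shed load or queue the request")

    reserve = threading.Semaphore(10)        # capacity no single component owns
    checkout = BorrowablePool(50, reserve)   # hypothetical component
    search = BorrowablePool(50, reserve)     # hypothetical component
    print(checkout.acquire())                # -> own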

Indeed, we see this sort of behavior in biological systems. In the paper Allostasis: A Model of Predictive Regulation, the neuroscientist Peter Sterling explains why allostasis is a better theory than homeostasis. Readers are probably familiar with the term homeostasis: it refers to how your body maintains factors in a narrow range, like keeping your body temperature around 98.6°F. Allostasis, on the other hand, is about how your body predicts where these sorts of levels should be, based on anticipated need. Your body then takes action to modify the current state of these levels. Here’s Sterling explaining why he thinks allostasis is superior, referencing the idea of borrowing resources across organs (emphasis mine):

A second reason why homeostatic control would be inefficient is that if each organ self-regulated independently, opportunities would be missed for efficient trade-offs. Thus each organ would require its own reserve capacity; this would require additional fuel and blood, and thus more digestive capacity, a larger heart, and so on – to support an expensive infrastructure rarely used. Efficiency requires organs to trade-off resources, that is, to grant each other short-term loans.

The systems we deal with are not individual organisms, but organizations that are made up of groups of people. In organization-style systems, this sort of resource borrowing becomes more complex. Incentives in the system might make me less inclined to lend you resources, even if doing so would lead to better outcomes for the overall system. In his paper The Theory of Graceful Extensibility: Basic rules that govern adaptive systems, David Woods borrows the term reciprocity from Elinor Ostrom to describe this property, where one agent in a system is willing to lend resources to another, and identifies it as a necessary ingredient for resilience (emphasis mine):

Will the neighboring units adapt in ways that extend the [capacity for maneuver] of the adaptive unit at risk? Or will the neighboring units behave in ways that further constrict the [capacity for maneuver] of the adaptive unit at risk? Ostrom (2003) has shown that reciprocity is an essential property of networks of adaptive units that produce sustained adaptability.

I couldn’t help thinking of the Sterling and Woods papers when reading the latest issue of Nat Bennett’s Simpler Machines newsletter, titled What was special about Pivotal? Nat’s answer is reciprocity:

This isn’t always how it went at Pivotal. But things happened this way enough that it really did change people’s expectations about what would happen if they co-operated – in the game theory, Prisoner’s Dilemma sense. Pivotal was an environment where you could safely lead with co-operation. Folks very rarely “defected” and screwed you over if you led by trusting them.

People helped each other a lot. They asked for help a lot. We solved a lot of problems much faster than we would have otherwise, because we helped each other so much. We learned much faster because we helped each other so much.

And it was generally worth it to do a lot of things that only really work if everyone’s consistent about them. It was worth it to write tests, because everyone did. It was worth it to spend time fixing and removing flakes from tests, because everyone did. It was worth it to give feedback, because people changed their behavior. It was worth it to suggest improvements, because things actually got better.

There was a lot of reciprocity.

Nat’s piece is a good illustration of the role that culture plays in enabling a resilient organization. I suspect it’s not possible to impose this sort of culture; it has to be fostered. I wish this were more widely appreciated.

If you can’t tell a story about it, it isn’t real

We use stories to make sense of the world. What that means is that when events occur that don’t fit neatly into a narrative, we can’t make sense of them. As a consequence, these sorts of events are less salient, which means they’re less real.

In The Invisible Victims of American Anti-Semitism, Yair Rosenberg wrote in the Atlantic about the kinds of attacks targeting Jews that don’t get much attention in the larger media. His claim is that this happens when these attacks don’t fit into existing narratives about anti-Semitism (emphasis mine):

What you’ll also notice is that all of the very real instances of anti-Semitism discussed above don’t fall into either of these baskets. Well-off neighborhoods passing bespoke ordinances to keep out Jews is neither white supremacy nor anti-Israel advocacy gone awry. Nor can Jews being shot and beaten up in the streets of their Brooklyn or Los Angeles neighborhoods by largely nonwhite assailants be blamed on the usual partisan bogeymen.

That’s why you might not have heard about these anti-Semitic acts. It’s not that politicians or journalists haven’t addressed them; in some cases, they have. It’s that these anti-Jewish incidents don’t fit into the usual stories we tell about anti-Semitism, so they don’t register, and are quickly forgotten if they are acknowledged at all.

In The 1918 Flu Faded in Our Collective Memory: We Might ‘Forget’ the Coronavirus, Too, Scott Hershberger speculated in Scientific American along similar lines about why historians paid little attention to the Spanish Flu epidemic, even though it killed more people than World War I (emphasis mine):

For the countries engaged in World War I, the global conflict provided a clear narrative arc, replete with heroes and villains, victories and defeats. From this standpoint, an invisible enemy such as the 1918 flu made little narrative sense. It had no clear origin, killed otherwise healthy people in multiple waves and slinked away without being understood. Scientists at the time did not even know that a virus, not a bacterium, caused the flu. “The doctors had shame,” Beiner says. “It was a huge failure of modern medicine.” Without a narrative schema to anchor it, the pandemic all but vanished from public discourse soon after it ended.

I’m a big believer in the role of interactions, partial information, uncertainty, workarounds, tradeoffs, and goal conflicts as contributors to systems failures. I think the way to convince other people to treat these entities as first-class is to weave them into the stories we tell about how incidents happen. If we want people to see these things as real, we have to integrate them into narrative descriptions of incidents.

Because, if we can’t tell a story about something, it’s as if it didn’t happen.

When there’s no plan for this scenario, you’ve got to improvise

An incident is happening. Your distributed system has somehow managed to get itself stuck in a weird state. There’s a runbook, but because the authors didn’t foresee this failure mode ever happening, the runbook isn’t actually helpful here. To get the system back into a healthy state, you’re going to have to invent a solution on the spot.

In other words, you’re going to have to improvise.

“We gotta find a way to make this fit into the hole for this using nothing but that.” – scene from Apollo 13

Like uncertainty, improvisation is an aspect of incident response that we typically treat as a one-off, rather than as a first-class skill that we should recognize and cultivate. Not every incident requires improvisation to resolve, but the hairiest ones will. And these most complex incidents are the ones we need to worry about the most, because they’re the costliest to the business.

One of the criticisms of resilience engineering as a field is that it isn’t prescriptive. When I talk about resilience engineering research, a response I often hear is “OK, Lorin, that’s interesting, but what should I actually do?” I think resilience engineering is genuinely helpful, and in this case it teaches us that improvisation requires local expertise, autonomy, and effective coordination.

To improvise a solution, you have to be able to effectively use the tools and technologies you have on hand in this situation, what Claude Lévi-Strauss referred to as bricolage. That means you have to know what those tools are and you have to be skilled in their use. That’s the local expertise part. You’ll often need to leverage what David Woods calls a generic capability in order to solve the problem at hand: some element of technology that wasn’t explicitly designed to do what you need, but is generic enough that you can use it anyway.

Improvisation also requires that the people with the expertise have the authority to take required actions. They’re going to need the ability to do risky things, which could potentially end up making things worse. That’s the autonomy part.

Finally, because of the complex nature of incidents, you will typically need to work with multiple people to resolve things. It may be that you don’t have the requisite expertise or autonomy, but somebody else does. Or it may be that the improvised strategy requires coordination across a group of people. I remember one incident where I was the incident commander: a problem was affecting a large number of services, and the only remediation strategy was to restart or re-deploy the affected services; we had to effectively “reboot the fleet”. The deployment tooling at the time didn’t support that sort of bulk activity, so we had to do it manually. A group of us, sitting in the war room (this was in pre-COVID days), divvied up the work of reaching out to all of the relevant service owners. We coordinated using Google Sheets. (In general, I’m opposed to writing automation scripts during an incident if doing the task manually is just as quick, because the blast radius of that sort of script is huge, and such scripts generally don’t get tested well before use because of the urgency.)

While we don’t know exactly what we’ll be called on to do during an incident, we can prepare to improvise. For more on this topic, check out Matt Davis’s piece on Site Reliability Engineering and the Art of Improvisation.