The courage to imagine other failures

All other things being equal, what’s more expensive for your business: a fifteen-minute outage or an eight-hour outage? If you had to pick one, which would you pick? Hold that thought.

Imagine that you work for a company that provides a software service over the internet. A few days ago, your company experienced an incident where the service went down for about four hours. Executives at the company are pretty upset about what happened: “we want to make certain this never happens again” is a phrase you’ve heard several times.

The company held a post-incident review, and the review process identified a number of action items to prevent a recurrence of the incident. Some of this follow-up work has already been completed, but other items are going to take your team a significant amount of time and effort. You already had a decent backlog of reliability work that you had been planning to knock out this quarter, but the incident has pushed that work onto the back burner.

One night, the Oracle of Delphi appears to you in a dream.

Priestess of Delphi (1891) by John Collier

The Oracle tells you that if you prioritize the incident follow-up work, then in a month your system is going to suffer an even worse outage, one that is eight hours long. The failure mode for this outage will be very different from the last one. Ironically, one of the contributors to this outage will be an unintended change in system behavior that was triggered by the follow-up work. Another contributor will be a known risk to the system that you had been working on addressing, but that you put off after the recent incident changed your priorities.

She goes on to tell you that if you instead do the reliability work that was on your backlog, you will avoid this outage. However, your system will instead experience a fifteen-minute outage, with a failure mode very similar to the one you recently experienced. The impact will be much smaller because of the follow-up work that had already been completed, as well as the engineers now being more experienced with this type of failure.

Which path do you choose: the novel eight-hour outage, or the “it happened again!” fifteen-minute outage?

By prioritizing preventative work from recent incidents, we are implicitly assuming that a recent incident is the one most likely to bite us again in the future. It’s important to remember that this is an illusion: the follow-up work feels like the most important thing we can do for reliability because we have a visceral sense of the incident we just went through. It’s much more real to us than a hypothetical, never-happened-before future incident. Unfortunately, we only have a finite amount of resources to spend on reliability work, and our memory of the recent incident does not mean that the follow-up work is the reliability work with the highest return on investment.

In real life, we are never granted perfect information about the future consequences of our decisions. We have only our own judgment to guide us on how we should prioritize our work based on the known risks. Always prioritizing the action items from the last big incident is the easy path. The harder one is imagining the other types of incidents that might happen in the future, and recognizing that those might actually be worse than a recurrence. After all, you were surprised before. You’re going to be surprised again. That’s the real generalizable lesson of that last big incident.

Any change can break us, but we can’t treat every change the same

Here are some excerpts from an incident story told by John Allspaw about his time at Etsy (circa 2012), titled Learning Effectively From Incidents: The Messy Details.

In this story, the site goes down:

September 2012 afternoon, this is a tweet from the Etsy status account saying that there’s an issue on the site… People said, oh, the site’s down. People started noticing that the site is down.

Possibly the referenced issue?

This is a tough outage: the web servers are down so hard that they aren’t even reachable:

And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. And people said, well, we can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights out for hard reboots. And people even said, well, because we’re talking about hundreds of web servers. Could it be faster, we could even just power cycle these. This is a big deal here. So whatever it wasn’t in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

One of the contributors? A CSS change to remove support for old browsers!

And one of the tasks was with the performance team and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So, let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff. …
And in this case, we had this template-based template used as far as we knew everything, and this little header-ie.css, was the actual workaround. And so the idea was, let’s remove all the references to this CSS file in this base template and we’ll remove the CSS file.

How does a CSS change contribute to a major outage?

The request would come in for something that wasn’t there, 404 would happen all the time. The server would say, well, I don’t have that. So I’m going to give you a 404 page and so then I got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers hung, nothing could be done.

I love this story because a CSS change feels innocuous. CSS just controls presentation, right? How could that impact availability? From the story (emphasis mine):

And this had been tested and reviewed by multiple people. It’s not all that big of a deal of a change, which is why it was a task that was sort of slated for the next person who comes through boot camp in the performance team.

The reason a CSS change can cascade into an outage is that in a complex system there are all of these couplings that we don’t even know are there until we get stung by them.
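As a toy illustration of that hidden coupling, here’s a hedged sketch of the 404 feedback loop in Python. The file name comes from the story; everything else (function names, the depth guard) is invented for illustration and is not Etsy’s actual code:

```python
# Toy simulation of the feedback loop from the story: the custom 404
# page references a stylesheet that was removed from disk, so serving
# one 404 triggers a request for another missing file, which triggers
# another 404, and so on.

MISSING = {"/header-ie.css"}  # removed from disk, but still referenced

def requests_triggered(path, depth=0, max_depth=5):
    """Count the requests caused by one incoming request.

    The real servers had no max_depth guard; the chain of 404s just
    kept every Apache process busy until nothing could be done.
    """
    if path not in MISSING:
        return 1  # normal response, no follow-up requests
    if depth >= max_depth:
        return 1
    # Rendering the 404 page fires a request for the missing CSS file.
    return 1 + requests_triggered("/header-ie.css", depth + 1, max_depth)

print(requests_triggered("/header-ie.css"))  # 6: one request became six
```

The point of the sketch is the shape of the loop, not the numbers: without the artificial depth guard, a single request for a missing file fans out without bound.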

One lesson you might take away from this story is “you should treat every proposed change like it could bring down the entire system”. But I think that’s the wrong lesson, because of another constraint we all face: finite resources. Perhaps in a world where we always had an unlimited amount of time to make any change, we could take this approach. But we don’t live in that world. We only have a fixed number of hours in a week, which means we need to budget our time. And so we make judgment calls on how much time to spend manually validating a change based on how risky we perceive that change to be. When I review someone else’s pull request, for example, the amount of effort I spend on it varies with the nature of the change: I’m going to look more closely at changes to database schemas than at changes to log messages.

But that means that we’re ultimately going to miss some of these CSS-change-breaks-the-site kinds of changes. It’s fundamentally inevitable that this is going to happen: it’s simply in the nature of complex systems. You can try to add process to force people to scrutinize every change with the same level of effort, but unless you remove schedule pressure, that’s not going to have the desired effect. People are going to make efficiency-thoroughness tradeoffs because they are held accountable for hitting their OKRs, and they can’t achieve those OKRs if they put in the same amount of effort to evaluate every single production change.

Given that we can’t avoid such failures, the best we can do is to be ready to respond to them.

“Human error” means they don’t understand how the system worked

One of the services that the Amazon cloud provides is called S3, which is a data storage service. Imagine a hypothetical scenario where S3 had a major outage, and Amazon’s explanation of the outage was “a hard drive failed”.

Engineers wouldn’t believe this explanation. It’s not that they would doubt that a hard drive failed; we know that hard drives fail all of the time. In fact, it’s precisely because hard drives are prone to failure, and S3 stays up, that they wouldn’t accept this as an explanation. S3 has been architected to function correctly even in the face of individual hard drives failing. While a failed hard drive could certainly be a contributor to an outage, it can’t be the whole story. Otherwise, S3 would constantly be going down. To say “S3 went down because a hard drive failed” is to admit “I don’t know how S3 normally works when it experiences hard drive failures”.

Yet we accept “human error” as the explanation for failures of reliable systems. Now, I’m a bit of an extremist when it comes to the idea of human error: I believe it simply doesn’t exist. But let’s put that aside for now, and assume that human error is a real thing, and people make mistakes. The thing is, humans are constantly making mistakes. Every day, in every organization, many people are making many mistakes. The people who work on systems that stay up most of the time are not some sort of hyper-vigilant super-humans who make fewer mistakes than the rest of us. Rather, these people are embedded within systems that have evolved over time to be resistant to these sorts of individual mistakes.

As the late Dr. Richard Cook (no fan of the concept of “human error” himself) put it in How Complex Systems Fail: “Complex systems are heavily and successfully defended against failure”. As a consequence of this, “Catastrophe requires multiple failures – single point failures are not enough.”

Reliable systems are error-tolerant. There are mechanisms within such systems to guard against the kinds of mistakes that people make on a regular basis. Ironically, these mechanisms are not necessarily designed into the system: they can evolve organically and invisibly. But they are there, and they are the reason that these systems stay up day after day.

What this means is that when someone attributes a failure to “human error”, it means that they do not see these defenses in the system, and so they don’t actually have an understanding of how all of these defenses failed in this scenario. When you hear “human error” as an explanation for why a system failed, you should think “this person doesn’t know how the system stays up.” Because without knowing how the system stays up, it is impossible to understand the cases where it comes down.

(I believe Cook himself said something to the effect of “human error is the point where they stopped asking questions”).

For want of a dollar

Back in August, The New York Times ran a profile of Morris Chang, the founder of TSMC.

It’s hard to overstate the role that this Taiwan-based semiconductor company plays in the industry. If you search for articles about it, you’ll see headlines like TSMC: The Most Important Tech Company You Never Heard Of and TSMC: how a Taiwanese chipmaker became a linchpin of the global economy.

What struck me in the NY Times article was this anecdote about Chang’s search for a job after he failed out of a Ph.D. program at MIT in 1955 (emphasis mine):

Two of the best offers arrived from Ford Motor Company and Sylvania, a lesser-known electronics firm. Ford offered Mr. Chang $479 a month for a job at its research and development center in Detroit. Though charmed by the company’s recruiters, Mr. Chang was surprised to find the offer was $1 less than the $480 a month that Sylvania offered.

When he called Ford to ask for a matching offer, the recruiter, who had previously been kind, turned hostile and told him he would not get a cent more. Mr. Chang took the engineering job with Sylvania. There, he learned about transistors, the microchip’s most basic component.

“That was the start of my semiconductor career,” he said. “In retrospect, it was a damn good thing.”

The course of history changed because an internal recruiter at Ford refused to offer him an additional dollar a month ($11.46 in 2023 dollars) to match a competing offer!

This is the sort of thing that historians call contingency.

Accidents manage you

Here’s a line I liked from episode 461 of Todd Conklin’s PreAccident Investigation Podcast. At around the 8:25 mark, Conklin says:

….accidents, in fact, aren’t preventable. Accidents manage you, so what you really manage is the capacity for the organization to fail safely.

The phrasing “accidents manage you” is great, because it drives home the fact that an incident is not something that we can control. When an incident happens, the system has, quite literally, gone out of control.

While there’s no action we can take that will prevent all incidents, there are things we can do in advance to limit the harm that results from these future incidents. We can build what Conklin calls capacity. This capacity to absorb risk is the thing that we have control over. But it doesn’t come for free: it requires an investment of time and resources.

The surprising power of a technical document written by experts

Good technical writing can have enormous influence. In my last blog post, I wrote about how technical reports written by management consultants can be used to support implementing a change program inside of an organization.

People underestimate how influential such technical documents can be. To be effective, they have to be written by experts. Management consultants are really just mercenary “experts”, but they aren’t the only type of expert who can write influential documents.

I was recently listening to an episode of the Ezra Klein Show, where climate scientist Kate Marvel was being interviewed by (guest interviewer) David Wallace-Wells, when I heard another example of this phenomenon.

Here’s an excerpt from the transcript (emphasis added):

(Marvel) And in, I want to say 2018 because that was the release of the U.N.‘s 1.5 degree Special Report — which, mea culpa, I was grouchy about.

I thought it was fan fiction. I thought, well, there’s no way we’re going to limit warming to 1.5 degrees. Why are you doing this? And oh, boy. What the world needs is another report. Great. Let’s do that again. And for reasons that I don’t understand, I was so wrong.

I was so wrong about how that was going to be received. I was so wrong about how that would land. And it started something. Now —

(Wallace-Wells) The same year that Greta started striking, the foundation of XR, the sit-in of Sunrise.

(Marvel) Sunrise. To talk about tipping points, that’s not something that I was able to anticipate. And now, I almost never get asked, is it real? I almost never get asked, well, what does climate change mean and why should I care? Instead, I get asked the really good questions about uncertainty, about what’s happening, about how we can prepare, about what we can do.

The irony here is that Marvel is a scientist, a professional whose primary output is technical documents! And yet, Marvel didn’t recognize the impact that a technical report could have on the overall system. It didn’t actually matter that it’s not possible to limit warming to 1.5°C. What mattered was how the document itself ended up changing the system.

Don’t underestimate the power of a technical document. Like any effective system intervention, it has to happen at the right place and the right time. But, if it does, it can make a real difference.

On productivity metrics and management consultants

The management consulting firm McKinsey & Company recently posted a blog post titled Yes, you can measure software developer productivity. The post prompted a lot of responses, such as Kent Beck and Gergely Orosz’s Measuring developer productivity? A response to McKinsey, Dan North’s The Worst Programmer I Know, and John Cutler’s The Ultimate Guide to Developer Counter-Productivity.

Now, I’m an avowed advocate of qualitative approaches to studying software development, but I started out my academic research career on the quantitative side, doing research into developer productivity metrics. And so I started to read the McKinsey post with the intention of writing a response, on why qualitative approaches are better for gaining insight into productivity issues. And I hope to write that post soon. But something jumped out at me that changed what I wanted to write about today. It was this line in particular (emphasis mine):

For example, one company that had previously completed a successful agile transformation learned that its developers, instead of coding, were spending too much time on low-value-added tasks such as provisioning infrastructure, running manual unit tests, and managing test data. Armed with that insight, it launched a series of new tools and automation projects to help with those tasks across the software development life cycle.

I realized that I missed the whole point of this post. The goal isn’t to gain insight, it’s to justify funding a new program inside an organization.

In order to effect change in an organization, you need political capital, even if you’re an executive. That’s because making change in an organization is hard, programs are expensive and don’t bear fruit for a long time, and so you need to get buy-in in order to make things happen.

McKinsey is a management consulting firm. One of the services that management consulting firms provide is that they will sell you political capital. They provide a report generated by external experts that their customers can use as leverage within their organizations to justify change programs.

As Lee Clarke describes in his book Mission Improbable: Using Fantasy Documents to Tame Disaster, technical reports written by experts have rhetorical, symbolic power, even if the empirical foundations of the reports are weak. Clarke’s book focuses on the unverified nature of disaster recovery documents, but the same holds true for reports based on software productivity metrics.

If you want to institute a change to a software development organization, and you don’t have the political capital to support it, then building a metrics program that will justify your project is a pretty good strategy if you can pull that off: if you can define metrics that will support the outcome that you want, and you can get the metrics program in place, then you can use it as ammunition for the new plan. (“We’re spending too much time on toil, we should build out a system to automate X”).

Of course, this sounds extremely cynical. You’re creating a metrics program where you know in advance what the metrics are going to show, with the purpose of justifying a new program you’ve already thought of? You’re claiming that you want to study a problem when you already have a proposed solution in the wings! But this is just how organizations work.

And, so, it makes perfect sense that McKinsey & Company would write a blog post like this. They are, effectively, a political-capital-as-a-service (PCaaS?) company. Helping executives justify programs inside of companies is what they do for a living. But they can’t simply state explicitly how the magic trick actually works, because then it won’t work anymore.

The danger is when the earnest folks, the ones who are seeking genuine insight into the nature of software productivity issues in their organization, read a post like this. Those are the ones I want to talk to about the value of a qualitative approach for gaining insight.

Operating effectively in high surprise mode

When you deploy a service into production, you need to configure it with enough resources (e.g., CPU, memory) so that it can handle the volume of requests you expect it to receive. You’ll want to provision it so that it can service 100% of the requests when receiving the typical amount of traffic, and you probably want some buffer in there as well.

However, as a good operator, you also know that sometimes your service will receive an unexpected increase in traffic, one that’s large enough to push your service beyond the resources you’ve provisioned for it, even with that extra buffer.

When your service is overloaded, even though it can’t service 100% of the requests, you want to design it so that it doesn’t simply keel over and service 0% of them. There are well-known patterns for designing a service to degrade gracefully in the face of overload, so that it can still service some requests and doesn’t get so overloaded that it can’t recover when the traffic abates. These patterns include rate limiters and circuit breakers. Michael Nygard’s book Release It! is a great source for this, and the concepts he describes have been implemented in libraries such as Hystrix and Resilience4j.
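As a hedged sketch of one of these patterns, here’s a minimal circuit breaker in Python. This is an illustration of the concept, not the actual API of Hystrix or Resilience4j, and the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after too many consecutive
    failures, reject calls immediately for a cool-down period instead
    of letting them pile up on an overloaded dependency."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The design choice worth noticing: when the breaker is open, the overloaded dependency gets no traffic at all, which is what gives it a chance to recover.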

You can think of “expected number of requests” and “too many requests” as two different modes of operation of your service: you want to design it so that it performs well in both modes.

A service switching operational modes from “normal amount of requests” to “too many requests”

Now, imagine in the graph above, instead of the y-axis being “number of requests seen by the service”, it’s “degree of surprise experienced by the operators”.

As we humans navigate the world, we are constantly taking in sensory input. Imagine if I asked you, at regular intervals, “On a scale of 1-10, how surprised are you about your current observations of the world?”, and I plotted it on a graph like the one above. During a typical day, the way we experience the world isn’t too surprising. However, every so often, our observations of the world just don’t make sense to us. The things we’re seeing just shouldn’t be happening, given our mental models of how the world works. When it’s the software’s behavior that’s surprising, and that surprising behavior has a significant negative impact on the business, we call it an incident.

And, just like a software service behaves differently under a very high rate of inbound requests than it does under the typical rate, your socio-technical system (which includes your software and your people) is going to behave differently under high levels of surprise than it does under typical levels.

Similarly, just like you can build your software system to deal more effectively with overload, you can also influence your socio-technical system to deal more effectively with surprise. That’s really what the research field of resilience engineering is about: understanding how some socio-technical systems are more effective than others when working in high surprise mode.

It’s important to note that being more effective at high surprise mode is not the same as trying to eliminate surprises in the future. Adding more capacity to your software service enables it to handle more traffic, but it doesn’t help deal with the situation where the traffic exceeds even those extra resources. Rather, your system needs to be able to change what it does under overload. Similarly, saying “we are going to make sure we handle this scenario in the future” does nothing to improve your system’s ability to function effectively in high surprise mode.

I promise you, your system is going to enter high surprise mode in the future. The number of failure modes you have eliminated does nothing to improve your ability to function well when that happens. While root cause analysis (RCA) will eliminate a known failure mode, learning from incidents (LFI) will help your system function better in high surprise mode.

Normal incidents

In 1984, the late sociologist Charles Perrow published the book Normal Accidents: Living with High-Risk Technologies. In it, he proposed a theory that accidents were unavoidable in systems with certain properties, and that nuclear power plants had those properties. In such systems, accidents would inevitably occur during the normal course of operations.

You don’t hear much about Perrow’s Normal Accident Theory these days, as it has been superseded by other theories in safety science, such as High Reliability Organizations and Resilience Engineering (although see Hopkins’s 2013 paper Issues in safety science for criticisms of all three theories). But even rejecting the specifics of Perrow’s theory, the idea of a normal accident or incident is a useful one.

An incident is an abnormal event. Because of this, we assume, reasonably, that an incident must have an abnormal cause: something must have gone wrong in order for the incident to have happened. And so we look to find where the abnormal work was, where it was that someone exercised poor judgment that ultimately led to the incident.

But incidents can happen as a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.

This concept, that all actions and decisions that contributed to an incident were reasonable in the moment they were made, is unintuitive. It requires a very different conceptual model of how incidents happen. But, once you adopt this conceptual model, it completely changes the way you understand incidents. You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?” And this yields very different insights into how the system actually works, how it is that incidents don’t usually happen due to normal work, and how it is that they occasionally do.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA.

I think the value of the LFI approach rests on assumptions that people don’t really think about, because these assumptions are rarely articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software and what it does, and the humans who do the work to develop and operate the system.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.