You can’t judge risk in hindsight

A while back, the good folks at Google SRE posted an article titled Lessons Learned from Twenty Years of Site Reliability Engineering. There’s some great stuff in here, but I wanted to pick on the first lesson: The riskiness of a mitigation should scale with the severity of the outage. Here are some excerpts from the article (emphasis mine):

Let’s start back in 2016, when YouTube was offering your favorite videos such as “Carpool Karaoke with Adele” and the ever-catchy “Pen-Pineapple-Apple-Pen.” YouTube experienced a fifteen-minute global outage, due to a bug in YouTube’s distributed memory caching system, disrupting YouTube’s ability to serve videos.

We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve. During the aforementioned YouTube outage, a risky load-shedding process didn’t fix the outage… it instead created a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

The question I had reading this was: how did the authors make the judgment that the load-shedding mitigation was risky? In particular, how was the risk of the mitigation perceived in the moment? Note: this question is still relevant, even if the authors/contributors were the actual responders!

When a bad outcome happens, it’s easy to say with hindsight that the action was risky. But we can really only judge the riskiness based on what was understood by the operators at the time they had to make the call. As the good Dr. Cook noted in the endlessly quotable How Complex Systems Fail, all practitioner actions are gambles:

After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I have no firsthand knowledge of this particular incident. But, just as nobody ever wakes up and says “I’m going to do a bad job today”, nobody wakes up and says “I’m going to take unnecessary risks today.” Doing operations work means making risk trade-offs under uncertainty. We generally don’t know in advance how risky a particular mitigation will be. I think the real lesson is to recognize the inherent challenge that operators face in these scenarios.

The problem with a root cause is that it explains too much

The recent performance of the stock market brings to mind the comment of a noted economist who was once asked whether the market is a good leading indicator of general economic activity. Wonderful, he replied sarcastically, it has predicted nine of the last four recessions. – Alfred L. Malabre Jr., 1968 March 4, The Wall Street Journal

In response to my previous post, Peter Ludemann made the following observation on Mastodon:

This post makes the case for why I would still call these contributors rather than root causes, even though they certainly sound root-cause-y. (They’re also fantastic examples of risks that are very common in the types of systems we work in, but that’s not the topic of this particular post).

Let’s take the first one, “a configuration system that makes mistakes easy.” I’d ask the question, “does an incident occur every single time somebody uses the configuration system?” I don’t know the details of the particular incident(s) that Peter is alluding to, but I’m willing to bet that this isn’t true. Rather, I assume what he is saying is that the configuration system is fundamentally unsafe in some way (e.g., it’s too easy to unintentionally take a dangerous action), and every once in a while a dangerous mistake would happen and an incident would occur.

What this means is that the unsafe configuration system by itself isn’t sufficient for the incident to occur! The config system enables incidents to occur, but it doesn’t, by itself, create the incident. Rather, it’s a combination of the configuration system, and some other factors, that trigger incidents. Maybe incidents only manifest when there is a particular action a user is trying to take, or maybe some people know how to work around the sharp edges and others don’t, or other things.

This may sound like sophistry. After all, the configuration system is an unsafe operator interface. The lesson from an incident is that we should fix it! However, here’s the problem with that line of thinking. The truth is that there are many types of these sorts of problems in a system. I like to call these problems vulnerabilities, even though people usually reserve that term in a security context. Peter gives three examples, but our systems are really shot through with these sorts of vulnerabilities. There are all sorts of unsafe operator interfaces, assumptions that have become invalidated with change, dangerous potential interactions between components, and so on. These vulnerabilities are the sorts of issues that the safety researcher James Reason referred to as latent pathogens. Reason is the one who proposed the Swiss cheese model, with the latent pathogens being the holes in the cheese.

My problem with labeling these vulnerabilities as root causes is that this obscures how our systems actually spend most of their time up, even though these vulnerabilities are always present. Let’s say you were able to identify every vulnerability you had in a system. If you label each one as a root cause of an outage, then your system should be down all of the time, because these vulnerabilities are all present in your system!

But your system isn’t down all of the time: in fact, it’s up more often than it’s down, even though these vulnerabilities are omnipresent. And the reason your system is up more than it’s down is that these vulnerabilities are not, by themselves, sufficient to take down a system. If you label these vulnerabilities as root causes, you make it impossible to understand how your system actually succeeds. And if you don’t know how it succeeds, you can’t understand how it fails. You’re like the economist predicting recessions that don’t happen.

Now, whether we label these vulnerabilities as root causes or not, they clearly represent a risk to your system. But we have an additional problem: we live in the adaptive universe. That means we don’t actually have the resources (in particular, the time) to identify and patch all of these vulnerabilities. And, even if we could stop the world, find them all, fix them all, and start the world again, our system keeps changing over time, and new vulnerabilities would set in. And that doesn’t even take into account how patching these vulnerabilities can create new ones. The adaptive universe also teaches us that our work will inevitably introduce new vulnerabilities because we only have a finite amount of time to actually do that work. Confusing problems with individual components with the general problem of finite resources is the component substitution fallacy.

In short, labeling vulnerabilities as root causes is dangerous because it blinds us to the nature of how complex systems manage to stay up and running most of the time, even though vulnerabilities within the system are always with us. Now, these vulnerabilities are still risks! However, they may or may not manifest as incidents. In addition, we can’t predict which ones will bite us, and we don’t have the resources to root all of them out. We use “this just bit us so we should address it because otherwise it will bite us again” as a heuristic, but it’s an implicit one. What we should be asking is “given that we have limited resources, is spending the time addressing this particular vulnerability worth the opportunity cost of delaying other work?”

Green is the color of complacency

Here are a few anecdotes about safety from the past few years.

In 2020, the world was struck by the COVID-19 pandemic. The U.S. response was… not great. Earlier in 2019, before the pandemic struck, the Johns Hopkins Center for Health Security released a pandemic preparedness assessment that ranked 195 countries on how well prepared they were to deal with a pandemic. The U.S. was ranked number one: it was identified as the most well-prepared country on earth.

With its pandemic playbook, “The U.S. was very well prepared,” said Eric Toner, senior scholar at the Johns Hopkins Center for Health Security. “What happened is that we didn’t do what we said we’d do. That’s where everything fell apart. We ended up being the best prepared and having one of the worst outcomes.”

On October 29, 2018, Lion Air Flight 610 crashed 13 minutes after takeoff, killing everyone on board. This plane was a Boeing 737 MAX, and a second 737 MAX had a fatal crash a few months later. Seven days prior to the Lion Air crash, the National Safety Council presented the Boeing Company with the Robert W. Campbell Award for leadership in safety:

“The Boeing Company is a leader in one of the most safety-centric industries in the world,” said Deborah A.P. Hersman, president and CEO of the National Safety Council. “Its innovative approaches to EHS excellence make it an ideal recipient of our most prestigious safety award. We are proud to honor them, and we appreciate their commitment to making our world safer.”

On April 20th, 2010, an explosion on the Deepwater Horizon offshore drilling rig killed eleven workers and led to the largest marine oil spill in the history of the industry. The year before, the U.S. Minerals Management Service issued its SAFE award to Deepwater Horizon:

MMS issued its SAFE award to Transocean for its performance in 2008, crediting the company’s “outstanding drilling operations” and a “perfect performance period.” Transocean spokesman Guy Cantwell told ABC News the awards recognized a spotless record during repeated MMS inspections, and should be taken as evidence of the company’s longstanding commitment to safety.

When things are going badly, everybody in the org knows it. If you go into an organization where high-severity incidents are happening on a regular basis, where everyone is constantly in firefighting mode, then you don’t need metrics to tell you how bad things are: it’s obvious to everyone, up and down the chain. The problems are all-too-visible. Everybody can feel them viscerally.

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk. As the examples above show, an absence of incidents does not indicate an absence of risk. In fact, these quiet times can lull us into a sense of complacency, leading us to think that we’re in a good spot, when the truth is that there’s a significant risk that’s hidden beneath the surface.

Personally, I don’t believe it’s even possible to say with confidence that “everything is ok right now”. As the cases above demonstrate, when things are quiet, there’s a limit to how well we can actually assess the risk based on the kinds of data we traditionally collect.

So, should you be worried about your system? If you find yourself constantly in firefighting mode, then, yes, you should be worried. And if things are running smoothly, and the availability metrics are all green? Then, also yes, you should be worried. You should always be worried. The next major incident is always just around the corner, no matter how high your ranking is, or how many awards you get.

The perils of outcome-based analysis

Imagine you wanted to understand how to get better at playing the lottery. You strike upon a research approach: study previous lottery winners! You collect a list of winners, look them up, interview them about how they go about choosing their numbers, collate this data, identify patterns, and use these to define strategies for picking numbers.

The problem with this approach is that it doesn’t tell you anything about how effective these strategies actually are. To really know how well these strategies work, you’d have to look at the entire population of people who employed them. For example, say that you find that most lottery winners use their birthdays to generate winning numbers. It may turn out that, for every winning ticket that has the ticket holder’s birthday, there are 20 million losing tickets that also have the ticket holder’s birthday. To understand a strategy’s effectiveness, you can’t just look at the winning outcomes: you have to look at the losing outcomes as well. The technical term for this type of analytic error is selecting on the dependent variable.

Here’s another example of this error in reasoning: according to the NHTSA, 32% of all traffic crash fatalities in the United States involve drunk drivers. That means that 68% of all traffic crash fatalities involve sober drivers. If you only look at scenarios that involve crash fatalities, it looks like being sober is twice as dangerous as being drunk! It’s a case of only looking at the dependent variable: crash fatalities. If we were to look at all driving scenarios, we’d see that there are a lot more sober drivers than drunk drivers, and that any given sober driver is less likely to get into a crash fatality than a given drunk driver. Being sober is safer, even though sober drivers appear more often in fatal accidents than drunk drivers.
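To make the base-rate point concrete, here is a small worked calculation. Only the 32% / 68% split comes from the figure quoted above; the share of driving that is done drunk (ASSUMED_DRUNK_EXPOSURE) is a made-up number chosen purely for illustration, so treat the result as a sketch of the reasoning rather than a real statistic.

```python
# Hypothetical illustration: only the 32% / 68% split comes from the text.
# The share of driving done while drunk is an ASSUMED number, not a statistic.
FATAL_SHARE_DRUNK = 0.32
FATAL_SHARE_SOBER = 0.68
ASSUMED_DRUNK_EXPOSURE = 0.02  # assumption: 2% of all driving is done drunk


def relative_risk(fatal_share_a, exposure_a, fatal_share_b, exposure_b):
    """Fatality risk per unit of driving for group A, relative to group B."""
    return (fatal_share_a / exposure_a) / (fatal_share_b / exposure_b)


# Looking only at fatalities (selecting on the dependent variable):
naive_ratio = FATAL_SHARE_SOBER / FATAL_SHARE_DRUNK

# Looking at all driving, including the trips where nothing went wrong:
risk_ratio = relative_risk(
    FATAL_SHARE_DRUNK, ASSUMED_DRUNK_EXPOSURE,
    FATAL_SHARE_SOBER, 1 - ASSUMED_DRUNK_EXPOSURE,
)

print(f"sober vs. drunk, counting fatalities only: {naive_ratio:.1f}x")
print(f"drunk vs. sober, per unit of driving: {risk_ratio:.1f}x")
```

The direction of the second ratio flips as soon as the uneventful trips are included in the denominator, whatever exposure number you plug in below 32%.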

Now, imagine an organization that holds a weekly lottery. But it’s a bizarro-world type of lottery: if someone wins, then they receive a bad outcome instead of a good one. And the bad outcome doesn’t just impact the “winner” (although they are impacted the most), it has negative consequences for the entire organization. Nobody would willingly participate in such a lottery, but everyone in the organization is required to: you can’t opt out. Every week, you have to buy a ticket, and hope the numbers you picked don’t come up.

The organization wants to avoid these negative outcomes, and so they try to identify patterns in how previous lottery “winners” picked their numbers, so that they can reduce the likelihood of future lottery wins by warning people against using these dangerous number-picking strategies.

At this point, the comparison to how we treat incidents should be obvious. If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

The real-world case is even worse than the lottery case: lotteries really are random, but the way that people do their work isn’t; rather, it’s adaptive. People do work in specific ways because they have found that it’s an effective way to get stuff done given the constraints that they are under. The only way to really understand why people work the way they do is to understand how those adaptations usually succeed. Unless you’re really looking for it, you aren’t going to be able to learn how people develop successful adaptations if you only ever examine the adaptations when they fail. Otherwise, you’re just doing the moral equivalent of asking what lottery winners have in common.

The problem with invariants is that they change over time

Cliff L. Biffle blogged a great write-up of a debugging odyssey at Oxide with the title Who killed the network switch? Here’s the bit that jumped out at me:

At the time that code was written, it was correct, but it embodied the assumption that any loaned memory would fit into one region.

That assumption became obsolete the moment that Matt implemented task packing, but we didn’t notice. This code, which was still simple and easy to read, was now also wrong.

This type of assumption is an example of an invariant, a property of the system that is supposed to be guaranteed to not change over time. Invariants play an important role in formal methods (for example, see the section Writing an invariant in Hillel Wayne’s Learn TLA+ site).

Now, consider the following:

  • Our systems change over time. In particular, we will always make modifications to support new functionality that we could not have foreseen earlier in the lifecycle of the system.
  • Our code often rests on a number of invariants, properties that are currently true of our system and that we assume will always be true.
  • These invariants are implicit: the assumptions themselves are not explicitly represented in the source code. That means there’s no easy way to, say, mechanically extract them via static analysis.
  • A change that violates an assumed invariant can be arbitrarily far away from the code that depends on the invariant to function properly.
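The bullets above can be sketched in a few lines of code. This is a hypothetical illustration, not the Oxide code: the Region type, the covering_region function, and the memory layouts are all invented here. The lookup function silently assumes that any buffer fits inside a single region, and a layout change made somewhere else breaks that assumption without touching the function at all.

```python
# Hypothetical sketch: all names and layouts are invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int
    size: int

    @property
    def end(self) -> int:
        return self.start + self.size


def covering_region(regions, addr, length):
    """Return the single region containing [addr, addr + length), else None.

    Implicit invariant: every buffer fits inside ONE region. Nothing in
    this function states that assumption or checks that it still holds.
    """
    for region in regions:
        if region.start <= addr and addr + length <= region.end:
            return region
    return None  # callers interpret this as "this memory is not accessible"


# Original layout: one large region, so the invariant happens to hold.
old_layout = [Region(start=0, size=4096)]

# A later change, made elsewhere, splits the same memory into two
# adjacent regions. The lookup code above is not modified.
new_layout = [Region(start=0, size=2048), Region(start=2048, size=2048)]

buffer_addr, buffer_len = 1024, 2048  # a buffer that straddles the split

print(covering_region(old_layout, buffer_addr, buffer_len))
print(covering_region(new_layout, buffer_addr, buffer_len))
```

Under the old layout the buffer is found; under the new layout the same, perfectly valid buffer is reported as inaccessible, even though every byte of it is still covered by some region.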

What this means is that these kinds of failure modes are inevitable. If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change. If your system lives long enough, it’s going to change. And one of those changes is eventually going to invalidate an assumption that somebody made long ago, which was a reasonable assumption to make at the time.

Implicit invariants are, by definition, impossible to enforce explicitly. They are time bombs. And they are everywhere.

What if everybody did everything right?

In the wake of an incident, we want to answer the questions “What happened?” and, afterwards, “What should we do differently going forward?” Invariably, this leads to people trying to answer the question “what went wrong?”, or, even more specifically, the two questions:

  • What did we do wrong here?
  • What didn’t we do that we should have?

There’s an implicit assumption behind these questions that, because there was a bad outcome, there must have been a bad action (or an absence of a good action) that led to that outcome. It’s such a natural conclusion to reach that I’ve only ever seen it questioned by people who have been exposed to concepts from resilience engineering.

In some sense, this belief in bad outcomes from bad actions is like Aristotle’s claim that heavier objects fall faster than lighter ones. Intuitively, it seems obvious, but our intuitions lead us astray. But in another sense, it’s quite different, because it’s not something we can test by running an experiment. Instead, the idea that systems fail because somebody did something wrong (or didn’t do something right) is more like a lens or a frame: it’s a perspective, a way of making sense of the incident. It’s like how the fields of economics, psychology, and sociology act as different lenses for making sense of the world: a sociological explanation of a phenomenon (say, the First World War) will be different from an economic explanation, and we will get different insights from the different lenses.

An alternative lens for making sense of an incident is to ask the question “how did this incident happen, assuming that everybody did everything right?” In other words, assume that everybody whose actions contributed to the incident made the best possible decision based on the information they had, and the constraints and incentives that were imposed upon them.

Looking at the incident from this perspective will yield very different kinds of insights, because it will generate different types of questions, such as:

  • What information did people know in the moment?
  • What were the constraints that people were operating under?

Now, I personally believe that the second perspective is strictly superior to the first, but I acknowledge that this is a judgment based on personal experience. However, even if you think the first perspective also has merit, if you truly want to maximize the amount of insight you get from a post-incident analysis, then I encourage you to try the second perspective as well. Make the claim “Let’s assume everybody did everything right. How could this incident still have happened?” I guarantee, you’ll learn something new about your system that you didn’t know before.

You should’ve known how to build a non-causal system

Reporting an outcome’s occurrence consistently increases its perceived likelihood and alters the judged relevance of data describing the situation preceding the event.

Baruch Fischhoff, Hindsight ≠ foresight: the effect of outcome knowledge on judgment under uncertainty, Journal of Experimental Psychology: Human Perception and Performance 1975, Volume 1, pages 288–299

In my last blog post, I wrote about how computer scientists use execution histories to reason about consistency properties of distributed data structures. One class of consistency properties is known as causal consistency. In my post, I used an example that shows a violation of one causal consistency property, called writes follow reads.

Here’s the example I used, with timestamps added (note: this is a single-process example, there’s no multi-process concurrency here).

t=0: q.get() -> []
t=1: q.get() -> ["A: Hello"]
t=2: q.add("A: Hello")

Now, imagine this conversation between two engineers who are discussing this queue execution history.


A: “There’s something wrong with the queue behavior.”

B: “What do you mean?”

A: “Well, the queue was clearly empty at t=0, and then it had a value at t=1, even though there was no write.”

B: “Yes, there was, at t=2. That write is the reason why the queue read ["A: Hello"] at t=1.”


We would not accept that answer given by B, that the read seen at t=1 was due to the write that happened at t=2. The reason we would reject it is that this violates our notion of causality: the current output of a system cannot depend on its future inputs!
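The objection that engineer A raises can be mechanized. Here is a minimal sketch of a checker that walks a single-process history in time order and flags any read that returns a value which has not been written yet. The tuple format for events and the function name first_causality_violation are assumptions made for this example; real consistency checkers handle concurrent, multi-process histories and are far more involved.

```python
# Minimal sketch: the event format and the check are assumptions for
# illustration, covering only a single-process history in time order.
def first_causality_violation(history):
    """Return (timestamp, unseen_items) for the first read that observes
    a value not yet written, or None if there is no such read.

    `history` is a list of (timestamp, op, value) tuples, where op is
    "add" (value is the item written) or "get" (value is the list read).
    """
    written = set()
    for timestamp, op, value in history:
        if op == "add":
            written.add(value)
        elif op == "get":
            unseen = [item for item in value if item not in written]
            if unseen:
                return timestamp, unseen
    return None


history = [
    (0, "get", []),
    (1, "get", ["A: Hello"]),
    (2, "add", "A: Hello"),
]

print(first_causality_violation(history))
```

The checker points at the read at t=1, which is exactly where A says something is wrong. The later write at t=2 cannot rescue it, because the checker only ever consults writes that came before the read.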

It’s not that we are opposed to the idea of non-causal systems in principle. We’d love to be able to build systems that can see into the future! It’s that such systems are not physically realizable, even though we can build mathematical models of their behavior. If you build a system whose execution histories violate causal consistency, you will be admonished by distributed systems engineers: something has gone wrong somewhere, because that behavior should not be possible. (In practice, what’s happened is that events have gotten reordered, rather than an engineer having accidentally built a system that can see into the future).

In the wake of an incident, we often experience the exact opposite problem: being admonished for failing to be part of a non-causal system. What happens is that someone will make an observation that the failure mode was actually foreseeable, and that engineers erred by not being able to anticipate it. Invariably, the phrase “should have known” will be used to describe this lack of foresight.

The problem is, this type of observation is only possible with knowledge of how things actually turned out. They believe that the outcome was foreseeable because they know that it happened. When you hear someone say “they should have known that…”, what that person is in fact saying is “the system’s behavior in the past failed to take into account future events”.

This sort of observation, while absurd, is seductive. And it happens often enough that researchers have a name for it: hindsight bias, or alternately, creeping determinism. The paper by the engineering researcher Baruch Fischhoff quoted at the top of this post documents a controlled experiment that demonstrates the phenomenon. However, you don’t need to look at the research literature to see this effect. Sadly, it’s all around us.

So, whenever you hear “X should have”, that should raise a red flag, because it’s an implicit claim that it’s possible to build non-causal systems. The distributed systems folks are right to insist on causal consistency. To berate someone for not building an impossible system is pure folly.

Tell me about a time…

Here are some proposed questions for interviewing someone for an SRE role. Really, these are just conversation starters to get them reflecting and discussing specific incident details.

The questions all start the same way: Tell me about a time when…

… action items that were completed in the wake of one incident changed system behavior in a way that ended up contributing to a future incident.

… someone deliberately violated the official change process in order to get work done, and things went poorly.

… someone deliberately violated the official change process in order to get work done, and things went well.

… you were burned by a coincidence (we were unlucky!).

… you were saved by a coincidence (we were lucky!).

… a miscommunication contributed to or exacerbated an incident.

… someone’s knowledge of the system was out of date, and them acting on this out-of-date knowledge contributed to or exacerbated an incident.

… something that was very obvious in hindsight was very confusing in the moment.

… somebody identified that something was wrong by noticing the absence of a signal.

… your system hit a type of limit that you had never breached before.

… you correctly diagnosed a problem “on a hunch”.

On chains and complex systems

Photo by Matthew Lancaster

We know that not all of the services in our system are critical. For example, some of our internal services provide support functions (e.g., observability, analytics), where others provide user enhancements that aren’t strictly necessary for the system to function (e.g., personalization). Given that we have a limited budget to spend on availability (we only get four quarters in a year, and our headcount is very finite), we should spend that budget wisely, by improving the reliability of the critical services.

To crystallize this idea, let’s use the metaphor of a metal chain. Imagine a chain where each link in the chain represents one of the critical services in your system. When one of these critical services fails, the chain breaks, and the system goes down. To improve the availability of your overall system, we need to:

  1. Identify what the critical services in your system are (find the links in the chain).
  2. Focus your resources on hardening those critical services that need it most (strengthen the weakest links).

This is an appealing model, because it gives us a clear path forward on our reliability work. First, we figure out which of our services are the critical ones. You’re probably pretty confident that you’ve identified a subset of these services (including from previous incidents!), but you also know there’s the ever-present risk of a once-noncritical service drifting into criticality. Once you have defined this set, you can prioritize your reliability efforts on shoring up these services, focusing on the ones that are understood to need the most help.

Unfortunately, there’s a problem with this model: complex systems don’t fail the way that chains do. In a complex system, there are an enormous number of couplings between the different components. A service that you think of as non-critical can have surprising impact on a critical service in many different ways. As a simple example, a non-critical service might write bad data into the system that the critical service reads and acts on. The way that a complex system fails is through unexpected patterns of interactions among the components.
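Here is a hypothetical sketch of that bad-data example. The service names, the shared catalog, and the record shape are all invented for illustration. The "critical" checkout path has been hardened against the failure its authors imagined (a missing SKU), but it still breaks when a "non-critical" enrichment job rewrites the records it depends on.

```python
# Hypothetical sketch: service names, the shared catalog, and the record
# shape are invented for illustration.
shared_catalog = {"sku-1": {"price_cents": 500}}


def enrichment_job(catalog):
    """'Non-critical' job that adds personalization tags to catalog entries."""
    for sku in catalog:
        # Bug: rebuilds each record and forgets to carry the price over.
        catalog[sku] = {"tags": ["recommended"]}


def checkout_total(catalog, skus):
    """'Critical' path. It tolerates unknown SKUs, but it assumes that
    every record it does find has a price."""
    total = 0
    for sku in skus:
        record = catalog.get(sku)
        if record is None:
            continue  # the failure mode the authors anticipated
        total += record["price_cents"]
    return total


print(checkout_total(shared_catalog, ["sku-1"]))

enrichment_job(shared_catalog)  # the non-critical service runs

try:
    checkout_total(shared_catalog, ["sku-1"])
except KeyError as err:
    print("checkout failed on missing field:", err)
```

No amount of hardening inside checkout_total against its own anticipated failures would have caught this, because the problem lives in the coupling between the two services rather than in either one alone.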

The space of potential unexpected patterns of interactions is so large as to be effectively unbounded. It simply isn’t possible for a human being to imagine all of the ways that these interactions can lead to a critical service misbehaving. This means that “hardening the critical services” will have limited returns to reliability, because it still leaves you vulnerable to these unexpected interactions.

The chain model is particularly pernicious because the model acts as a filter that shapes a person’s understanding of an incident. If you believe that every incident can be attributed to an insufficiently hardened critical service, you’ll be able to identify that pattern in every incident that happens. And, indeed, you can patch up the problem to prevent the previous incident from happening again. But this perspective won’t help you guard against a different kind of dangerous interaction, one that you never could have imagined.

If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains. Complex systems are made up of webs of interactions, many of which we don’t see. Next time you’re doing a post-incident review, look for these previously hidden webs instead of trying to find the broken link in the chain.

The courage to imagine other failures

All other things being equal, what’s more expensive for your business: a fifteen-minute outage or an eight-hour outage? If you had to pick one, which would you pick? Hold that thought.

Imagine that you work for a company that provides a software service over the internet. A few days ago, your company experienced an incident where the service went down for about four hours. Executives at the company are pretty upset about what happened: “we want to make certain this never happens again” is a phrase you’ve heard several times.

The company held a post-incident review, and the review process identified a number of action items to prevent a recurrence of the incident. Some of this follow-up work has already been completed, but there are other items that are going to take your team a significant amount of time and effort. You already had a decent backlog of reliability work that you had been planning on knocking out this quarter, but this incident has put this other work onto the back burner.

One night, the Oracle of Delphi appears to you in a dream.

Priestess of Delphi (1891) by John Collier

The Oracle tells you that if you prioritize the incident follow-up work, then in a month your system is going to suffer an even worse outage, one that is eight hours long. The failure mode for this outage will be very different from the last one. Ironically, one of the contributors to this outage will be an unintended change in system behavior that was triggered by the follow-up work. Another contributor to this incident will be a known risk to the system that you were working on addressing, but that you had put off to the future after the incident changed priorities.

She goes on to tell you that if you instead do the reliability work that was on your backlog, you will avoid this outage. However, your system will instead experience a fifteen-minute outage, with a failure mode that is very similar to the one you recently experienced. The impact will be much smaller because of the follow-up work that had already been completed, as well as the engineers now being more experienced with this type of failure.

Which path do you choose: the novel eight-hour outage, or the “it happened again!” fifteen-minute outage?

By prioritizing doing preventative work from recent incidents, we are implicitly assuming that a recent incident is the one most likely to bite us again in the future. It’s important to remember that this is an illusion: we feel like the follow-up work is the most important thing we can do for reliability because we have a visceral sense of the incident we just went through. It’s much more real to us than a hypothetical, never-happened-before future incident. Unfortunately, we only have a finite amount of resources to spend on reliability work, and our memory of the recent incident does not mean that the follow-up work is the reliability work which will provide the highest return on investment.

In real life, we are never granted perfect information about the future consequences of our decisions. We have only our own judgment to guide us on how we should prioritize our work based on the known risks. Always prioritizing the action items from the last big incident is the easy path. The harder one is imagining the other types of incidents that might happen in the future, and recognizing that those might actually be worse than a recurrence. After all, you were surprised before. You’re going to be surprised again. That’s the real generalizable lesson of that last big incident.