Green is the color of complacency

Here are a few anecdotes about safety from the past few years.

In 2020, the world was struck by the COVID-19 pandemic. The U.S. response was… not great. Earlier, in 2019, before the pandemic struck, the Johns Hopkins Center for Health Security released a pandemic preparedness assessment that ranked 195 countries on how well prepared they were to deal with a pandemic. The U.S. was ranked number one: it was identified as the most well-prepared country on earth.

With its pandemic playbook, “The U.S. was very well prepared,” said Eric Toner, senior scholar at the Johns Hopkins Center for Health Security. “What happened is that we didn’t do what we said we’d do. That’s where everything fell apart. We ended up being the best prepared and having one of the worst outcomes.”

On October 29, 2018, Lion Air Flight 610 crashed 13 minutes after takeoff, killing everyone on board. This plane was a Boeing 737 MAX, and a second 737 MAX had a fatal crash a few months later. Seven days prior to the Lion Air crash, the National Safety Council presented the Boeing Company with the Robert W. Campbell Award for leadership in safety:

“The Boeing Company is a leader in one of the most safety-centric industries in the world,” said Deborah A.P. Hersman, president and CEO of the National Safety Council. “Its innovative approaches to EHS excellence make it an ideal recipient of our most prestigious safety award. We are proud to honor them, and we appreciate their commitment to making our world safer.”

On April 20th, 2010, an explosion on the Deepwater Horizon offshore drilling rig killed eleven workers and led to the largest marine oil spill in the history of the industry. The year before, the U.S. Minerals Management Service issued its SAFE award to Deepwater Horizon:

MMS issued its SAFE award to Transocean for its performance in 2008, crediting the company’s “outstanding drilling operations” and a “perfect performance period.” Transocean spokesman Guy Cantwell told ABC News the awards recognized a spotless record during repeated MMS inspections, and should be taken as evidence of the company’s longstanding commitment to safety.

When things are going badly, everybody in the org knows it. If you go into an organization where high-severity incidents are happening on a regular basis, where everyone is constantly in firefighting mode, then you don’t need metrics to tell you how bad things are: it’s obvious to everyone, up and down the chain. The problems are all too visible. Everybody can feel them viscerally.

It’s when things aren’t always on fire that it can be very difficult to assess whether we need to allocate additional resources to reduce risk. As the examples above show, an absence of incidents does not indicate an absence of risk. In fact, these quiet times can lull us into a sense of complacency, leading us to think that we’re in a good spot, when the truth is that there’s a significant risk hidden beneath the surface.

Personally, I don’t believe it’s even possible to say with confidence that “everything is ok right now”. As the cases above demonstrate, when things are quiet, there’s a limit to how well we can actually assess the risk based on the kinds of data we traditionally collect.

So, should you be worried about your system? If you find yourself constantly in firefighting mode, then, yes, you should be worried. And if things are running smoothly, and the availability metrics are all green? Then, also yes, you should be worried. You should always be worried. The next major incident is always just around the corner, no matter how high your ranking is, or how many awards you get.

The perils of outcome-based analysis

Imagine you wanted to understand how to get better at playing the lottery. You strike upon a research approach: study previous lottery winners! You collect a list of winners, look them up, interview them about how they go about choosing their numbers, collate this data, identify patterns, and use these to define strategies for picking numbers.

The problem with this approach is that it doesn’t tell you anything about how effective these strategies actually are. To really know how well these strategies work, you’d have to look at the entire population of people who employed them. For example, say that you find that most lottery winners use their birthdays to generate winning numbers. It may turn out that for every winning ticket that has the ticket holder’s birthday, there are 20 million losing tickets that also have the ticket holder’s birthday. To understand a strategy’s effectiveness, you can’t just look at the winning outcomes: you have to look at the losing outcomes as well. The technical term for this type of analytic error is selecting on the dependent variable.

Here’s another example of this error in reasoning: according to the NHTSA, 32% of all traffic crash fatalities in the United States involve drunk drivers. That means that 68% of all traffic crash fatalities involve sober drivers. If you only look at scenarios that involve crash fatalities, it looks like being sober is twice as dangerous as being drunk! It’s a case of only looking at the dependent variable: crash fatalities. If we were to look at all driving scenarios, we’d see that there are a lot more sober drivers than drunk drivers, and that any given sober driver is less likely to get into a crash fatality than a given drunk driver. Being sober is safer, even though sober drivers appear more often in fatal accidents than drunk drivers.
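The arithmetic behind this is easy to run yourself. Here’s a small sketch with made-up numbers (the driver counts and per-driver crash rates are illustrative assumptions, not NHTSA data) showing how sober drivers can dominate the fatality counts while still being far safer per driver:

```python
# Hypothetical numbers to illustrate why conditioning only on fatalities misleads.
# Assume 1,000,000 drivers on the road: 2% drunk, 98% sober.
drunk_drivers = 20_000
sober_drivers = 980_000

# Assumed (made-up) per-driver fatal-crash rates.
p_fatal_given_drunk = 0.0016
p_fatal_given_sober = 0.00005

fatal_drunk = drunk_drivers * p_fatal_given_drunk   # 32 fatal crashes
fatal_sober = sober_drivers * p_fatal_given_sober   # 49 fatal crashes

total = fatal_drunk + fatal_sober
# Sober drivers account for the majority of fatalities...
print(f"Share of fatalities involving sober drivers: {fatal_sober / total:.0%}")
# ...yet each drunk driver is far riskier than each sober driver.
print(f"Risk ratio, drunk vs. sober: {p_fatal_given_drunk / p_fatal_given_sober:.0f}x")
```

Conditioning on the outcome (fatal crashes) hides the base rates (how many drivers of each kind are on the road), which is exactly the information you need.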

Now, imagine an organization that holds a weekly lottery. But it’s a bizarro-world type of lottery: if someone wins, then they receive a bad outcome instead of a good one. And the bad outcome doesn’t just impact the “winner” (although they are impacted the most), it has negative consequences for the entire organization. Nobody would willingly participate in such a lottery, but everyone in the organization is required to: you can’t opt out. Every week, you have to buy a ticket, and hope the numbers you picked don’t come up.

The organization wants to avoid these negative outcomes, and so they try to identify patterns in how previous lottery “winners” picked their numbers, so that they can reduce the likelihood of future lottery wins by warning people against using these dangerous number-picking strategies.

At this point, the comparison to how we treat incidents should be obvious. If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

The real-world case is even worse than the lottery case: lotteries really are random, but the way that people do their work isn’t; rather, it’s adaptive. People do work in specific ways because they have found that it’s an effective way to get stuff done given the constraints that they are under. The only way to really understand why people work the way they do is to understand how those adaptations usually succeed. Unless you’re really looking for it, you aren’t going to be able to learn how people develop successful adaptations if you only ever examine the adaptations when they fail. Otherwise, you’re just doing the moral equivalent of asking what lottery winners have in common.

The problem with invariants is that they change over time

Cliff L. Biffle blogged a great write-up of a debugging odyssey at Oxide with the title Who killed the network switch? Here’s the bit that jumped out at me:

At the time that code was written, it was correct, but it embodied the assumption that any loaned memory would fit into one region.

That assumption became obsolete the moment that Matt implemented task packing, but we didn’t notice. This code, which was still simple and easy to read, was now also wrong.

This type of assumption is an example of an invariant, a property of the system that is supposed to be guaranteed to not change over time. Invariants play an important role in formal methods (for example, see the section Writing an invariant in Hillel Wayne’s Learn TLA+ site).

Now, consider the following:

  • Our systems change over time. In particular, we will always make modifications to support new functionality that we could not have foreseen earlier in the lifecycle of the system.
  • Our code often rests on a number of invariants, properties that are currently true of our system and that we assume will always be true.
  • These invariants are implicit: the assumptions themselves are not explicitly represented in the source code. That means there’s no easy way to, say, mechanically extract them via static analysis.
  • A change that violates an assumed invariant can happen arbitrarily far away from the code that depends on that invariant to function properly.
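Here’s a toy sketch of the mechanism (not Oxide’s actual code; the names and the message-splitting change are invented for illustration). The first function is written correctly under an implicit invariant; a later change elsewhere quietly invalidates it:

```python
# Early on, every message fits in a single fixed-size buffer, so this
# reader was written - correctly, at the time - to read just one buffer.

BUFFER_SIZE = 8

def read_message(buffers):
    # Implicit invariant: a message never spans more than one buffer.
    # Nothing in this function checks or documents that assumption.
    return buffers[0]

# Later, a change elsewhere in the system starts splitting large messages
# across multiple buffers. read_message is unchanged, still simple, still
# easy to read - and now silently wrong for any long message.
def store_message(msg):
    return [msg[i:i + BUFFER_SIZE] for i in range(0, len(msg), BUFFER_SIZE)]

print(read_message(store_message("short")))                  # fine
print(read_message(store_message("a much longer message")))  # truncated!
```

Nothing in `read_message` would show up in a diff or a code review of the later change; the assumption lives only in the heads of the people who wrote it.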

What this means is that these kinds of failure modes are inevitable. If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change. If your system lives long enough, it’s going to change. And one of those changes is eventually going to invalidate an assumption that somebody made long ago, which was a reasonable assumption to make at the time.

Implicit invariants are, by definition, impossible to enforce explicitly. They are time bombs. And they are everywhere.

What if everybody did everything right?

In the wake of an incident, we want to answer the questions “What happened?” and, afterwards, “What should we do differently going forward?” Invariably, this leads to people trying to answer the question “what went wrong?”, or, even more specifically, the two questions:

  • What did we do wrong here?
  • What didn’t we do that we should have?

There’s an implicit assumption behind these questions: because there was a bad outcome, there must have been a bad action (or an absence of a good action) that led to that outcome. It’s such a natural conclusion to reach that I’ve only ever seen it questioned by people who have been exposed to concepts from resilience engineering.

In some sense, this belief in bad outcomes from bad actions is like Aristotle’s claim that heavier objects fall faster than lighter ones. Intuitively, it seems obvious, but our intuitions lead us astray. But in another sense, it’s quite different, because it’s not something we can test by running an experiment. Instead, the idea that systems fail because somebody did something wrong (or didn’t do something right) is more like a lens or a frame: a perspective, a way of making sense of the incident. It’s like how the fields of economics, psychology, and sociology act as different lenses for making sense of the world: a sociological explanation of a phenomenon (say, the First World War) will be different from an economic explanation, and we will get different insights from the different lenses.

An alternative lens for making sense of an incident is to ask the question “how did this incident happen, assuming that everybody did everything right?” In other words, assume that everybody whose actions contributed to the incident made the best possible decision based on the information they had, and the constraints and incentives that were imposed upon them.

Looking at the incident from this perspective will yield very different kinds of insights, because it will generate different types of questions, such as:

  • What information did people know in the moment?
  • What were the constraints that people were operating under?

Now, I personally believe that the second perspective is strictly superior to the first, but I acknowledge that this is a judgment based on personal experience. However, even if you think the first perspective also has merit, if you truly want to maximize the amount of insight you get from a post-incident analysis, then I encourage you to try the second perspective as well. Pose the question: “Let’s assume everybody did everything right. How could this incident still have happened?” I guarantee you’ll learn something new about your system that you didn’t know before.

Negotiability is non-negotiable

I’ve just started reading Trust in Numbers: The Pursuit of Objectivity in Science and Public Life by the historian of science Theodore Porter, and so far it’s fantastic. The first chapter discusses how, in the days before the metric system, even units of measure were negotiable. Porter gives two examples. One example comes from Poland, where the size of a unit of land would sometimes vary based on the quality of the soil, to make adjustments to equalize the amount of productive land, rather than simply the area of the land.

The other example Porter gives is about the reference vessel that towns would use as their local definition of a bushel. You might think that defining a specific vessel as the bushel would give a fixed definition, but there was still flexibility. The amount of, say, grain or oats that could be poured into the vessel could vary, depending on how it was poured (e.g., whether the material was “flattened” or “heaped”, the height it was poured from). This allowed people to adjust the actual volume that constituted a bushel based on factors such as quality.

We humans have to build systems in order to scale up certain kinds of work: we couldn’t have large-scale organizations like governments and corporations without the technologies of bureaucracies and other forms of standardization. This is the sort of thing that James Scott calls legibility. But these formal systems demand fixed rules, which can never fully accommodate the messiness that comes with operating in the real world. And so, the people at the sharp end, the folks on the ground who are doing the actual work of making the system go, have to deal with this messiness that the system’s designers did not account for.

For these people to be able to do their work, there needs to be some give in the system. The people need to be able to exercise judgment, some ability to negotiate the boundaries of the rules of the system. Every human system is like this: the actors must have some amount of flexibility, otherwise the rules of the system will prevent people from being able to achieve the goals of the system. In other words, negotiability in a system is non-negotiable.

You should’ve known how to build a non-causal system

Reporting an outcome’s occurrence consistently increases its perceived likelihood and alters the judged relevance of data describing the situation preceding the event.

Baruch Fischhoff, “Hindsight ≠ foresight: the effect of outcome knowledge on judgment under uncertainty”, Journal of Experimental Psychology: Human Perception and Performance, 1975, Volume 1, pages 288–299

In my last blog post, I wrote about how computer scientists use execution histories to reason about consistency properties of distributed data structures. One class of consistency properties is known as causal consistency. In my post, I used an example that shows a violation of causal consistency, specifically of a property called writes follow reads.

Here’s the example I used, with timestamps added (note: this is a single-process example, there’s no multi-process concurrency here).

t=0: q.get() -> []
t=1: q.get() -> ["A: Hello"]
t=2: q.add("A: Hello")

Now, imagine this conversation between two engineers who are discussing this queue execution history.


A: “There’s something wrong with the queue behavior.”

B: “What do you mean?”

A: “Well, the queue was clearly empty at t=0, and then it had a value at t=1, even though there was no write.”

B: “Yes, there was, at t=2. That write is the reason why the queue read [“A: Hello”] at t=1.”


We would not accept the answer given by B, that the read seen at t=1 was due to the write that happened at t=2. The reason we would reject it is that it violates our notion of causality: the current output of a system cannot depend on its future inputs!

It’s not that we are opposed to the idea of non-causal systems in principle. We’d love to be able to build systems that can see into the future! It’s that such systems are not physically realizable, even though we can build mathematical models of their behavior. If you build a system whose execution histories violate causal consistency, you will be admonished by distributed systems engineers: something has gone wrong somewhere, because that behavior should not be possible. (In practice, what’s happened is that events have gotten reordered, rather than an engineer having accidentally built a system that can see into the future.)
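The “no reading the future” intuition is simple enough to mechanize. Here’s a minimal sketch (the representation of the history is my own, and real consistency checkers are far more involved) that scans a timestamped execution history like the one above and flags any read that returns a value whose write only appears later:

```python
def find_causality_violations(history):
    """history: list of (t, op, value) where op is 'add' or 'get'.
    For a 'get', value is the list of items the read returned.
    Returns (t, item) pairs where a read saw a not-yet-written item."""
    violations = []
    written_so_far = set()
    for t, op, value in sorted(history):   # walk the history in time order
        if op == "add":
            written_so_far.add(value)
        else:
            for item in value:
                if item not in written_so_far:
                    violations.append((t, item))
    return violations

# The execution history from the example above:
history = [
    (0, "get", []),
    (1, "get", ["A: Hello"]),   # reads a value...
    (2, "add", "A: Hello"),     # ...that is only written at t=2
]
print(find_causality_violations(history))  # [(1, 'A: Hello')]
```

Reordering the history so the add precedes the read makes the violation disappear, which is exactly the repair that “events got reordered” implies.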

In the wake of an incident, we often experience the exact opposite problem: being admonished for failing to be part of a non-causal system. What happens is that someone will make an observation that the failure mode was actually foreseeable, and that engineers erred by not being able to anticipate it. Invariably, the phrase “should have known” will be used to describe this lack of foresight.

The problem is, this type of observation is only possible with knowledge of how things actually turned out. They believe that the outcome was foreseeable because they know that it happened. When you hear someone say “they should have known that…”, what that person is in fact saying is “the system’s behavior in the past failed to take into account future events”.

This sort of observation, while absurd, is seductive. And it happens often enough that researchers have a name for it: hindsight bias, or alternately, creeping determinism. The paper by the psychologist Baruch Fischhoff quoted at the top of this post documents a controlled experiment that demonstrates the phenomenon. However, you don’t need to look at the research literature to see this effect. Sadly, it’s all around us.

So, whenever you hear “X should have”, that should raise a red flag, because it’s an implicit claim that it’s possible to build non-causal systems. The distributed systems folks are right to insist on causal consistency. To berate someone for not building an impossible system is pure folly.

The inherent weirdness of system behavior

All implementations of mutable state in a geographically distributed system are either slow (require coordination when updating data) or weird (provide weak consistency only).

Sebastian Burckhardt, Principles of Eventual Consistency

The Generalized Uncertainty Principle (G.U.P.): Systems display antics. Alternatively: Complex systems exhibit unexpected behavior.

John Gall, The Systems Bible

When systems or organizations don’t work the way you think they should, it is generally not because the people in them are stupid or evil. It is because they are operating according to structures and incentives that aren’t obvious from the outside.

Jennifer Pahlka, Recoding America

It is also counterproductive by encouraging researchers and consultants and organizations to treat errors as a thing associated with people as a component — the reification fallacy (a kind of over-simplification), treating a set of interacting dynamic processes as if they were a single process.

David Woods, Sidney Dekker, Richard Cook, Leila Johannesen, Nadine Sarter, Behind Human Error

We humans solve problems by engineering systems. In a sense, a system is the opposite of a classical atom. Where an atom was conceived of as an indivisible entity, a system is made up of a set of interacting components. These components are organized in such a way that the overall system accomplishes a useful set of functions as conceived of by the designers.

Unfortunately, it’s impossible to build a perfect complex system. It’s also the case that we humans are very bad at reasoning about the behavior of unfamiliar complex systems when they deviate from our expectations.

The notion of consistency in distributed systems is a great example of this. Because distributed systems are, well, systems, they can exhibit behaviors that wouldn’t happen with atomic systems. The most intuitive notion of consistency, called linearizability, basically means “this concurrent data structure behaves the way you would expect a sequential data structure to behave”. And linearizability doesn’t even encompass everything! It’s only meaningful if there is a notion of a global clock (which isn’t the case in a distributed system), and it also only covers the case of single objects, which means it doesn’t cover transactions across multiple objects. However, ensuring linearizability is difficult enough that we typically need to relax our consistency requirements when we build distributed systems, which means we need to choose a weaker model.

What I love about consistency models is that they aren’t treated as correctness models. Instead, they’re weirdness models: different levels of consistency will violate our intuitions relative to linearizability, and we need to choose the level of weirdness that we can actually implement and that is good enough for our application.
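To make the weirdness concrete, here’s a toy sketch of a weakly consistent replicated register (the class and method names are my own invention, not any particular system’s API). Replication is deferred rather than synchronous, so a read on one replica can return a stale value that linearizability would forbid:

```python
class Replica:
    """A toy replica of a single register, with deferred replication."""
    def __init__(self):
        self.value = None
        self.pending = []

    def write(self, v, peers):
        self.value = v
        for p in peers:
            p.pending.append(v)   # replicate later, not now

    def read(self):
        return self.value

    def sync(self):
        # Apply queued updates; after this, the replicas converge.
        for v in self.pending:
            self.value = v
        self.pending.clear()

r1, r2 = Replica(), Replica()
r1.write("x = 1", peers=[r2])
print(r1.read())  # 'x = 1'
print(r2.read())  # None: a stale read - weird, but allowed by the model
r2.sync()
print(r2.read())  # 'x = 1': eventually consistent
```

The question the model answers is not “is this behavior correct?” but “is this level of weirdness acceptable for our application?”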

These sorts of consistency problems, where systems exhibit behaviors that violate our intuitions, are not specific to distributed software systems. In some cases, the weirdness of the system behavior leads to a negative outcome, the sort of thing that we call an incident. Often the negative outcome is attributed to the behavior of an individual agent within the system, where it gets labeled as “human error”. But as Woods et al. point out in the quote above, this attribution is based on an incorrect assumption about how systems actually behave.

The problem isn’t the people within the system. The weirdness arises from the interactions.

Tell me about a time…

Here are some proposed questions for interviewing someone for an SRE role. Really, these are just conversation starters to get the candidate reflecting on and discussing specific incidents in detail.

The questions all start the same way: Tell me about a time when…

… action items that were completed in the wake of one incident changed system behavior in a way that ended up contributing to a future incident.

… someone deliberately violated the official change process in order to get work done, and things went poorly.

… someone deliberately violated the official change process in order to get work done, and things went well.

… you were burned by a coincidence (we were unlucky!).

… you were saved by a coincidence (we were lucky!).

… a miscommunication contributed to or exacerbated an incident.

… someone’s knowledge of the system was out of date, and them acting on this out-of-date knowledge contributed to or exacerbated an incident.

… something that was very obvious in hindsight was very confusing in the moment.

… somebody identified that something was wrong by noticing the absence of a signal.

… your system hit a type of limit that you had never breached before.

… you correctly diagnosed a problem “on a hunch”.

On chains and complex systems


We know that not all of the services in our system are critical. For example, some of our internal services provide support functions (e.g., observability, analytics), while others provide user enhancements that aren’t strictly necessary for the system to function (e.g., personalization). Given that we have a limited budget to spend on availability (we only get four quarters in a year, and our headcount is very finite), we should spend that budget wisely, by improving the reliability of the critical services.

To crystallize this idea, let’s use the metaphor of a metal chain. Imagine a chain where each link represents one of the critical services in your system. When one of these critical services fails, the chain breaks, and the system goes down. To improve the availability of your overall system, you need to:

  1. Identify what the critical services in your system are (find the links in the chain).
  2. Focus your resources on hardening those critical services that need it most (strengthen the weakest links).

This is an appealing model, because it gives us a clear path forward on our reliability work. First, we figure out which of our services are the critical ones. You’re probably pretty confident that you’ve identified a subset of these services (including from previous incidents!), but you also know there’s the ever-present risk of a once-noncritical service drifting into criticality. Once you have defined this set, you can prioritize your reliability efforts on shoring up these services, focusing on the ones that are understood to need the most help.

Unfortunately, there’s a problem with this model: complex systems don’t fail the way that chains do. In a complex system, there are an enormous number of couplings between the different components. A service that you think of as non-critical can have a surprising impact on a critical service in many different ways. As a simple example, a non-critical service might write bad data into the system that the critical service reads and acts on. The way that a complex system fails is through unexpected patterns of interactions among the components.

The space of potential unexpected patterns of interactions is so large as to be effectively unbounded. It simply isn’t possible for a human being to imagine all of the ways that these interactions can lead to a critical service misbehaving. This means that “hardening the critical services” will have limited returns to reliability, because it still leaves you vulnerable to these unexpected interactions.

The chain model is particularly pernicious because the model acts as a filter that shapes a person’s understanding of an incident. If you believe that every incident can be attributed to an insufficiently hardened critical service, you’ll be able to identify that pattern in every incident that happens. And, indeed, you can patch up the problem to prevent the previous incident from happening again. But this perspective won’t help you guard against a different kind of dangerous interaction, one that you never could have imagined.

If you really want to understand how complex systems fail, you need to think in terms of webs rather than chains. Complex systems are made up of webs of interactions, many of which we don’t see. Next time you’re doing a post-incident review, look for these previously hidden webs instead of trying to find the broken link in the chain.

The courage to imagine other failures

All other things being equal, what’s more expensive for your business: a fifteen-minute outage or an eight-hour outage? If you had to pick one, which would you pick? Hold that thought.

Imagine that you work for a company that provides a software service over the internet. A few days ago, your company experienced an incident where the service went down for about four hours. Executives at the company are pretty upset about what happened: “we want to make certain this never happens again” is a phrase you’ve heard several times.

The company held a post-incident review, and the review process identified a number of action items to prevent a recurrence of the incident. Some of this follow-up work has already been completed, but there are other items that are going to take your team a significant amount of time and effort. You already had a decent backlog of reliability work that you had been planning on knocking out this quarter, but this incident has pushed that work onto the back burner.

One night, the Oracle of Delphi appears to you in a dream.

Priestess of Delphi (1891) by John Collier

The Oracle tells you that if you prioritize the incident follow-up work, then in a month your system is going to suffer an even worse outage, one that is eight hours long. The failure mode for this outage will be very different from the last one. Ironically, one of the contributors to this outage will be an unintended change in system behavior that was triggered by the follow-up work. Another contributor will be a known risk to the system that you had been working on addressing, but had put off after the incident changed priorities.

She goes on to tell you that if you instead do the reliability work that was on your backlog, you will avoid this outage. However, your system will instead experience a fifteen minute outage, with a failure mode that was very similar to the one you recently experienced. The impact will be much smaller because of the follow-up work that had already been completed, as well as the engineers now being more experienced with this type of failure.

Which path do you choose: the novel eight-hour outage, or the “it happened again!” fifteen-minute outage?

By prioritizing preventative work from recent incidents, we are implicitly assuming that a recent incident is the one most likely to bite us again in the future. It’s important to remember that this is an illusion: we feel like the follow-up work is the most important thing we can do for reliability because we have a visceral sense of the incident we just went through. It’s much more real to us than a hypothetical, never-happened-before future incident. Unfortunately, we only have a finite amount of resources to spend on reliability work, and our memory of the recent incident does not mean that the follow-up work is the reliability work that will provide the highest return on investment.

In real life, we are never granted perfect information about the future consequences of our decisions. We have only our own judgment to guide us on how we should prioritize our work based on the known risks. Always prioritizing the action items from the last big incident is the easy path. The harder one is imagining the other types of incidents that might happen in the future, and recognizing that those might actually be worse than a recurrence. After all, you were surprised before. You’re going to be surprised again. That’s the real generalizable lesson of that last big incident.