Operating effectively in high surprise mode

When you deploy a service into production, you need to configure it with enough resources (e.g., CPU, memory) so that it can handle the volume of requests you expect it to receive. You’ll want to provision it so that it can service 100% of the requests when receiving the typical amount of traffic, and you probably want some buffer in there as well.

However, as a good operator, you also know that sometimes your service will receive an unexpected increase in traffic, one large enough to push it beyond the resources you’ve provisioned for it, even with that extra buffer.

When your service is overloaded, even though it can’t service 100% of the requests, you want to design it so that it doesn’t simply keel over and service 0% of the requests. There are well-known patterns for designing a service to degrade gracefully in the face of overload, so that it can still service some requests and doesn’t get so overloaded that it can’t recover once the traffic abates. These patterns include rate limiters and circuit breakers. Michael Nygard’s book Release It! is a great source for this, and the concepts he describes have been implemented in libraries such as Hystrix and Resilience4j.
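To make the load-shedding idea concrete, here’s a minimal sketch of a token-bucket style load shedder in plain Java. This is not the Hystrix or Resilience4j implementation, just an illustration with made-up capacity numbers: when the bucket is empty, excess requests are rejected immediately (so the service keeps servicing some requests) instead of queueing up (and eventually servicing none).

```java
// Minimal token-bucket load shedder sketch. When the bucket is empty,
// incoming requests are rejected right away (e.g., with an HTTP 429)
// rather than piling up and dragging the whole service down.
// The capacity and refill rate are illustrative, not recommendations.
public class LoadShedder {
    private final long capacity;        // maximum burst of requests we accept
    private final long refillPerSecond; // sustained request rate we can handle
    private long tokens;
    private long lastRefillNanos;

    public LoadShedder(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if this request should be serviced, false if it should
    // be shed so the service keeps serving *some* requests under overload.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        long refill = ((now - lastRefillNanos) * refillPerSecond) / 1_000_000_000L;
        if (refill > 0) {
            tokens = Math.min(capacity, tokens + refill);
            lastRefillNanos = now;
        }
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false; // shed load rather than keel over
    }
}
```

In practice you’d reach for a maintained library such as Resilience4j rather than rolling your own, but the shape of the decision is the same: the service decides up front what it will do when it can’t do everything.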

You can think of “expected number of requests” and “too many requests” as two different modes of operation of your service: you want to design it so that it performs well in both modes.

[Figure: A service switching operational modes from “normal amount of requests” to “too many requests”]

Now, imagine that in the graph above, instead of the y-axis being “number of requests seen by the service”, it’s “degree of surprise experienced by the operators”.

As we humans navigate the world, we are constantly taking in sensory input. Imagine if I asked you, at regular intervals, “On a scale of 1-10, how surprised are you by your current observations of the world?”, and plotted your answers on a graph like the one above. During a typical day, the way we experience the world isn’t too surprising. However, every so often, our observations of the world just don’t make sense to us. The things we’re seeing just shouldn’t be happening, given our mental models of how the world works. When it’s the software’s behavior that’s surprising, and that surprising behavior has a significant negative impact on the business, we call it an incident.

And, just like a software service behaves differently under a very high rate of inbound requests than it does under the typical rate, your socio-technical system (which includes your software and your people) is going to behave differently under high levels of surprise than it does under typical levels.

Similarly, just like you can build your software system to deal more effectively with overload, you can also influence your socio-technical system to deal more effectively with surprise. That’s really what the research field of resilience engineering is about: understanding how some socio-technical systems are more effective than others when working in high surprise mode.

It’s important to note that being more effective in high surprise mode is not the same as trying to eliminate surprises in the future. Adding more capacity to your software service enables it to handle more traffic, but it doesn’t help in the situation where the traffic exceeds even those extra resources. Rather, your system needs to be able to change what it does under overload. Similarly, saying “we are going to make sure we handle this scenario in the future” does nothing to improve your system’s ability to function effectively in high surprise mode.

I promise you: your system is going to enter high surprise mode in the future. The number of failure modes you have eliminated does nothing to improve your ability to function well once you’re in that mode. While RCA will eliminate a known failure mode, LFI will help your system function better in high surprise mode.

Normal incidents

In 1984, the late sociologist Charles Perrow published the book Normal Accidents: Living with High-Risk Technologies. In it, he proposed a theory that accidents are unavoidable in systems that have certain properties (in particular, interactive complexity and tight coupling), and that nuclear power plants have these properties. In such systems, accidents will inevitably occur during the normal course of operations.

You don’t hear much about Perrow’s Normal Accident Theory these days, as it has been superseded by other theories in safety science, such as High Reliability Organizations and Resilience Engineering (although see Hopkins’s 2013 paper Issues in safety science for criticisms of all three theories). But even if you reject the specifics of Perrow’s theory, the idea of a normal accident, or a normal incident, is a useful one.

An incident is an abnormal event. Because of this, we assume, reasonably, that an incident must have an abnormal cause: something must have gone wrong in order for the incident to have happened. And so we look to find where the abnormal work was, where it was that someone exercised poor judgment that ultimately led to the incident.

But incidents can happen as a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.

This concept, that all actions and decisions that contributed to an incident were reasonable in the moment they were made, is unintuitive. It requires a very different conceptual model of how incidents happen. But, once you adopt this conceptual model, it completely changes the way you understand incidents. You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?” And this yields very different insights into how the system actually works, how it is that incidents don’t usually happen due to normal work, and how it is that they occasionally do.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA

I think the value of the LFI approach rests on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: no individual in the organization will ever have an accurate mental model of how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software (and its behavior) and the humans who do the work of developing and operating it.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are things that, say, Alice knows but Bob doesn’t, where Alice doesn’t even realize that Bob doesn’t know them.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).
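As a concrete illustration of that last example, here’s a minimal sketch using Java’s built-in java.net.http.HttpClient, which, if you don’t set timeouts explicitly, will wait far longer than most callers can tolerate. The endpoint URL and the durations are made up for illustration, not recommendations:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ExplicitTimeouts {
    public static void main(String[] args) throws Exception {
        // Set an explicit connect timeout rather than relying on the default,
        // which can leave the caller hanging for far longer than is useful.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))  // illustrative value
                .build();

        // Bound the whole request as well; the URL here is a placeholder.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api"))
                .timeout(Duration.ofSeconds(5))         // illustrative value
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

The point isn’t these particular numbers; it’s that an engineer who has lived through an incident where a missing timeout mattered is far more likely to write those lines at all.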

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.