Have you seen this before?

Whenever I interview someone after an incident, a question I always try to ask is “have you ever seen a failure mode like this before?”

If the engineer says, “yes”, then I will ask follow-up questions about what happened the last time they encountered something similar, and how long ago that happened. Experienced engineers’ perceptions are shaped by…well…their experiences, and learning about how they encountered a similar issue previously helps me understand how they reacted this time (e.g., why they looked in a log file for a particular error message, or why they reached out to a specific individual over Slack).

If the engineer says “no”, that tells me that the engineer was facing a novel failure mode. This is also a useful bit of context, because I want to learn how expert engineers deal with situations they haven’t previously encountered. How do they try to make sense of these signals they don’t recognize? Where do they look to gather more information? Who do they reach out to?

This is the sort of information that people are happy to share with you, but you have to ask for it: they’re unlikely to volunteer it spontaneously, because they don’t realize how relevant it is to understanding the incident.

There is no escape from the adaptive universe

If I had to pick just one idea from the field of resilience engineering that has influenced me the most, it would be David Woods’s notion of the adaptive universe. In his 2018 paper titled The theory of graceful extensibility: basic rules that govern adaptive systems, Woods describes the two assumptions [1] of the adaptive universe:

  1. Resources are always finite.
  2. Change is ongoing.

That’s it! Just two simple assertions, but so much flows from them.

At first glance, the assumptions sound banal. Nobody believes in infinite resources! Nobody believes that things will stop changing! Yet, when we design our systems, it’s remarkable how often we don’t take these into account.

The future is always going to involve changes to our system that we could not foresee at design time, and those changes are always going to be made in a context where resources (e.g., time, headcount) are limited, and so tradeoffs will have to be made. Yet we tell ourselves a story about how next time, we’re going to build it right. But we won’t, because next time we’ll also be resource-constrained, and so we’ll have to make some decisions for reasons of expediency. And next time, the system will also change in ways we could never have predicted, invalidating our design assumptions.

Because we are forever trapped in the adaptive universe.

[1] If you watch Woods’s online resilience engineering short course, which precedes this paper, he mentions a third property: surprise is fundamental. But I think this property is a consequence of the first two assumptions rather than requiring an additional assumption, and I suspect that’s why he doesn’t mention it as an assumption in his 2018 paper.

There is no escape from Ashby’s Law

[V]ariety can destroy variety

W. Ross Ashby

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.

Hamlet (1.5.167-8)

In his book An Introduction to Cybernetics, published in 1956, the English psychiatrist W. Ross Ashby proposed the Law of Requisite Variety. His original formulation isn’t easy to extract into a blog post, but the Principia Cybernetica website has a pretty good definition:

The larger the variety of actions available to a control system, the larger the variety of perturbations it is able to compensate.

Like many concepts in systems thinking, the Law of Requisite Variety is quite abstract, which makes it hard to get a handle on. Here’s a concrete example I find useful for thinking about it.

Imagine you’re trying to balance a broomstick on your hand.

This is an inherently unstable system, and so you have to keep moving your hand around to keep the broomstick balanced, but you can do it. You’re acting as a control system to keep the broomstick up.

If you constrain the broomstick to have only one degree of freedom, you have what’s called the inverted pendulum problem, which is a classic control systems problem. Here’s a diagram:

From the Wikipedia Inverted pendulum article

The goal is to move the cart in order to keep the pendulum balanced. If you have sensor information that measures the tilt angle, θ, you can use that data to build a control system to push on the cart in order to keep the pendulum from falling over. Information about the tilt angle is part of the model that the control system has about the physical system it’s trying to control.
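To make this concrete, here’s a toy sketch of such a control loop: a proportional-derivative controller that pushes on the cart in response to the measured tilt angle. The gains, constants, and function names are mine, chosen for illustration, not a tuned or realistic controller.

```python
# A toy sketch (illustrative gains, not a tuned controller) of a
# proportional-derivative control loop balancing a linearized
# inverted pendulum by reacting to the measured tilt angle theta.

def simulate(kp=40.0, kd=8.0, theta0=0.2, dt=0.001, steps=5000):
    g, length = 9.81, 1.0          # gravity (m/s^2), pendulum length (m)
    theta, omega = theta0, 0.0     # tilt angle (rad) and angular rate (rad/s)
    for _ in range(steps):
        # Control law: push the cart in proportion to the tilt and its rate.
        u = -(kp * theta + kd * omega)
        # Linearized dynamics: gravity destabilizes; the cart's push corrects.
        alpha = (g / length) * theta + u
        omega += alpha * dt
        theta += omega * dt
    return theta

print(f"tilt after 5 seconds: {simulate():.6f} rad")
```

Because the controller observes the tilt angle and reacts to it, the initial tilt is damped out: the system being controlled never gets into a state the controller doesn’t model.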

Now, imagine that the pendulum isn’t constrained to only one degree of freedom, but it now has two degrees of freedom: this is the situation when you’re balancing a broom on your hand. There are now two tilt angles to worry about: it can fall towards/away from your body, or it can fall left/right.

You can’t use the original inverted pendulum control system to solve this problem, because it only models one of the tilt angles. It’s as if you could only move your hand forward and back, but not left or right: the control system can’t correct for the other angle, and the pendulum will fall over.

The problem is that the new system can vary in ways that the control system wasn’t designed to handle: it can get into states that aren’t modeled by the original system.

This is what the Law of Requisite Variety is about: if you want to build a control system, the control system needs to be able to model every possible state that the system being controlled can get into: the state space of the control system has to be at least as large as the state space of the physical system. If it isn’t, then the physical system can get into states that the control system won’t be able to deal with.
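Here’s a hypothetical illustration of that state-space mismatch, extending the toy controller above to a system with a second, unmodeled tilt axis. Again, all names and constants are mine, for illustration only: the axis the controller models is driven toward zero, while the axis outside its model behaves like an uncontrolled inverted pendulum.

```python
# A toy illustration (names and constants are mine) of insufficient variety:
# the controller models only one tilt axis, so the second axis behaves like
# an uncontrolled inverted pendulum and diverges.

def simulate_2dof(dt=0.001, steps=3000):
    g, length = 9.81, 1.0
    theta_x, omega_x = 0.1, 0.0   # the axis the controller observes
    theta_y, omega_y = 0.1, 0.0   # the axis outside the controller's model
    for _ in range(steps):
        u = -(40.0 * theta_x + 8.0 * omega_x)   # control acts on axis x only
        omega_x += ((g / length) * theta_x + u) * dt
        theta_x += omega_x * dt
        omega_y += (g / length) * theta_y * dt  # axis y gets no control input
        theta_y += omega_y * dt
    return theta_x, theta_y

theta_x, theta_y = simulate_2dof()
print(f"modeled axis: {theta_x:.6f} rad, unmodeled axis: {theta_y:.1f} rad")
```

The controller isn’t buggy in any ordinary sense; it simply has less variety than the system it’s trying to control, and so there are states it cannot compensate for.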

Bringing this into the software world: when we build infrastructure software, we’re invariably building control systems. These control systems can only handle the situations they were designed for. We invariably run into trouble when the systems we build get into states that the designer never imagined happening. A fun example of this is a pathological traffic pattern.

The fundamental problem with building software control systems is that we humans aren’t capable of imagining all possible states that the systems being controlled can get into. In particular, we can’t imagine the changes that people are going to make in the future that will create new states that we simply could not ever imagine needing to handle. And so, our control systems will invariably be inadequate, because they won’t be able to handle these situations. The variety of the world exceeds the variety our control systems are designed to handle.

Fortunately, we humans are capable of conceiving of a much wider variety of system states than the systems we build. That’s why, when our software-based control systems fail and the humans get paged in, the humans are eventually able to make sense of what state the system has gotten itself into and put things right.

Even we humans are not exempt from Ashby’s Law. But we can revise our (mental) models of the system in ways that our software-based control systems cannot, and that’s why we can deal effectively with incidents. Because we can update our models, we can adapt where software cannot.

The downsides of expertise

I’m a strong advocate of the value of expertise to a software organization. I’d even go so far as to say that expertise is a panacea.

Despite the value of expertise, there are two significant obstacles to an organization leveraging expertise as effectively as possible.

Expertise is expensive to acquire

Expertise is expensive for an organization to acquire. Becoming an expert requires experience, which takes time and effort. An organization can hire for some forms of expertise, but no organization can hire someone who is already an expert in the org’s socio-technical system. And a lot of the value for an organization lies in expertise in the behaviors of the local system.

You can transfer expertise from one person to another, but that also takes time and effort, and you need to put mechanisms in place to support it. Apprenticeship and coaching are two traditional methods of expertise transfer, but these aren’t typically present in software organizations. I’m an advocate of learning from incidents as a medium for skill transfer, but that requires its own expertise: doing incident investigation in a way that supports skill transfer.

Alas, you can’t transfer expertise from a person to a tool, as John Allspaw notes, so we can’t take a shortcut by acquiring sophisticated tooling. AI researchers tried building such expert systems in the 1980s, but these efforts failed.

Concentrated expertise is dangerous

Organizations tend to foster local experts: a small number of individuals who have a lot of expertise with aspects of the local system. These people are enormously valuable to organizations (they’re often very helpful during incidents), but they represent single points of failure. If these individuals happen to be out of the office during a critical incident, or if they leave the company, it can be very costly to the organization. My former colleague Nora Jones calls this the islands of knowledge problem.

What’s worse, high concentration of expertise can become a positive feedback loop. If there’s a local expert, then other individuals may use the expert as a crutch, relying on the expert to solve the harder problems and never putting in the effort to develop their own expertise.

To avoid this problem, we need to develop the expertise in more people within the organization, which, as mentioned earlier, is expensive.

I continue to believe that it’s worth it.

Getting into people’s heads: how and why to fake it

With apologies to David Parnas and Paul Clements.

To truly understand how an incident unfolded, you need to experience the incident from the perspectives of the people who were directly involved in it: to see what they saw, think what they thought, and feel what they felt. Only then can you understand how they came to their conclusions and made their decisions.

The problem is that we can’t ever do that. We simply don’t have direct access to the minds of the people who were involved. We can try to get at some of this information: we can interview them as soon as possible after the incident and ask the kinds of questions that are most likely to elicit information about what they remember seeing, thinking, or feeling. But this account will always be inadequate: memories are fallible, interviewing time is finite, and we’ll never end up asking all of the right questions anyway.

Even though we can’t really capture the first-hand experiences of the people involved in the incident, I still think it’s a good idea to write the narrative as if we are able to do so. When I’m writing the narrative description, I try to write each section from the perspective of one person that was directly involved, describing things from that person’s point of view, rather than taking an omniscient third-person perspective.

The information in these first-hand accounts is based on my interviews with the people involved, and they review the accounts for accuracy, so it isn’t a complete fiction. But neither is it ever really the truth of what happened in the moment, because that information is forever inaccessible.

Instead, the value of this sort of first-hand narrative account is to force the reader to experience the incident from the perspectives of individuals involved. The only way to make sense of an incident is to try to understand the world as seen from the local perspectives of the individuals involved. Writing it up this way encourages the reader to see things this way. It’s a small lie that serves a greater truth.

Conveying confusion without confusing the reader

Confusion is a hallmark of a complex incident. In the moment, we know something is wrong, but we struggle to make sense of the different signals that we’re seeing. We don’t understand the underlying failure mode.

After the incident is over and the engineers have had a chance to dig into what happened, these confusing signals make sense in retrospect. We find out about the bug or inadvertent config change or unexpected data corruption that led to the symptoms we saw during the incident.

When writing up the narrative, the incident investigator must choose whether to inform the reader in advance about the details of the failure mode, or to withhold this info until the point in time in the narrative when the engineers involved understood what was happening.

I prefer the first approach: giving the reader information about the failure mode details in the narrative before the actors involved in the incident have that information. This enables the reader to make sense of the strange, anomalous signals in a way that the engineers in the moment were not able to.

I do this because, as a reader, I don’t enjoy the feeling of being confused: I’m not looking for a mystery when I read a writeup. If I’m reading about a series of confusing signals that engineers are looking at (e.g., traffic spikes, RPC errors), and I can’t make sense of them either, I tend to get bored. It’s just a mess of confusion.

On the other hand, if I know why these signals are happening, but the characters in the story don’t know, then that is more effective in creating tension in my mind. I want to read on to resolve the tension, to figure out how the engineers ended up diagnosing the problem.

When informing the reader about the failure mode in advance, the challenge is to avoid infecting the reader with hindsight bias. If the reader thinks, “the problem was obviously X. How could they not see it?”, then I’ve failed in the writeup. What I try to do is put the reader into the head of the people involved as much as possible: to try to convey the confusion they were experiencing in the moment, and the source of that confusion.

By enabling the reader to identify with the people involved, you can communicate to the reader how confusing the situation was to the people involved, without directly inflicting that same confusion upon them.

Climbing the mountain

When I was in high school, I attended a Jewish weekend retreat in the Laurentian Mountains of Quebec¹. While most of the attendees were secular Jews like me, one of them was a Chabadnik, and several of us got into a discussion about Judaism and scholarship.

One of the secular Jews lamented that it was an insurmountable task to properly understand Judaism: there were just too many texts you had to study. If we were lucky, we knew a little Hebrew, but certainly not enough to study the Hebrew texts (let alone the texts in other languages!).

The Chabadnik offered the following metaphor. Imagine a mountain, with an impossibly high peak. Studying Judaism is like climbing the mountain. People who have previously studied material will be higher up on the mountain than those who haven’t studied as much. However, regardless of your current elevation, you can always climb higher than where you are, by studying material appropriate for your level.

So it is with learning more about resilience engineering. Fortunately for those who seek to learn more about resilience, it’s a much younger field than Judaism. You need to contend with only decades of scholarship, rather than centuries. Still, being confronted with decades of research papers can be intimidating. But don’t let that stop you from trying to learn just a little bit more than you currently know.

I once heard Richard Cook say that the most effective way to get better at analyzing incidents was to first study how incidents happen in a field other than your own. Most of us will never have the opportunity to devote years of study to a different field! On the other hand, he also said that having a ten-to-fifteen-minute huddle after an incident to discuss what happened can also be a very effective learning mechanism.

You don’t need to read mountains of papers to start getting better at learning from incidents. It can be as simple as asking different kinds of questions in retrospectives (e.g., “When you saw the alert go off, what did you do next?”). One of the things I really like about resilience engineering is how it values expertise borne out of experience. I think you’ll learn more by trying out different questions to ask in incident retros than you will from reading the papers. (Although reading the papers will eventually help you ask better questions).

Diane Vaughan, a sociology researcher, spent six years studying a single incident! That’s a standard that none of us can hope to meet. And that means we won’t obtain the depth of insight that Vaughan was able to in her investigation, but that’s ok.

Don’t be intimidated by the height of the mountain. Don’t worry about reaching the top (there isn’t one), or even reaching a certain height. The important thing is to ascend: to work to climb higher than you currently are.

¹ I attended a Jewish elementary school, but a public high school. In high school, my parents encouraged me to attend these sorts of programs to maintain some semblance of Jewish identity.

Taking a risk versus running a risk

In the wake of an incident, we can often identify a risky action that was taken by an engineer that contributed to the incident. However, actions that look risky to us in retrospect didn’t necessarily look risky to the engineer who took the action in the moment. In the SINTEF A17034 report on Organizational Accidents and Resilient Organisations: Six Perspectives, the authors draw a distinction between taking a risk and running a risk.

When you take a risk, you are taking an action that you know to be risky. When an engineer says they are YOLO’ing a change, they’re taking a risk.

On the other hand, running a risk refers to taking a course of action that is not believed to be risky. These are the kinds of actions that we only categorize as risky in hindsight, when we have more information than the engineer who took the course of action in the moment.

Sometimes we deliberately take a risk because we believe there is greater risk if we don’t take action. But running a risk is never deliberate, because we didn’t know the risk was there in the first place.

Stories as a vehicle for learning from the experience of others

Senior software engineering positions command higher salaries than junior positions. The industry believes (correctly, I think) that engineers become more effective as they accumulate experience, and that perception is reflected in market salaries.

Learning from direct experience is powerful, but there’s a limit to the rate at which we can learn from our own experiences. Certainly, we learn more from some experiences than others; we joke about “ten years of experience” versus “one year of experience ten times over”, as well as using scars as a metaphor for these sometimes unpleasant but more impactful experiences. But there are only so many hours in a day, and we may not always be…errr… lucky enough to be exposed to many high-value learning opportunities.

There’s another resource we can draw on besides our own direct experience, and that’s the experiences of peers in our organization. Learning from the experiences of others isn’t as effective as learning directly from our own experience. But, if the organization you work in is large enough, then high-value learning opportunities are probably happening around you all of the time.

Given these opportunities abound, the challenge is: how can we learn effectively from the experiences of others? One way that humans learn from others is through telling stories.

Storytelling enables others to experience events by proxy. When we tell a story well, we run a simulation of the events in the mind of the listener. This kind of experience is not as effective as the first-hand kind, but when done well it still leaves an impression on the listener. In addition, storytelling scales very well: we can write stories down, or record them, and then publish them across the organization.

A second challenge is: what stories should we tell? It turns out that incidents make great stories. You’ll often hear engineers tell tales of incidents to each other. We sometimes call these war stories, horror stories (the term I prefer), or ghost stories.

Once we recognize the opportunity of using incidents as a mechanism for second-hand-experiential-learning-through-storytelling, this shifts our thinking about the role and structure of an incident writeup. We want to tell a story that captures the experiences of the people involved in the incident, so that the readers can imagine what it was like, in the moment, when the alerts were going off and confusion reigned.

When we want to use incidents for second-hand experiential learning, it shifts the focus of an incident investigation away from action items as the primary outcome and towards the narrative: the story we want to tell.

When we hire for senior positions, we don’t ask candidates to submit a list of action items for tasks that could improve our system. We believe the value of their experience lies in them being able to solve novel problems in the future. Similarly, I don’t think we should view incident investigations as being primarily about generating action items. If, instead, we view them as an opportunity to learn collectively from the experiences of individuals, then more of us will get better at solving novel problems in the future.