Useful knowledge and improvisation

Eric Dobbs recently retold a story on Twitter (a copy is on his wiki) about one of his former New Relic colleagues, Nicholas Valler.

At the time, Nicholas was new to the company. He had just discovered a security vulnerability, and then (unrelated to that security vulnerability), an incident happened and, well, I encourage you to read the whole story first, and then come back to this post.

In the end, the engineers were able to leverage the security vulnerability to help resolve the incident. As is my wont, I made a snarky comment.

But I did want to make a more serious comment about what this story illustrates. In a narrow sense, this security vulnerability helped the New Relic engineers remediate before there was severe impact. But in a broader sense, the following aspects helped them remediate:

  • they had useful knowledge of some aspect of the system (port 22 was open to the world)
  • they could leverage that knowledge to improvise a solution (they could use this security hole to log in and make changes to the Kafka configuration)

The irony here is that it was a new employee that had the useful knowledge. Typically, it’s the tenured engineers who have this sort of knowledge, as they’ve accumulated it with experience. In this case, the engineer discovered this knowledge right before it was needed. That’s what makes this such a great story!

I do think that how Nicholas found it, by “poking around”, is a behavior that comes with general experience, even though he didn’t have much experience at the company.

But being in possession of useful knowledge isn’t enough. You also need to be able to recognize when the knowledge is useful and bring it to bear.

These two attributes, having useful knowledge about the system and the ability to apply that knowledge to improvise a solution, are critical for dealing effectively with incidents. Applying them is resilience in action.

It’s not a focus of this particular story, but, in general, this sort of knowledge is distributed across individuals. This means that it’s the ad-hoc team that forms during an incident that needs to possess these attributes.

Remembering the important bits when you need them

I’m working my way through the Cambridge Handbook of Expertise and Expert Performance, which is a collection of essays from academic researchers who study expertise.

Chapter 6 discusses the ability of experts to recall information that’s relevant to the task at hand. This is one of the differences between experts and novices: a novice might answer questions about a subject correctly on a test, but when faced with a real problem that requires that knowledge, they aren’t able to retrieve it.

The researchers K. Anders Ericsson and Walter Kintsch had an interesting theory about how experts do better at this than novices. The theory goes like this: when an expert encounters some new bit of information, they have the ability to encode that information into their long-term memory in association with a collection of cues of when that information would be relevant.

In other words, experts are able to predict the context in which that information might be relevant in the future, and they use that contextual information as a kind of key to retrieve the information later on.

Now, think about reading an incident write-up. You might learn about a novel failure mode in some subsystem your company uses (say, a database), as well as the details that led up to it happening, including some of the weird, anomalous signals that were seen earlier on. If you have expertise in operations, you’ll encode information about the failure mode into your long-term memory and associate it with the symptoms. So, the next time you see those symptoms in production, you’ll remember this failure mode.

This will only work if the incident write-up has enough detail to provide you with the cues that you need to encode in your memory. This is another reason to provide a rich description of the incident: the people reading it, if they’re good at operations, will encode the details of the failure mode into their memory. If they have read the write-up and the failure happens again, they’ll remember.
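
As a loose analogy in code, the cue-based encoding described above resembles an index keyed by symptoms rather than by subsystem name. The sketch below is a toy model, not a claim about how memory works or about Ericsson and Kintsch’s theory in detail; the class name, the symptom strings, and the failure modes are all hypothetical and do not come from any real incident.

```python
from collections import defaultdict


class IncidentMemory:
    """Toy model: failure modes indexed by the symptoms (cues) seen with them."""

    def __init__(self):
        # Maps each cue to the set of failure modes it was encoded with.
        self._by_cue = defaultdict(set)

    def encode(self, failure_mode, cues):
        """Store a failure mode under every cue the write-up described."""
        for cue in cues:
            self._by_cue[cue].add(failure_mode)

    def recall(self, observed):
        """Return failure modes ranked by how many observed symptoms match."""
        scores = defaultdict(int)
        for cue in observed:
            for mode in self._by_cue.get(cue, ()):
                scores[mode] += 1
        # Most matching cues first; ties broken alphabetically.
        return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical write-ups, each contributing a failure mode and its symptoms.
memory = IncidentMemory()
memory.encode(
    "connection pool exhaustion",
    ["rising p99 latency", "idle CPU", "timeouts on writes"],
)
memory.encode(
    "disk full on primary",
    ["timeouts on writes", "replication lag"],
)

# Symptoms seen in production later on.
candidates = memory.recall(["idle CPU", "timeouts on writes"])
```

In this toy model, a write-up that names a failure mode but records no symptoms (`encode("some failure", [])`) can never be recalled, which mirrors the point about thin write-ups: without cues, there is nothing to retrieve by.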

Root cause of failure, root cause of success

Here are a couple of tweets from John Allspaw.

Succeeding at a project in an organization is like pushing a boulder up a hill that is too heavy for any single person to lift.

A team working together to successfully move a boulder to the top of the hill

It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.

Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.

The boulder made it to the top!

Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.

In comes the ball, down goes the boulder

This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.

In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete. Because what’s keeping the complex system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people that are constantly watching the boulder to see if it starts to slip, and applying force to keep it balanced.

A collection of people watching the boulder and pushing on it to keep it from falling

Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.

Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.

But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.
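
The redundancy argument, and its limits, can be illustrated with a small calculation. This is a deliberately crude sketch: it assumes the boulder falls only when every watcher falters at the same moment, and that watchers falter independently, which real systems do not guarantee. The function name and all of the probabilities are made up for illustration.

```python
def p_boulder_falls(p_falter: float, watchers: int) -> float:
    """Probability that every watcher falters at the same moment,
    assuming (unrealistically) that they falter independently."""
    return p_falter ** watchers


# One watcher who falters 10% of the time.
single = p_boulder_falls(0.10, 1)

# Five such watchers: all five must falter at once.
team = p_boulder_falls(0.10, 5)

# A shared stressor (rain, illness) raises everyone's chance of
# faltering on the same day, so the redundancy buys much less.
team_bad_day = p_boulder_falls(0.60, 5)
```

Under these made-up numbers, five watchers take the chance of a fall from 10% to roughly 1 in 100,000, while a bad day that affects everyone brings it back up to about 8%. The point is qualitative: redundancy absorbs individual lapses, and shared conditions erode it.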

When the boulder falls, it means that the collection of processes wasn’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keeps the boulder up.

This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and it failing alone couldn’t have brought down the system.

What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.

Burned by ‘let it burn’

Here are some excerpts from a story from the L.A. Times, with the headline: Forest Service changes ‘let it burn’ policy following criticism from western politicians (emphasis mine)

Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy this year to better prioritize resources and help prevent small blazes from growing into uncontrollable conflagrations.

The [Tamarack] fire began as a July 4 lightning strike on a single tree in the Mokelumne Wilderness, a rugged area southeast of Sacramento. Forest officials decided to monitor it rather than attempt to put it out, a decision a spokeswoman said was based on scant resources and the remote location. But the blaze continued to grow, eventually consuming nearly 69,000 acres, destroying homes and causing mass evacuations. It is now 82% contained.

Instead of letting some naturally caused small blazes burn, the agency’s priorities will shift this year, U.S. Forest Service Chief Randy Moore indicated to the staff in a letter Monday. The focus, he said, will be on firefighter and public safety.

The U.S. Forest Service had to make a call about whether to put out a fire or to monitor it and let it burn out. In this case, they decided to monitor it, and the fire grew out of control.

Now, imagine an alternate universe where the Forest Service spent some of its scant resources on putting out this fire, and then another fire popped up somewhere else, and they didn’t have the resources to fight that one effectively, and it went out of control. The news coverage would, undoubtedly, be equally unkind.

Practitioners often must make risk trade-offs in the moment, when there is a high amount of uncertainty. What was the risk that the fire would grow out of control? How does it stack up against the risk of being short staffed if you send out firefighters to put out a small fire and a large one breaks out elsewhere?

Towards the middle of the piece, the article goes into some detail about the issue of limited resources.

[Agriculture Secretary Tom] Vilsack promised more federal aid and cooperation for California’s plight, acknowledging concerns about past practices while also stressing that, with dozens of fires burning across the West and months to go in a prolonged fire season, there are not enough resources to put them all out.

“Candidly I think it’s fair to say, over the generations, over the decades, we have tried to do this job on the cheap,” Vilsack said. “We’ve tried to get by, a little bit here, a little bit there, a little forest management over here, a little fire suppression over here. But the reality is this has caught up with us, which is why we have an extraordinary number of catastrophic fires and why we have to significantly beef up our capacity.”

Vilsack said that the bipartisan infrastructure bill working its way through Congress would provide some of those resources but that ultimately it would take “billions” of dollars and years of catch-up to create fire-resilient forests.

The U.S. Forest Service’s policy on allowing unplanned wildfires to burn differs from that of the California Department of Forestry and Fire Protection, and I’m not a domain expert, so I don’t have an informed opinion. But this isn’t just a story about policy: it’s a story about saturation. It’s also about what’s allowed (and not allowed) to count as a cause.

Controlling a process we don’t understand

I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.

As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I actually think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work involves fields such as spaceflight, aviation, maritime, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.

We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.

Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.

Incident writeup as sociological storytelling

Back when Game of Thrones was ending, the sociology professor Zeynep Tufekci wrote an essay titled The Real Reason Fans Hate the Last Season of Game of Thrones. Up until the last season, Game of Thrones was told as a sociological story. Even though the show followed individual characters, the story wasn’t about those characters as individuals. Rather, it was a story about larger systems, such as society, norms, external events, and institutions, told through these characters. The sociological nature of the story was how the series maintained cohesion even though major characters died so often. In the last season, the showrunners switched to telling psychological stories, about the individual characters.

A couple of weeks ago, I wrote a blog post called Naming names in incident writeups. My former colleague Nora Jones expressed similar sentiments in her recent o11ycon keynote:

A good incident writeup is a sociological story about our system. Yes, there are individual engineers who were involved in the incident, but their role in the writeup is to serve as a narrative vehicle for telling that larger story. We care about those engineers (they are our colleagues!), but it’s the system that the story is about. As Tufekci puts it:

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices. “Yeah, I can see myself doing that under such circumstances” is a way into a broader, deeper understanding. It’s not just empathy: we of course empathize with victims and good people, not with evildoers.

But if we can better understand how and why characters make their choices, we can also think about how to structure our world that encourages better choices for everyone. The alternative is an often futile appeal to the better angels of our nature. It’s not that they don’t exist, but they exist along with baser and lesser motives. The question isn’t to identify the few angels but to make it easier for everyone to make the choices that, collectively, would lead us all to a better place.

Dealing with new kinds of trouble

The system is in trouble. Maybe a network link has gotten saturated, or a bad DNS configuration got pushed out. Maybe the mix of incoming requests suddenly changed and now there are a lot more heavy requests than light ones, and autoscaling isn’t helping. Perhaps a data feed got corrupted and there’s no easy way to bring the affected nodes back into a good state.

Whatever the specific details are, the system has encountered a situation that it wasn’t designed to handle. This is when the alerts go off and the human operators get involved. The operators work to reconfigure the system to get through the trouble. Perhaps they manually scale up a cluster that doesn’t scale automatically, or they recycle nodes, or make some configuration change or redirect traffic to relieve pressure from some aspect of the system.

If we think about the system in terms of the computer-y parts, the hardware and the software, then it’s clear that the system couldn’t handle this new failure mode. If it could, the humans wouldn’t have to get involved.

We can broaden our view of the system to also include the humans, sometimes known as the socio-technical system. In some cases, the socio-technical system is actually designed to handle cases that the software system alone can’t: these are the scenarios that we document in our runbooks. But, all too often, we encounter a completely novel failure mode. For the poor on-call, there’s no entry in the runbook that describes the steps to solve this problem.

In cases where the failure is completely novel, the human operators have to improvise: they have to figure out on the fly what to do, and then make the relevant operational changes to the system.

If the operators are effective, then even though the socio-technical system wasn’t designed to function properly in the face of this new kind of trouble, the people within the system make changes that result in the overall system functioning properly again.

It is this capability of a system, its ability to change itself when faced with a novel situation in order to deal effectively with that novelty, that David Woods calls graceful extensibility.

Here’s how Woods defines graceful extensibility in his paper: The Theory of Graceful Extensibility: Basic rules that govern adaptive systems:

Graceful extensibility is the opposite of brittleness, where brittleness is a sudden collapse or failure when events push the system up to and beyond its boundaries for handling changing disturbances and variations. As the opposite of brittleness, graceful extensibility is the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries.

This idea is a real conceptual leap for those of us in the software world, because we’re used to thinking about the system only as the software and the hardware. The idea of a system like that adapting to a novel failure mode is alien to us, because we can’t write software that does that. If we could, we wouldn’t need to staff on-call rotations.

We humans can adapt: we can change the system, both the technical bits (e.g., changing configuration) and the human bits (e.g., changing communication patterns during an incident, either who we talk to or the communication channel involved).

However, because we don’t think of ourselves as being part of the system, when we encounter a novel failure mode, and then the human operators step in and figure out how to recover, our response is typically, “the system could not handle this failure mode (and so humans had to step in)”.

In one sense, that assessment is true: the system wasn’t designed to handle this failure mode. But in another sense, when we expand our view of the system to include the people, an alternate response is, “the system encountered a novel failure mode and we figured out how to make operational changes to make the system healthy again.”

We hit the boundary of what our system could handle, and we adapted, and we gracefully extended that boundary to include this novel situation. Our system may not be able to deal with some new kind of trouble. But, if the system has graceful extensibility, then it can change itself when the new trouble happens so it can deal with the trouble.

Subverting the process

Recently, Salesforce released a public incident writeup for a service outage that happened in mid-May. There’s a lot of good stuff in here (DNS! A config change!), but I want to focus on one aspect of the writeup, a contributing factor described in the writeup as Subversion of the Emergency Break Fix (EBF) process.

Here are some excerpts from that section of the writeup (emphasis in the original):

An [Emergency Break Fix] is an unplanned and urgent change that is required to prevent or remediate a Severity-0, a Severity-1, or a Severity-2 incident… Non-urgent changes, i.e. those which do not require immediate attention, should not be deployed as EBFs.

In this situation, there was no active or imminent Severity-0, Severity-1 or Severity-2 incident, so the EBF process should not have been used, and standard Salesforce stagger processes should not have been ignored. 

By following an emergency process, this change avoided the extensive review scrutiny that would have occurred had it been made as a standard change under the Salesforce Change Traffic Control (CTC) process. … In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

“What was the engineer thinking?” a reader wonders. I certainly did. People make decisions for reasons that make sense to them. I have no idea what the engineer’s reasoning was here, because there’s not even a hint of that reasoning alluded to here.

Is this process commonly circumvented by engineers for some reason? (i.e., was this situation actually more common than the writeup lets on?) Alternately, was the engineer facing atypical time pressure? If so, what was the nature of the time pressure?

One of the functions of public writeups is to give customers confidence in the organization’s ability to deal with future incidents. This section had the opposite effect: it filled me with dread. It communicates to me that the organization is not interested in understanding how actual work is done.

Woe be it to the next engineer caught in the double bind where there will be consequences if they don’t work quickly enough and there will be consequences if they don’t conform to a process that slows them down so much that they can’t get their work done quickly enough.

Naming names in incident writeups

In a recent Twitter thread, Alex Hidalgo from Nobl9 made the following observation about his incident reports:

I take the opposite approach: I never write any of my reports anonymously. Instead, I explicitly specify the names of all of the people involved. I wanted to write a post on why I do that.

I understand the motivation for providing anonymity. We feel guilt and shame when our changes contribute to an incident. The safety literature refers to this as the second victim phenomenon. We don’t write down an engineer’s name in a report because we don’t want to exacerbate the second victim effect. Also, the incident is about the system, not the particular engineer.

The reason I take the opposite approach of naming names is that I want to normalize the fact that incidents are aspects of the system, not the individuals. I feel like providing anonymity implicitly sends the signal that “the names are omitted to protect the guilty.”

My strategy in doing these writeups is to lean as heavily as I can into demonstrating to the reader that all actions taken by the engineers involved were reasonable in the moment. I want them to read the writeup and think, “This could have been me!” I want to try to get the organization to a point where there is no shame in contributing to an incident: it’s an inevitable aspect of the work that we do.

In order to do this well, I try to write these up as much as possible from the perspective of the people involved. I find it really helps make the writeups look less judge-y (“normative”, in the jargon) by telling the story from the perspective of the individual, and calling attention to the systemic aspects.

And so, while I think Alex and I are both trying to get to the same place, we’re taking different routes.

Incident analysis as guerrilla case study research

Today I tweeted this:

To which Sasha Rosenbaum asked:

This post is my response.

We seldom have time for introspection at work. If we’re lucky, we have the opportunity to do some kind of retrospective at the end of a project or sprint. But, generally speaking, we’re too busy working to spend time examining that work.

One exception to this is incidents: organizations are willing to spend effort on introspection after an incident happens. That’s because incidents are unsettling: people feel uneasy that the system failed in a way they didn’t expect.

And so, an organization is willing to spend precious engineering cycles in order to rid itself of the uneasy feeling that comes with a system failing unexpectedly. Let’s make sure this never happens again.

Incident analysis, in the learning from incidents in software (LFI) sense, is about using an incident as an opportunity to get a better understanding of how the overall system works. It’s a kind of case study, where the case is the incident. The incident acts as a jumping-off point for an analyst to study an aspect of the system. Just like any other case study, it involves collecting and synthesizing data from multiple sources (e.g., interviews, chat transcripts, metrics, code commits).

I call it a guerrilla case study because, from the organization’s perspective, the goal is really to get closure, to have a sense that all is right with the world. People want to get to a place where the failure mode is now well-understood and measures will be put in place to prevent it from happening again. As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Ideally, organizations would recognize the value of this sort of work, and would make it explicit that the goal of incident analysis is to learn as much as possible. They’d also invest in other types of studies that look into how the overall system works. Alas, that isn’t the world we live in, so we have to sneak this sort of work in where we can.