Incident writeup as sociological storytelling

Back when Game of Thrones was ending, the sociology professor Zeynep Tufekci wrote an essay titled The Real Reason Fans Hate the Last Season of Game of Thrones. Up until the last season, Game of Thrones was told as a sociological story. Even though the show followed individual characters, the story wasn’t about those characters as individuals. Rather, it was a story about larger systems, such as society, norms, external events, and institutions, told through these characters. The sociological nature of the story was how the series maintained cohesion even though major characters died so often. In the last season, the showrunners switched to telling psychological stories, about the individual characters.

A couple of weeks ago, I wrote a blog post called Naming names in incident writeups. My former colleague Nora Jones expressed similar sentiments in her recent o11ycon keynote:

"Blamelessness: a lot of orgs think it's about being nice and not naming names. That's really not the case. Instead, it's about making it a safe enough space to come forward with information and _to_ name names." —@nora_js #o11ycon
— Liz Fong-Jones (方禮真) (@lizthegrey) June 9, 2021

A good incident writeup is a sociological story about our system. Yes, there are individual engineers who were involved in the incident, but their role in the writeup is to serve as a narrative vehicle for telling that larger story. We care about those engineers (they are our colleagues!), but it’s the system that the story is about. As Tufekci puts it:

The hallmark of sociological storytelling is if it can encourage us to put ourselves in the place of any character, not just the main hero/heroine, and imagine ourselves making similar choices. “Yeah, I can see myself doing that under such circumstances” is a way into a broader, deeper understanding. It’s not just empathy: we of course empathize with victims and good people, not with evildoers.
But if we can better understand how and why characters make their choices, we can also think about how to structure our world that encourages better choices for everyone. The alternative is an often futile appeal to the better angels of our nature. It’s not that they don’t exist, but they exist along with baser and lesser motives. The question isn’t to identify the few angels but to make it easier for everyone to make the choices that, collectively, would lead us all to a better place.

Grappling with contingency

For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the message was lost.
For want of a message the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.

Contingency is the idea that history could easily have turned out completely different, if only certain minor events had happened differently. If there was a zig somewhere instead of a zag, maybe the election would have gone the other way, or the outcome of the revolution would have been different.

In terms of the work we do, contingency means that the success of projects, or the length of an incident, may vary dramatically based on happenstance. Maybe someone happened to be out sick one day and missed a critical meeting, and so didn’t have a certain important bit of information or wasn’t able to give feedback on a design. Maybe someone on the team happened to have prior experience with just the sort of problem that they are all grappling with.

When we look back on successes and failures, they feel inevitable somehow, like there was an inexorable set of forces pushing in the direction that led to the success or failure. You can see that in incident retrospectives in particular, as people search for the cause, the essential reason this happened.

We’re uncomfortable with contingency, preferring essentialism. That’s why we so often commit the fundamental attribution error: I snapped at you because I missed lunch, which put me in a grouchy mood; you snapped at me because you’re a hot-tempered jerk.

So, while we do have influence over outcomes, much depends on, well, chance. The difference between success and failure might hinge on the occurrence of a random hallway conversation where we pick up an extra bit of context, or whether our kid has a fever on a particular day and we need to take them to see the doctor.

The disaster meeting

This post is mostly an excerpt from the book Designing Engineers by Louis Bucciarelli. This book describes Bucciarelli’s observational study of engineers doing design work at three different engineering companies.

At some point, I’ll write a proper review of the book, but I wanted to highlight a specific passage, a meeting among engineers working to solve a specific problem.

The engineers attending this meeting work at a company that sells photograph processing machines. The company is planning on releasing a new product (“Atlas”) in a few months, but there’s a problem with the design: a phenomenon that they call dropout. Dropout happens when parts of the image that are barely visible end up not getting printed on the paper. The problem can be hard to notice unless someone looks very closely at the photo, but it’s enough of an issue that they are putting resources into solving it.

This meeting is being led by Sergio, the engineer leading the effort to solve the dropout problem. Before this meeting, he identified fourteen potential solutions to the dropout problem. He’s called this engineering meeting in order to apply a structured decision-making process (the Pugh Method) to help him narrow down this list to the most promising-sounding solutions.

The meeting does not go as the organizer hoped. The transcript is long-ish, but worth reading in full. You might even find it familiar.

Sergio: OK. Let’s start. You all got this. [He holds up a description of Pugh methodology]. I sent it around last Thursday. It pretty much says what we’re going to try to do, except I’m going to make a few changes. You’ll see as we go along. The basic idea today is that we want to first set up some criteria to judge. Then we compare how the fourteen go, compare them against these criteria. By the end of this morning I’d like to have narrowed things down, not to one option, but to three, say, something we can get going on. Yeah, Harold.

Harold: It says in this method that we ought to pick a baseline option to compare against. How are we going to do that? It seems to me any one of the fourteen would be as good or bad, for that matter, as any of the others.

Sergio: I thought about that, and here is what I propose. Let’s pick the option we know best, OK? Say the QWP. We know how that works, and other than that it probably won’t fit in the space we have to play with, it still can be our reference. But first we have to set up some criteria. So, let me get this chart around here.

Hans: Obviously we need a criterion, something like “Gets the job done” or “Eliminates dropout.”

Sergio: Yeah, that’s got to be one. The thing has got to work, to solve the problem. How did you state it?

Marco: What do we mean when we go and claim that, say, the QWP eliminates the dropout? I mean, all of those up there have a chance of doing the job.

Sergio: I know. But we score, not with numbers but say three, four marks—better than the baseline, say the QWP. This is where the baseline comes in. Second would be neutral—no better, no worse than the QWP—and third would be negative; that is, we think it won’t be as good as what we know works now.

Marco: Yeah, but some of these options I think might work as good, even better on some papers but probably won’t work at all on others. How do you grade it then?

Sergio: What do you mean? Give me a more specific example.

Marco: I mean like with the air knife. It might work with Z-weight paper, but with the heavier M-weight I don’t think it will work.

Hans: Why not make that another criterion: “Works with all papers.”

Sergio: Or “Sensitivity to paper.” Sort of pull that out from under “Does the job.”

Marco: You mean that there are some options that will do the job, but some of those won’t be able to handle the heavy paper?

Sergio: Yeah, that’s one way to look at it. “Does the job” is our best guess that the thing will work, but we give paper type a separate category. We may want to say something else has to be done to handle the heavy paper; that becomes another problem.

Fritz: How do we know whether paper type is critical for the air knife? It seems to me we don’t really know what the problem is. How can we compare options when we don’t know what is causing the problem?

Marco: Fritz, that’s a good point. Do we really know enough to—

Sergio: We know we have dropout on Atlas. We know that the QWP gives good results. We have a pretty good idea of what consistency it takes to give good print—print that a trained eye can’t find a hole in. (With a magnifying glass, you still see some.)

Fritz: Yes, but we can know, and should know, a lot more before we go judging these proposals on whether or not they will solve the problem. If this place hadn’t cut back on its chemistry research, we might have a chance of knowing what the hell is going on, not just with Atlas but we had it on Mars as well.

Sergio: Look, some things are beyond our control. We have no power over the powers-that-be. We don’t have a chemistry group working on this problem to call up and say “Get over here and help us evaluate these options.” We’ve got to go with what we have. Atlas is due to go out onto the streets in seven months.

Fritz: That’s the way it always goes around here. Someone wants your solutions yesterday.

Sergio: OK. So we have “Eliminates dropout” and “Sensitivity to paper.” What are some others?

Hans: Cost.

Marco: Have you guys thought about some kind of chemical pretreatment… different papers?

Sergio: Cost. Let’s think about that. Is cost really that important? Leonard says he doesn’t see cost as really significant unless it really is some huge sum. But I don’t see how we will ever get to that point. And Atlas—

Harold: Yeah, I don’t see how unit cost can be that great. We’re not going to be able to fool around much inside Atlas at this late date.

Marco: We ought to think about what we can do without going inside.

Hans: On the other hand, if we do convince them that they have to move the paper feed, say, it is going to get costly.

Harold: In terms of engineering change but not in terms of unit costs. We still aren’t going to go in there with some exotic machinery. All those options, except maybe the E&M device, are just bending metal, cams, gears… mechanical stuff, nothing fancy.

George: We might have a problem holding tolerances. Machining can get expensive. We ask too much of my people, even with the mechanical parts.

Sergio: Maybe we make that another category, another criterion: “Engineering change,” “Extent of engineering change.”

Harold: What you really want to say is something like “Compatible with existing product.” Like the QWP we know will work fine. It does in Mars, but we know it will be extremely hard to fit in Atlas, so… Or the E&M that’s going to require a power supply, right?

Fritz: But the QWP is our reference. That’s not a good example. And, for that matter, what good is the criterion if we know the QWP won’t fit? If that’s the case, won’t all the options be scored a plus, all the same?

Sergio: Good point, good point. But I see some that will be just as hard to retrofit—for example, the cam with a solenoid. Solenoids aren’t any miniature electronic device. They’ve got to have room, especially with the forces and reaction times we’re going to be demanding.

Hans: And the air knife requires a plenum, or the E&M—Marco, was it you who said they will need a power supply?

Sergio: Fritz, you have a good point, but let’s put it up there for now. There won’t be maybe any negatives there, but still… OK? How did you say it?

Harold: “Compatible with existing product” or maybe we ought to say “products,” with Leonard in mind.

Sergio: Yeah, got it.

Fritz: That brings up another thing. Who are we making this design for? Leonard out in Colorado and Atlas are not in sync. Atlas is well along, they’re getting into the panic mode now. But Leonard has more time, another year at least, right?

Sergio: I spoke to Leonard yesterday, and even though he has another year past Atlas, he wants to see a solution to what he thinks is his dropout problem well before that. He doesn’t want to go the panic route.

Fritz: But we still have more time with him. And shouldn’t we be thinking about the long term?

Sergio: We can’t afford to do too much of that. I’ve got the higher-ups breathing down my neck to get something going here. That makes me think of another criterion: How well can we meet a schedule? Let’s say “Ease of schedule.”

George: How about “Pain and suffering”? [Laughter]

Sergio: No, we want to be positive about this.

Marco: Yeah, so we can mark them down. [Laughter]

Fritz: That’s why we chose the QWP as a baseline. He knows that can’t possibly fit here.

Sergio: Come on guys. That’s not true. Let’s get serious. We want to get out of here by lunchtime. Jeez, is it already 10:30?

Hans: I’ve got 10:40.

Sergio: OK. So far we’ve got—

Harold: I think we’re missing a big one. You all know how difficult it is to keep the QWP clean. Anything mechanical you add in there is going to collect sludge. Some of those, like the cam, are going to have a real problem there with that—keeping clean.

Sergio: Good. That’s another good one. The guys in Service are not going to like it if they get called out every week.

Marco: Does that figure into the cost, the cost of servicing? Do we need a separate category?

Sergio: I think we ought to break that one out, just like we did with the paper. That’s something we are liable not to think of—what it takes to maintain the fix in the field. So let’s add—

Fritz: We don’t even know if it will work.

Sergio: We got some interesting results yesterday with a mock-up. I think it looks promising.

Fritz: But still, it’s got a long way to go. That’s what I mean. We don’t really know if it will work, and I, at least, can’t make a good judgment even though you may be able to, because I don’t think we understand enough about the problem!

Marco: I’m with Fritz on that. I don’t think we have enough information about these different options. I’m finding it hard to do this method, and I think the reason is because we don’t really understand the problem.

Sergio: How much do we need to know? I admit that the E&M is a long shot, that we’ve got to get it going, that it will take a longer time to evaluate than, say, the cam concepts, and we’ve been promised a machine for next week. When we get the hardware, we can do both, evaluate the E&Ms and, in the process, get a firmer grip on what is the problem. But we don’t have all year. Jeez, it’s 11:00. We don’t have all morning either. And besides, this is just an exercise; we are not going to pick a definite option and go with that. We only want to narrow the field some this morning. Then we give it a hard look again, after we’ve done some work on the three, come back at it and evaluate again. In fact, I can see us running pretty far with, say, two or three options in parallel, as long as they don’t interfere. Maybe that’s another thing to consider.

Hans: Seeing what time it is, maybe we better cut off our criteria here. Serge, I think we better get to ranking.

Sergio: OK, OK. So far we’ve got “Does the job,” “Sensitive to paper,” “Cost,” “Compatible with existing hardware,” “Ease of schedule,” “Ease of maintenance.” Anyone think of any more?

George: How about “Ease of production?”

Marco: That’s in cost. I see that as a main factor in cost.

Fritz: Look, I think we have a problem with these criteria. I’m having a hell of a time keeping them straight, trying to fix what they might mean. Are they all to be considered as having the same priority? I still think this exercise is not useful unless we know more about what we have to do, what the problem is.

Marco: I think even then these criteria would get all mixed up. When we say “Do the job” I see costs, sludge all in that, too.

Sergio: We are always going to have that problem. Where we are now, we’ve got to move. All I want is to get us narrowed down.

Fritz: But you yourself think PT’s additional option is worth keeping. I don’t think we’re ready.

Sergio: It’s getting late. We’re not going to get there today. That’s clear. I’ll tell you what. Can we meet again? [Grunts, groans]

Sergio: No, I promise you. In the meantime, Hans and I will go back and sort out these criteria, try to explain what we see as what they are meant to measure. At least in that way we will start on the same wavelength. I will send you that before we get together. Then we will narrow.

Marco: When? I’ve got to go out to Colorado next week for two days. Can you take that into account?

Fritz: And I’m tied up in the lab the early part of the week.

George: We’ve got a production trial scheduled sometime.

Sergio: Look, I’ll have Cheryl survey, but it might have to go another week. I’ve got to get out and back to Colorado myself sometime next week. OK? Is that it? That’s enough!

(pp. 152–156)

The Pugh technique is an appealing model in principle, but we see problems crop up as Sergio tries to apply it: the engineers work to define criteria, but the categories are slippy. They have different opinions about how to cut up the space into categories, and whether they have enough information to even evaluate these criteria.
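For readers who haven't seen the Pugh Method, here is a minimal sketch of the bookkeeping Sergio was hoping to get to. The criteria are the ones he reads back near the end of the meeting, and the QWP is the baseline. The three options are ones mentioned in the transcript, but the marks are invented purely for illustration, since the meeting never got as far as scoring anything. The function names are hypothetical.

```python
# Criteria as Sergio reads them back near the end of the meeting.
CRITERIA = [
    "Does the job",
    "Sensitive to paper",
    "Cost",
    "Compatible with existing hardware",
    "Ease of schedule",
    "Ease of maintenance",
]

# Marks relative to the baseline (QWP): +1 better, 0 same, -1 worse.
# These marks are invented for illustration; the meeting never produced any.
MARKS = {
    "air knife": [0, -1, 0, 1, 0, 1],
    "cam with solenoid": [0, 0, 0, 0, 1, -1],
    "E&M device": [0, 0, -1, 1, -1, 0],
}


def pugh_totals(marks):
    """Count pluses and minuses for each option and compute the net score."""
    totals = {}
    for option, row in marks.items():
        if len(row) != len(CRITERIA):
            raise ValueError(f"{option}: expected one mark per criterion")
        totals[option] = {
            "plus": sum(1 for m in row if m > 0),
            "minus": sum(1 for m in row if m < 0),
            "net": sum(row),
        }
    return totals


def shortlist(marks, keep):
    """Return the `keep` options with the highest net score."""
    totals = pugh_totals(marks)
    ranked = sorted(totals, key=lambda option: totals[option]["net"], reverse=True)
    return ranked[:keep]
```

The arithmetic is trivial. Everything the engineers argue about in the transcript happens before a single cell of `MARKS` can be filled in: what a criterion means, whether the criteria overlap, and whether anyone knows enough to assign a mark at all.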

Note how well defined the problem seems to be on first glance. It’s a specific problem (dropout) on a system that otherwise has been fully designed. Not only that, but potential solutions have already been identified! Sergio’s goal is just to narrow down the solution space so that they can explore three options instead of fourteen.

Instead of a structured process, we see a much messier interaction, one that ultimately frustrates Sergio, who used the phrase “the disaster meeting” to describe what happened. What we observe, though, is a kind of progress: a group of engineers who have different understandings of the situation trying to establish common ground, building a shared understanding so that they can work together to accomplish this task. Real engineering work is messy.

Transgressing the boundaries: Rasmussen and Woods

(With apologies to Alan Sokal)

Boundary according to Rasmussen

Jens Rasmussen was a giant in the field of safety science research. You can still see his influence on the field, in the writings of safety researchers such as Sidney Dekker, Nancy Leveson, and David Woods.

One of Rasmussen’s most famous papers is Risk management in a dynamic society: a modelling problem. In that paper, Rasmussen proposed a model of system safety illustrated by the following diagram:

Reproduction of Fig. 3. The original caption reads: Under the presence of strong gradients behaviour will very likely migrate toward the boundary of acceptable performance

This model looks like it views the state of the system as a point in a state space. But, Rasmussen described it as a model of the humans working within the system. He used the term “work space” rather than “state space”. In addition, Rasmussen used the metaphor of a gas particle undergoing local random movements, a phenomenon known as Brownian motion.

Along with the random movements, Rasmussen envisioned different forces (he called them gradients) that influenced how the work system would move within the work space. One of these forces was pressure from management to get more work done in order to make the company more profitable. Woods refers to this phenomenon as “faster/better/cheaper pressure”. This is the arrow labeled Management Pressure toward Efficiency, which pushes away from the Boundary to Economic Failure.

One way to get more work done is to give people increasing loads of work. But people don’t like having more and more work piled on them, and so there is opposing pressure from the workforce to reduce the amount of work they have to do. This is the arrow labeled Gradient toward Least Effort which pushes away from the Boundary to Unacceptable Work Load.

The result of those two pressures is movement towards what the diagram labels “the Boundary of functionally acceptable performance”. This is the safety boundary, and we don’t know exactly where it is, which is why there’s a second boundary in the diagram labelled “Resulting perceived boundary of acceptable performance.” Accidents happen when we cross the safety boundary.
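The combination of random local movement and a steady gradient can be illustrated with a toy simulation. This is only a sketch under heavy simplifying assumptions, not Rasmussen's model: it collapses the work space to a single number (the remaining margin to the safety boundary), lumps both gradients into one `drift` term, and uses a cap on the margin as a stand-in for the other two boundaries pushing back. All names and parameters are hypothetical.

```python
import random


def steps_until_crossing(start_margin, drift, noise, max_steps, seed=0):
    """Simulate a 1-D random walk with drift toward the safety boundary.

    Returns the step at which the margin reaches zero, or None if the
    boundary is not crossed within max_steps.
    """
    rng = random.Random(seed)
    margin = start_margin
    for step in range(1, max_steps + 1):
        # Each step combines the steady gradient with a random local movement.
        margin -= drift + rng.gauss(0.0, noise)
        # Stand-in for the other boundaries: the margin cannot grow
        # beyond where it started.
        margin = min(margin, start_margin)
        if margin <= 0:
            return step
    return None
```

With `noise` set to zero the walk is deterministic: a margin of 1.0 and a drift of 0.25 crosses the boundary on step 4. With both drift and noise, the crossing step varies from run to run, but the drift makes a crossing a question of when rather than whether.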

Boundary according to Woods

In David Woods’s work, he also writes about the role of boundaries in system safety, but despite this surface similarity, his model isn’t the same as Rasmussen’s.

Instead of a work space, Woods refers to an envelope. He uses terms like competence envelope or design envelope or envelope of performance. Woods has done safety research in aviation, and so I suspect he was influenced by the concept of a flight envelope in aircraft design.

Diagram captioned *Altitude envelope* from the Wikipedia flight envelope page

The flight envelope defines a region in a state space that the aircraft is designed to function properly within. You can see in the diagram above that the envelope’s boundaries are defined by the stall speed, top speed, and maximum altitude. Bad things happen if you try to operate an aircraft outside of the envelope (hence the phrase pushing the envelope).

Woods’s competence envelope is a generalization of the idea of flight envelope to other types of systems. Any system has a range of inputs that it can handle: if you go outside that range, bad things happen.

Summarizing the differences

To Rasmussen, there is only one boundary in the work space related to accidents: the safety boundary. The other boundaries in the space generally aren’t even reachable, because of the natural pressure away from them. To Woods, the competence envelope is defined by multiple boundaries, and crossing any of them can result in an accident.

Both Rasmussen and Woods identified the role of faster/better/cheaper pressure in accidents. To Rasmussen, this pressure resulted in pushing the system to the safety boundary. But to Woods, this pressure changes the behavior at the boundary. Woods sees this pressure as contributing to brittleness, to systems that don’t perform well as they get close to the boundary of the performance envelope. Woods’s current work focuses on how systems can avoid being brittle by having the ability to move the boundary as they get closer to it: expanding the competence envelope. He calls this graceful extensibility.

Dealing with new kinds of trouble

The system is in trouble. Maybe a network link has gotten saturated, or a bad DNS configuration got pushed out. Maybe the mix of incoming requests suddenly changed and now there are a lot more heavy requests than light ones, and autoscaling isn’t helping. Perhaps a data feed got corrupted and there’s no easy way to bring the affected nodes back into a good state.

Whatever the specific details are, the system has encountered a situation that it wasn’t designed to handle. This is when the alerts go off and the human operators get involved. The operators work to reconfigure the system to get through the trouble. Perhaps they manually scale up a cluster that doesn’t scale automatically, or they recycle nodes, or make some configuration change or redirect traffic to relieve pressure from some aspect of the system.

If we think about the system in terms of the computer-y parts, the hardware and the software, then it’s clear that the system couldn’t handle this new failure mode. If it could, the humans wouldn’t have to get involved.

We can broaden our view of the system to also include the humans, sometimes known as the socio-technical system. In some cases, the socio-technical system is actually designed to handle cases that the software system alone can’t: these are the scenarios that we document in our runbooks. But, all too often, we encounter a completely novel failure mode. For the poor on-call, there’s no entry in the runbook that describes the steps to solve this problem.

In cases where the failure is completely novel, the human operators have to improvise: they have to figure out on the fly what to do, and then make the relevant operational changes to the system.

If the operators are effective, then even though the socio-technical system wasn’t designed to function properly in the face of this new kind of trouble, the people within the system make changes that result in the overall system functioning properly again.

It is this capability of a system, its ability to change itself when faced with a novel situation in order to deal effectively with that novelty, that David Woods calls graceful extensibility.

Here’s how Woods defines graceful extensibility in his paper The Theory of Graceful Extensibility: Basic rules that govern adaptive systems:

Graceful extensibility is the opposite of brittleness, where brittleness is a sudden collapse or failure when events push the system up to and beyond its boundaries for handling changing disturbances and variations. As the opposite of brittleness, graceful extensibility is the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries.

This idea is a real conceptual leap for those of us in the software world, because we’re used to thinking about the system only as the software and the hardware. The idea of a system like that adapting to a novel failure mode is alien to us, because we can’t write software that does that. If we could, we wouldn’t need to staff on-call rotations.

We humans can adapt: we can change the system, both the technical bits (e.g., changing configuration) and the human bits (e.g., changing communication patterns during an incident, either who we talk to or the communication channel involved).

However, because we don’t think of ourselves as being part of the system, when we encounter a novel failure mode, and then the human operators step in and figure out how to recover, our response is typically, “the system could not handle this failure mode (and so humans had to step in)”.

In one sense, that assessment is true: the system wasn’t designed to handle this failure mode. But in another sense, when we expand our view of the system to include the people, an alternate response is, “the system encountered a novel failure mode and we figured out how to make operational changes to make the system healthy again.”

We hit the boundary of what our system could handle, and we adapted, and we gracefully extended that boundary to include this novel situation. Our system may not be able to deal with some new kind of trouble. But, if the system has graceful extensibility, then it can change itself when the new trouble happens so it can deal with the trouble.

Objectives and constraints

Two leading thinkers of management in the twentieth century were Peter Drucker and W. Edwards Deming. Drucker developed the idea of management by objective that would eventually evolve into OKRs. In this approach, effective managers identify goals that can be operationalized (that’s the objective), identify metrics to measure to determine if progress is being made towards the goals (those are the key results), and then set targets for the metrics.

Deming was vehemently opposed to management by objective. Rather, he saw an organization as a system. If you wanted to improve the output of a system, you had to study it to figure out what the limiting factor was. Only once you understood the constraints that limited your system, could you address them by changing the system.

In the tech world, Drucker has clearly won out. His legacy can be seen in the adoption of OKRs by many tech companies (most famously, Intel and Google).

I’m in Deming’s camp, but I can understand why Drucker won. Drucker’s approach is much easier to put into practice than Deming’s. Specifically, Drucker gave managers an explicit process they could follow. On the other hand, Deming…, well, here’s a quote from Deming’s book Out of the Crisis:

Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.

I can see why a manager reading this might be frustrated with his exhortation to replace a specific process with “leadership”. But understanding a complex system is hard work, and there’s no process that can substitute for that. If you don’t understand the constraints that limit your system, how will you ever address them?

Why do config changes keep coming up in major incidents?

Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”

Hypothesis: config changes are more dangerous than code changes.
— Lorin Hochstein (@norootcause) October 6, 2017

Me, a few years ago, making a similar observation

I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise in thinking out loud about possible explanations for this phenomenon.

It’s an illusion

It might be that config changes aren’t actually more dangerous; it just seems like they are. Perhaps we only notice the writeups where a config change is mentioned, but we don’t remember the writeups that don’t involve a config change. Or perhaps it’s a base rate illusion, where config changes tend to be involved in incidents more often than code changes simply because config changes are more common than code changes.
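To make the base rate version of this hypothesis concrete, here is a small worked example. The numbers are hypothetical and chosen only to show the arithmetic: nine config changes for every code change, with exactly the same risk per change for both kinds.

```python
def incident_share(changes, incidents_per_thousand):
    """Return each kind of change's expected share of all incidents."""
    expected = {
        kind: count * incidents_per_thousand[kind] / 1000
        for kind, count in changes.items()
    }
    total = sum(expected.values())
    return {kind: value / total for kind, value in expected.items()}


# Hypothetical numbers: nine config changes for every code change,
# with exactly the same risk per change for both kinds.
changes = {"config": 900, "code": 100}
same_risk = {"config": 10, "code": 10}
shares = incident_share(changes, same_risk)
```

Under these assumptions config changes account for about 90% of incidents even though an individual config change is no riskier than an individual code change. Telling the two explanations apart would require knowing how many changes of each kind were made, which incident writeups rarely report.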

I don’t believe this hypothesis: I think the config change effect is a real one.

Config changes as second-class

In the recent Salesforce incident, the writeup noted that:

For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.

If an organization has the ability to stage their changes across different domains, I’d wager heavily that they supported staged code deployments before they supported staged configuration changes. That’s certainly true at Netflix, where Spinnaker had support for regional rollout of code changes well before it had support for regional rollout of config changes.

This one feels like a real contributor to me. I’ve found that deployment tooling tends to support code changes better than config changes: there’s just more engineering effort put into making code changes safer.
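As a rough sketch of what staggering means mechanically, the function below applies a change one stage at a time and stops at the first stage that fails a health check. It is not modeled on Spinnaker or on Salesforce's tooling. The stage names and the `apply_change` and `is_healthy` callables are hypothetical placeholders for whatever a real pipeline would provide.

```python
def staged_rollout(stages, apply_change, is_healthy):
    """Apply a change stage by stage, halting at the first unhealthy stage.

    Returns (completed_stages, halted_stage). halted_stage is None when
    every stage succeeded.
    """
    completed = []
    for stage in stages:
        apply_change(stage)
        if not is_healthy(stage):
            # Stop here so later stages never receive the change.
            return completed, stage
        completed.append(stage)
    return completed, None


# Usage with stand-in callables: the change breaks "region-a".
touched = []
completed, halted = staged_rollout(
    ["canary", "region-a", "region-b"],
    apply_change=touched.append,
    is_healthy=lambda stage: stage != "region-a",
)
```

Nothing in the loop cares whether the change being applied is code or config. The difference is in whether anyone has built the equivalent of `apply_change` and `is_healthy` for config, or whether, as in the Salesforce writeup, the stagger is still a manual process.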

Config changes are hard to stage

In the case of the Salesforce incident, the configuration change could theoretically have been staged. However, it may be that configuration changes by their nature are harder to roll out in a staged fashion. Configuration is more likely to be inherently global than code.

I’m really not sure about this one. I have no sense as to how many config changes can be staged.

Config changes are hard to test

Have you ever written a unit test for a configuration value? I haven’t. It might be that config-change related problems only manifest when deployed into a production environment, so you couldn’t catch them at a smaller scope like a unit test.

I suspect this hypothesis plays a significant role as well.
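A unit test for a configuration value could look roughly like the sketch below. The config keys (`timeout_ms`, `dns_servers`) and the allowed range are made up for illustration. The only library call used is `ipaddress.ip_address` from the Python standard library, which raises `ValueError` for a string that is not a valid IP address.

```python
import ipaddress


def validate_config(config):
    """Return a list of human-readable problems found in a config dict."""
    errors = []

    # Hypothetical rule: the timeout must be an integer from 1 to 60000 ms.
    timeout = config.get("timeout_ms")
    if (
        not isinstance(timeout, int)
        or isinstance(timeout, bool)
        or not 1 <= timeout <= 60_000
    ):
        errors.append("timeout_ms must be an integer between 1 and 60000")

    # Hypothetical rule: at least one DNS server, each a valid IP address.
    servers = config.get("dns_servers", [])
    if not servers:
        errors.append("dns_servers must not be empty")
    for server in servers:
        try:
            ipaddress.ip_address(server)
        except ValueError:
            errors.append(f"invalid dns server address: {server}")

    return errors
```

A check like this catches malformed values, but it says nothing about whether a well-formed value is the right one for production. A syntactically valid DNS server address that points at the wrong host passes every assertion here, which is consistent with the hypothesis that many config problems only show up once deployed.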

Mature systems are more config-driven

Perhaps the sort of systems that are involved in large-scale outages at big tech companies are the more mature, reliable systems. These are the types of software that have evolved over time to enable operators to control more of their behavior by specifying policy in configuration.

This means that an operator is more likely to be able to achieve a desired behavior change via config versus code. And that sounds like a good thing. We all know that hard-coding things is bad, and changing code is dangerous. In the limit, we wouldn’t have to make any code changes at all to achieve the desired system behavior.

So, perhaps the fact that config changes are more commonly implicated in large-scale outages is a sign of the maturity of the systems?

I have no idea about this one. It seems like a clever hypothesis, but perhaps it’s too clever.

Subverting the process

Recently, Salesforce released a public incident writeup for a service outage that happened in mid-May. There’s a lot of good stuff in here (DNS! A config change!), but I want to focus on one aspect of the writeup, a contributing factor described in the writeup as Subversion of the Emergency Break Fix (EBF) process.

Here are some excerpts from that section of the writeup (emphasis in the original):

An [Emergency Break Fix] is an unplanned and urgent change that is required to prevent or remediate a Severity-0, a Severity-1, or a Severity-2 incident… Non-urgent changes, i.e. those which do not require immediate attention, should not be deployed as EBFs.

…In this situation, there was no active or imminent Severity-0, Severity-1 or Severity-2 incident, so the EBF process should not have been used, and standard Salesforce stagger processes should not have been ignored.

By following an emergency process, this change avoided the extensive review scrutiny that would have occurred had it been made as a standard change under the Salesforce Change Traffic Control (CTC) process. … In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

“What was the engineer thinking?” a reader wonders. I certainly did. People make decisions for reasons that make sense to them. I have no idea what the engineer’s reasoning was, because the writeup doesn’t even hint at it.

Is this process commonly circumvented by engineers for some reason? (i.e., was this situation actually more common than the writeup lets on?) Alternately, was the engineer facing atypical time pressure? If so, what was the nature of the time pressure?

One of the functions of public writeups is to give customers confidence in the organization’s ability to deal with future incidents. This section had the opposite effect: it filled me with dread. It communicates to me that the organization is not interested in understanding how actual work is done.

Woe betide the next engineer caught in the double bind where there will be consequences if they don’t work quickly enough and there will be consequences if they don’t conform to a process that slows them down so much that they can’t get their work done quickly enough.

Naming names in incident writeups

In a recent Twitter thread, Alex Hidalgo from Nobl9 made the following observation about his incident reports:

This is also why I write all incident reports anonymously. “Team A Engineer” or “Team B On-call” suffices.
— Alex Hidalgo (@ahidalgosre) May 22, 2021

I take the opposite approach: I never write any of my reports anonymously. Instead, I explicitly specify the names of all of the people involved. I wanted to write a post on why I do that.

I understand the motivation for providing anonymity. We feel guilt and shame when our changes contribute to an incident. The safety literature refers to this as the second victim phenomenon. We don’t write down an engineer’s name in a report because we don’t want to exacerbate the second victim effect. Also, the incident is about the system, not the particular engineer.

The reason I take the opposite approach of naming names is that I want to normalize the fact that incidents are aspects of the system, not of the individuals. I feel like providing anonymity implicitly sends the signal that “the names are omitted to protect the guilty.”

My strategy in doing these writeups is to lean as heavily as I can into demonstrating to the reader that all actions taken by the engineers involved were reasonable in the moment. I want them to read the writeup and think, “This could have been me!” I want to try to get the organization to a point where there is no shame in contributing to an incident: it’s an inevitable aspect of the work that we do.

In order to do this well, I try to write these up as much as possible from the perspective of the people involved. I find it really helps make the writeups look less judge-y (“normative”, in the jargon) by telling the story from the perspective of the individual, and calling attention to the systemic aspects.

And so, while I think Alex and I are both trying to get to the same place, we’re taking different routes.

Incident analysis as guerrilla case study research

Today I tweeted this:

Incident analysis is guerrilla case study research.
— Lorin Hochstein E_DOUBLE_BIND (@norootcause) April 11, 2021

To which Sasha Rosenbaum asked:

Wait, why?
— Sasha Rosenbaum (@DivineOps) April 11, 2021

This post is my response.

We seldom have time for introspection at work. If we’re lucky, we have the opportunity to do some kind of retrospective at the end of a project or sprint. But, generally speaking, we’re too busy working to spend time examining that work.

One exception to this is incidents: organizations are willing to spend effort on introspection after an incident happens. That’s because incidents are unsettling: people feel uneasy that the system failed in a way they didn’t expect.

And so, an organization is willing to spend precious engineering cycles in order to rid itself of the uneasy feeling that comes with a system failing unexpectedly. Let’s make sure this never happens again.

Incident analysis, in the learning from incidents in software (LFI) sense, is about using an incident as an opportunity to get a better understanding of how the overall system works. It’s a kind of case study, where the case is the incident. The incident acts as a jumping-off point for an analyst to study an aspect of the system. Just like any other case study, it involves collecting and synthesizing data from multiple sources (e.g., interviews, chat transcripts, metrics, code commits).

I call it a guerrilla case study because, from the organization’s perspective, the goal is really to get closure, to have a sense that all is right with the world. People want to get to a place where the failure mode is now well-understood and measures will be put in place to prevent it from happening again. As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Ideally, organizations would recognize the value of this sort of work, and would make it explicit that the goal of incident analysis is to learn as much as possible. They’d also invest in other types of studies that look into how the overall system works. Alas, that isn’t the world we live in, so we have to sneak this sort of work in where we can.