Transgressing the boundaries: Rasmussen and Woods

(With apologies to Alan Sokal)

Boundary according to Rasmussen

Jens Rasmussen was a giant in the field of safety science research. You can still see his influence on the field in the writings of safety researchers such as Sidney Dekker, Nancy Leveson, and David Woods.

One of Rasmussen’s most famous papers is “Risk management in a dynamic society: a modelling problem”. In that paper, Rasmussen proposed a model of system safety illustrated by the following diagram:

Reproduction of Fig. 3. The original caption reads: Under the presence of strong gradients behaviour will very likely migrate toward the boundary of acceptable performance

This model looks like it views the state of the system as a point in a state space. But Rasmussen described it as a model of the humans working within the system. He used the term “work space” rather than “state space”. In addition, Rasmussen used the metaphor of a gas particle undergoing local random movements, a phenomenon known as Brownian motion.

Along with the random movements, Rasmussen envisioned different forces (he called them gradients) that influenced how the work system would move within the work space. One of these forces was pressure from management to get more work done in order to make the company more profitable. Woods refers to this phenomenon as “faster/better/cheaper pressure”. This is the arrow labeled Management Pressure toward Efficiency, which pushes away from the Boundary to Economic Failure.

One way to get more work done is to give people increasing loads of work. But people don’t like having more and more work piled on them, and so there is opposing pressure from the workforce to reduce the amount of work they have to do. This is the arrow labeled Gradient toward Least Effort which pushes away from the Boundary to Unacceptable Work Load.

The result of those two pressures is movement towards what the diagram labels “the Boundary of functionally acceptable performance”. This is the safety boundary, and we don’t know exactly where it is, which is why there’s a second boundary in the diagram labelled “Resulting perceived boundary of acceptable performance.” Accidents happen when we cross the safety boundary.
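The drift described above can be illustrated with a toy simulation. This is a simplification for illustration, not something taken from Rasmussen’s paper: it collapses the work space to a single dimension (distance travelled toward the safety boundary), treats the net effect of the two gradients as a constant `drift`, and models the local random movements as Gaussian noise. The function name, parameters, and default values are all assumptions.

```python
import random


def steps_to_boundary(drift, noise=1.0, start=0.0, boundary=50.0,
                      max_steps=100_000, seed=0):
    """Simulate a one-dimensional random walk with drift.

    Each step adds a constant drift (the assumed net gradient) plus
    Gaussian noise (local random variation in behaviour). Returns the
    number of steps taken to reach the boundary, or None if the walk
    does not reach it within max_steps.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    position = start
    for step in range(1, max_steps + 1):
        position += drift + rng.gauss(0.0, noise)
        if position >= boundary:
            return step
    return None


# Same noise sequence, two hypothetical gradient strengths.
weak_gradient = steps_to_boundary(drift=0.1)
strong_gradient = steps_to_boundary(drift=2.0)
```

Because both runs share a seed, the only difference between them is the strength of the gradient, and the stronger gradient cannot take longer to arrive at the boundary than the weaker one. The random movements affect the path; the gradients decide the destination.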

Boundary according to Woods

In David Woods’s work, he also writes about the role of boundaries in system safety, but despite this surface similarity, his model isn’t the same as Rasmussen’s.

Instead of a work space, Woods refers to an envelope. He uses terms like competence envelope or design envelope or envelope of performance. Woods has done safety research in aviation, and so I suspect he was influenced by the concept of a flight envelope in aircraft design.

Diagram captioned Altitude envelope from the Wikipedia flight envelope page

The flight envelope defines a region in a state space that the aircraft is designed to function properly within. You can see in the diagram above that the envelope’s boundaries are defined by the stall speed, top speed, and maximum altitude. Bad things happen if you try to operate an aircraft outside of the envelope (hence the phrase pushing the envelope).

Woods’s competence envelope is a generalization of the idea of flight envelope to other types of systems. Any system has a range of inputs that it can handle: if you go outside that range, bad things happen.
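One way to picture an envelope with several boundaries is as a set of named ranges, where a state is inside the envelope only if it is inside every range. The sketch below is a deliberately crude, rectangular approximation (in a real flight envelope the limits depend on one another, for example stall speed varies with altitude). The `Bound` type, the `crossed_boundaries` helper, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    name: str    # which dimension of the state this bound applies to
    low: float   # smallest acceptable value
    high: float  # largest acceptable value


def crossed_boundaries(state, bounds):
    """Return the names of every bound that the given state violates."""
    return [b.name for b in bounds
            if not (b.low <= state[b.name] <= b.high)]


# Hypothetical envelope: numbers are made up for illustration.
envelope = [
    Bound("airspeed_knots", low=120, high=480),
    Bound("altitude_ft", low=0, high=40_000),
]

inside = crossed_boundaries(
    {"airspeed_knots": 300, "altitude_ft": 30_000}, envelope)
too_slow = crossed_boundaries(
    {"airspeed_knots": 100, "altitude_ft": 30_000}, envelope)
```

Here `inside` is an empty list, while `too_slow` names the airspeed bound. Crossing any single boundary is enough to put the system outside the envelope.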

Summarizing the differences

To Rasmussen, there is only one boundary in the work space related to accidents: the safety boundary. The other boundaries in the space generally aren’t even reachable, because of the natural pressure away from them. To Woods, the competence envelope is defined by multiple boundaries, and crossing any of them can result in an accident.

Both Rasmussen and Woods identified the role of faster/better/cheaper pressure in accidents. To Rasmussen, this pressure resulted in pushing the system to the safety boundary. But to Woods, this pressure changes the behavior at the boundary. Woods sees this pressure as contributing to brittleness, to systems that don’t perform well as they get close to the boundary of the performance envelope. Woods’s current work focuses on how systems can avoid being brittle by having the ability to move the boundary as they get closer to it: expanding the competence envelope. He calls this graceful extensibility.

Dealing with new kinds of trouble

The system is in trouble. Maybe a network link has gotten saturated, or a bad DNS configuration got pushed out. Maybe the mix of incoming requests suddenly changed and now there are a lot more heavy requests than light ones, and autoscaling isn’t helping. Perhaps a data feed got corrupted and there’s no easy way to bring the affected nodes back into a good state.

Whatever the specific details are, the system has encountered a situation that it wasn’t designed to handle. This is when the alerts go off and the human operators get involved. The operators work to reconfigure the system to get through the trouble. Perhaps they manually scale up a cluster that doesn’t scale automatically, or they recycle nodes, or make some configuration change or redirect traffic to relieve pressure from some aspect of the system.

If we think about the system in terms of the computer-y parts, the hardware and the software, then it’s clear that the system couldn’t handle this new failure mode. If it could, the humans wouldn’t have to get involved.

We can broaden our view of the system to also include the humans, sometimes known as the socio-technical system. In some cases, the socio-technical system is actually designed to handle cases that the software system alone can’t: these are the scenarios that we document in our runbooks. But, all too often, we encounter a completely novel failure mode. For the poor on-call, there’s no entry in the runbook that describes the steps to solve this problem.

In cases where the failure is completely novel, the human operators have to improvise: they have to figure out on the fly what to do, and then make the relevant operational changes to the system.

If the operators are effective, then even though the socio-technical system wasn’t designed to function properly in the face of this new kind of trouble, the people within the system make changes that result in the overall system functioning properly again.

It is this capability of a system, its ability to change itself when faced with a novel situation in order to deal effectively with that novelty, that David Woods calls graceful extensibility.

Here’s how Woods defines graceful extensibility in his paper “The Theory of Graceful Extensibility: Basic rules that govern adaptive systems”:

Graceful extensibility is the opposite of brittleness, where brittleness is a sudden collapse or failure when events push the system up to and beyond its boundaries for handling changing disturbances and variations. As the opposite of brittleness, graceful extensibility is the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries.

This idea is a real conceptual leap for those of us in the software world, because we’re used to thinking about the system only as the software and the hardware. The idea of a system like that adapting to a novel failure mode is alien to us, because we can’t write software that does that. If we could, we wouldn’t need to staff on-call rotations.

We humans can adapt: we can change the system, both the technical bits (e.g., changing configuration) and the human bits (e.g., changing communication patterns during an incident, either who we talk to or the communication channel involved).

However, because we don’t think of ourselves as being part of the system, when we encounter a novel failure mode, and then the human operators step in and figure out how to recover, our response is typically, “the system could not handle this failure mode (and so humans had to step in)”.

In one sense, that assessment is true: the system wasn’t designed to handle this failure mode. But in another sense, when we expand our view of the system to include the people, an alternate response is, “the system encountered a novel failure mode and we figured out how to make operational changes to make the system healthy again”.

We hit the boundary of what our system could handle, and we adapted, and we gracefully extended that boundary to include this novel situation. Our system may not be able to deal with some new kind of trouble. But, if the system has graceful extensibility, then it can change itself when the new trouble happens so it can deal with the trouble.

Objectives and constraints

Two leading thinkers of management in the twentieth century were Peter Drucker and W. Edwards Deming. Drucker developed the idea of management by objective that would eventually evolve into OKRs. In this approach, effective managers identify goals that can be operationalized (that’s the objective), identify metrics to measure to determine if progress is being made towards the goals (those are the key results), and then set targets for the metrics.

Deming was vehemently opposed to management by objective. Rather, he saw an organization as a system. If you wanted to improve the output of a system, you had to study it to figure out what the limiting factor was. Only once you understood the constraints that limited your system, could you address them by changing the system.

In the tech world, Drucker has clearly won out. His legacy can be seen in the adoption of OKRs by many tech companies (most famously, Intel and Google).

I’m in Deming’s camp, but I can understand why Drucker won. Drucker’s approach is much easier to put into practice than Deming’s. Specifically, Drucker gave managers an explicit process they could follow. On the other hand, Deming…, well, here’s a quote from Deming’s book Out of the Crisis:

Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.

I can see why a manager reading this might be frustrated with his exhortation to replace a specific process with “leadership”. But understanding a complex system is hard work, and there’s no process that can substitute for that. If you don’t understand the constraints that limit your system, how will you ever address them?

Why do config changes keep coming up in major incidents?

Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”

Me, a few years ago, making a similar observation

I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise of thinking out loud about possible explanations for this phenomenon.

It’s an illusion

It might be that config changes are not somehow more dangerous; it just seems like they are. Perhaps we only notice the writeups where a config change is mentioned, but we don’t remember the writeups that don’t involve a config change. Or perhaps it’s a base rate illusion, where config changes tend to be involved in incidents more often than code changes simply because config changes are more common than code changes.
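To make the base rate argument concrete, here is a small worked example. Every number in it is invented purely for illustration; none of it is real incident data.

```python
def incidents_per_change(incidents, changes):
    """Fraction of changes of a given kind that were involved in an incident."""
    return incidents / changes


# Hypothetical counts over some period: config changes are far more
# frequent than code changes in this made-up organization.
config_incidents, config_changes = 30, 10_000
code_incidents, code_changes = 10, 2_000

config_rate = incidents_per_change(config_incidents, config_changes)
code_rate = incidents_per_change(code_incidents, code_changes)

# Config changes show up in three times as many incidents...
config_shows_up_more = config_incidents > code_incidents
# ...yet each individual config change is less risky than a code change.
config_is_safer_per_change = config_rate < code_rate
```

In this made-up scenario, someone reading incident writeups would see config changes three times as often as code changes, even though the per-change risk is lower (0.3% versus 0.5%). That is what a base rate illusion would look like.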

I don’t believe this hypothesis: I think the config change effect is a real one.

Config changes as second-class

In the recent Salesforce incident, the writeup noted that:

For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.  

If an organization has the ability to stage their changes across different domains, I’d wager heavily that they supported staged code deployments before they supported staged configuration changes. That’s certainly true at Netflix, where Spinnaker had support for regional rollout of code changes well before it had support for regional rollout of config changes.

This one feels like a real contributor to me. I’ve found that deployment tooling tends to support code changes better than config changes: there’s just more engineering effort put into making code changes safer.
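For readers who haven’t worked with staged rollouts, the core idea fits in a few lines. The sketch below is a generic illustration, not how Spinnaker or Salesforce’s tooling actually works: the `staged_rollout` function and its three callback parameters are hypothetical names, and a real system would also wait between stages and watch metrics over time.

```python
def staged_rollout(stages, apply_change, is_healthy, roll_back):
    """Apply a change one stage at a time.

    Stops at the first stage that looks unhealthy after the change and
    rolls back every stage touched so far, most recent first.
    Returns (succeeded, stages_that_were_touched).
    """
    applied = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        if not is_healthy(stage):
            for done in reversed(applied):
                roll_back(done)
            return False, applied
    return True, applied


# Usage with stand-in callbacks: the change breaks "region-a".
log = []
ok, touched = staged_rollout(
    ["canary", "region-a", "region-b"],
    apply_change=lambda stage: log.append(("apply", stage)),
    is_healthy=lambda stage: stage != "region-a",
    roll_back=lambda stage: log.append(("rollback", stage)),
)
```

In this run the change never reaches `region-b`, which is the whole point of staggering: the blast radius is limited to the stages touched before the problem was noticed. Nothing in the loop cares whether the change is code or config, so the gap is in tooling investment rather than in the technique.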

Config changes are hard to stage

In the case of the Salesforce incident, the configuration change could theoretically have been staged. However, it may be that configuration changes by their nature are harder to roll out in a staged fashion. Configuration is more likely to be inherently global than code.

I’m really not sure about this one. I have no sense as to how many config changes can be staged.

Config changes are hard to test

Have you ever written a unit test for a configuration value? I haven’t. It might be that config-change related problems only manifest when deployed into a production environment, so you couldn’t catch them at a smaller scope like a unit test.

I suspect this hypothesis plays a significant role as well.
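To be fair, some narrow checks on configuration are possible at unit-test scope. The sketch below validates the shape of a config before it ships. The `validate_config` function and the two keys it checks (`request_timeout_seconds` and `dns_servers`) are hypothetical, chosen only for illustration.

```python
def validate_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    timeout = config.get("request_timeout_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (not isinstance(timeout, (int, float))
            or isinstance(timeout, bool)
            or timeout <= 0):
        problems.append("request_timeout_seconds must be a positive number")

    servers = config.get("dns_servers")
    if not isinstance(servers, list) or not servers:
        problems.append("dns_servers must be a non-empty list")

    return problems


good = validate_config(
    {"request_timeout_seconds": 30, "dns_servers": ["10.0.0.1"]})
bad = validate_config(
    {"request_timeout_seconds": 0, "dns_servers": []})
```

A check like this catches malformed or out-of-range values, but it says nothing about whether a well-formed value will behave badly once it interacts with real traffic, which is the kind of problem that only shows up in production.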

Mature systems are more config-driven

Perhaps the sort of systems that are involved in large-scale outages at big tech companies are the more mature, reliable systems. These are the types of software that have evolved over time to enable operators to control more of their behavior by specifying policy in configuration.

This means that an operator is more likely to be able to achieve a desired behavior change via config versus code. And that sounds like a good thing. We all know that hard-coding things is bad, and changing code is dangerous. In the limit, we wouldn’t have to make any code changes at all to achieve the desired system behavior.

So, perhaps the fact that config changes are more commonly implicated in large-scale outages is a sign of the maturity of the systems?

I have no idea about this one. It seems like a clever hypothesis, but perhaps it’s too clever.

Subverting the process

Recently, Salesforce released a public incident writeup for a service outage that happened in mid-May. There’s a lot of good stuff in here (DNS! A config change!), but I want to focus on one aspect of the writeup, a contributing factor described in the writeup as Subversion of the Emergency Break Fix (EBF) process.

Here are some excerpts from that section of the writeup (emphasis in the original):

An [Emergency Break Fix] is an unplanned and urgent change that is required to prevent or remediate a Severity-0, a Severity-1, or a Severity-2 incident… Non-urgent changes, i.e. those which do not require immediate attention, should not be deployed as EBFs.

In this situation, there was no active or imminent Severity-0, Severity-1 or Severity-2 incident, so the EBF process should not have been used, and standard Salesforce stagger processes should not have been ignored. 

By following an emergency process, this change avoided the extensive review scrutiny that would have occurred had it been made as a standard change under the Salesforce Change Traffic Control (CTC) process. … In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

“What was the engineer thinking?” a reader wonders. I certainly did. People make decisions for reasons that make sense to them. I have no idea what the engineer’s reasoning was here, because the writeup doesn’t even hint at it.

Is this process commonly circumvented by engineers for some reason? (i.e., was this situation actually more common than the writeup lets on?) Alternately, was the engineer facing atypical time pressure? If so, what was the nature of the time pressure?

One of the functions of public writeups is to give customers confidence in the organization’s ability to deal with future incidents. This section had the opposite effect: it filled me with dread. It communicates to me that the organization is not interested in understanding how actual work is done.

Woe be it to the next engineer caught in the double bind where there will be consequences if they don’t work quickly enough and there will be consequences if they don’t conform to a process that slows them down so much that they can’t get their work done quickly enough.

Naming names in incident writeups

In a recent Twitter thread, Alex Hidalgo from Nobl9 made the following observation about his incident reports:

I take the opposite approach: I never write any of my reports anonymously. Instead, I explicitly specify the names of all of the people involved. I wanted to write a post on why I do that.

I understand the motivation for providing anonymity. We feel guilt and shame when our changes contribute to an incident. The safety literature refers to this as the second victim phenomenon. We don’t write down an engineer’s name in a report because we don’t want to exacerbate the second victim effect. Also, the incident is about the system, not the particular engineer.

The reason I take the opposite approach of naming names is that I want to normalize the fact that incidents are aspects of the system, not the individuals. I feel like providing anonymity implicitly sends the signal that “the names are omitted to protect the guilty.”

My strategy in doing these writeups is to lean as heavily as I can into demonstrating to the reader that all actions taken by the engineers involved were reasonable in the moment. I want them to read the writeup and think, “This could have been me!” I want to try to get the organization to a point where there is no shame in contributing to an incident: it’s an inevitable aspect of the work that we do.

In order to do this well, I try to write these up as much as possible from the perspective of the people involved. I find it really helps make the writeups look less judge-y (“normative”, in the jargon) by telling the story from the perspective of the individual, and calling attention to the systemic aspects.

And so, while I think Alex and I are both trying to get to the same place, we’re taking different routes.