SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the history of SRE here, but there are many other sources (including the Google books) for those who want to know more.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it, what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, where researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focus expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on behalf of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", what I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and it’s what we try to stay away from.

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the safety boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that hold across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work are excellent.

I’ve also started a CSE reading list.

This mess we’re in

Most real software systems feel “messy” to the engineers who work on them. I’ve found that software engineers tend to fall into one of two camps on the nature of this messiness.

Camp 1: Problems with the design

One camp believes that the messiness is primarily related to sub-optimal design decisions. These design decisions might simply be poor decisions, or they might be because we aren’t able to spend enough time getting the design right.

My favorite example of this school of thought can be found in the text of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology:

The best way to cope with complexity is to avoid it. Programs that do a lot for us are inherently complex. But instead of surrendering to complexity, we must control it. We must keep our systems as simple as possible. Having two different kinds of drag-and-drop behavior is more complicated than having just one. And that is needless complexity.

We must keep our systems simple enough so we can understand them.
And the fundamental tool for doing this is the logic of mathematics.

Leslie Lamport, The Future of Computing: Logic or Biology

Camp 2: The world is messy

Engineers in the second camp believe that reality is just inherently messy, and that mess ends up being reflected in software systems that have to model the world. Rich Hickey describes this in what he calls “situated programs” (emphasis mine):

And they [situated programs] deal with real-world irregularity. This is the other thing I think that’s super-critical, you know, in this situated programming world. It’s never as elegant as you think, the real-world.

And I talked about that scheduling problem of, you know, those linear times, somebody who listens all day, and the somebody who just listens while they’re driving in the morning and the afternoon. Eight hours apart there’s one set of people and, then an hour later there’s another set of people, another set. You know, you have to think about all that time. You come up with this elegant notion of multi-dimensional time and you’d be like, “oh, I’m totally good…except on Tuesday”. Why? Well, in the U.S. on certain kinds of genres of radio, there’s a thing called “two for Tuesday”. Right? So you built this scheduling system, and the main purpose of the system is to never play the same song twice in a row, or even pretty near when you played it last. And not even play the same artist near when you played the artist, or else somebody’s going to say, “all you do is play Elton John, I hate this station”.

But on Tuesday, it’s a gimmick. “Two for Tuesday” means, every spot where we play a song, we’re going to play two songs by that artist. Violating every precious elegant rule you put in the system. And I’ve never had a real-world system that didn’t have these kinds of irregularities. And where they weren’t important.

Rich Hickey, Effective Programs: 10 Years of Clojure

It will come as no surprise to readers of this blog that I fall into the second camp. I do think that sub-optimal design decisions also contribute to messiness in software systems, but I think those are inevitable because unexpected changes and time pressures are inescapable. This is the mess we’re in.

Why you can’t just ask “why”

Today, most AI work is based on neural networks, but back in the 1980s, AI researchers were using a different approach: they built rule-based systems using mathematical logic. This was the heyday of Lisp and Prolog, which were well suited to implementing these systems.

One approach AI researchers used was to sit down with an expert and elicit the rules they used to perform a task. For example, an AI researcher might conduct a series of interviews with a doctor in order to determine how the doctor diagnosed illnesses based on symptoms. The researcher would then encode those rules to build an expert system: a software package that would, ideally, perform tasks as well as an expert.
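To make the idea concrete, here is a minimal sketch of what an encoded rule set of this kind might look like. It is written in Python rather than the Lisp or Prolog of the era, and everything in it is an assumption made for illustration: the `Rule` type, the `diagnose` function, and the symptom-to-diagnosis rules are invented, not taken from any real expert system or medical source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One elicited rule: if all `symptoms` are present, suggest `diagnosis`."""
    symptoms: frozenset
    diagnosis: str


# Hypothetical rules, invented for illustration only (not medical advice).
RULES = [
    Rule(frozenset({"fever", "cough", "body aches"}), "influenza"),
    Rule(frozenset({"sneezing", "runny nose"}), "common cold"),
    Rule(frozenset({"fever", "stiff neck", "headache"}), "meningitis"),
]


def diagnose(observed, rules=RULES):
    """Return the diagnosis of every rule whose symptoms are all observed."""
    observed = frozenset(observed)
    # A rule fires only when its symptom set is a subset of what was observed.
    return [rule.diagnosis for rule in rules if rule.symptoms <= observed]


if __name__ == "__main__":
    print(diagnose({"fever", "cough", "body aches", "headache"}))
```

The sketch only ever matches explicit conditions: a case that doesn't fit a stated rule produces no answer at all, no matter how familiar it might look to someone with experience.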

Alas, the results were disappointing: these expert systems never measured up to the performance of those human experts. Two brothers, Stuart Dreyfus (a professor of industrial engineering and operations research) and Hubert Dreyfus (a professor of philosophy), published a book in 1986 titled Mind Over Machine that described why this approach to building expert systems by eliciting and encoding rules from experts could never really work. It turns out that experts don’t actually solve problems by following a set of rules. Instead, they rely more on intuition and pattern-matching based on a repertoire of cases they’ve built up from their experience [1].

Yet, even though those experts didn’t solve problems by following rules, they were still able to articulate a set of rules that they claimed to follow when asked. And they weren’t trying to deceive the AI researchers. Instead, something else was going on. The experts were inventing explanations without even being aware that they were doing so. Philosophers of mind use the term confabulation (technically broad confabulation) to refer to this phenomenon: how people will unknowingly fabricate explanations for their actions.

And therein lies the problem of asking “why”.

In the wake of an incident, we often want to understand why it is people did certain things: both for the people whose actions contributed to the incident (why did they make a global configuration change?) and for people whose actions mitigated the incident (why did they suspect a retry storm rather than a DDoS attack?).

The problem is, you can’t just ask people why, because people confabulate. You can, of course, simply ask people why they took the actions they did. Heck, you might even get a confident, articulated explanation. But you shouldn’t believe that the explanation they give corresponds to reality.

Yet, getting at the why is important. This is not a case where “‘Why?’ is the wrong question”, the way it is with Five Whys-style questions. There is real value in understanding how people came to the decisions they did, by learning about the signals they received at the time, and how their previous experiences shaped their perspectives. That’s where having a skilled interviewer comes in.

A skilled interviewer will increase the chances of getting an accurate response by asking questions to bring the interviewee back into the frame of mind that they were in during the incident. Instead of asking an engineer to explain their actions (Why did you do X?), they’ll ask questions to try to jog their memory of what they were experiencing during the incident: What were you doing when the page went off? Where did you look first? What did you see? And then what did you do? Because we know that experts do pattern-matching, they’ll also ask questions like, have you ever seen this symptom before? These questions can elicit responses about previous experiences they’ve had in similar situations, which can provide context on how they made their decisions in this case.

Eliciting this sort of information from an interview is hard, and it takes real skill. We should take this sort of work seriously.

[1] The field of research known as naturalistic decision making studies how experts make decisions.

Asking the right “why” questions

In the [Cognitive Systems Engineering] terminology, it is more important to understand what a joint cognitive system (JCS) does and why it does it, than to explain how it does it. [emphasis in the original]

Erik Hollnagel & David D. Woods, Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, p22

In my previous post, I linked to a famous essay by John Allspaw: The Infinite Hows (or, the Dangers Of The Five Whys). The main thrust of Allspaw’s essay can be summed up in this five-word excerpt:

“Why?” is the wrong question.

As illustrated by the quote from Hollnagel & Woods at the top of this post, it turns out that cognitive systems engineering (CSE) is very big on answering “why” questions. Allspaw’s perspective on incident analysis is deeply influenced by research from cognitive systems engineering. So what’s going on here?

It turns out that the CSE folks are asking different kinds of “why” questions than the root cause analysis (RCA) folks. The RCA folks ask why did this incident happen? The CSE folks ask why did the system adopt the sorts of behaviors that contributed to the incident?

Those questions may sound similar, but they start from opposite assumptions. The RCA folks start with the assumption that there’s some sort of flaw in the system, a vulnerability that was previously unknown, and then base their analysis on identifying what that vulnerability was.

The CSE folks, on the other hand, start with the assumption that behaviors exhibited by the system developed through adaptation to existing constraints. The “why” question here is “why is this behavior adaptive? What purpose does it serve in the system?” Then they base the analysis on identifying attributes of the system such as constraints and goal conflicts that would explain why this behavior is adaptive.

This is one of the reasons why the CSE folks are so interested in incidents to begin with: because they can expose these kinds of constraints and conflicts that are part of the context of a system. It’s similar to how psychologists use optical illusions to study the heuristics that the human visual system employs: you look at the circumstances under which a system fails to get some insight into how it normally functions as well as it does.

“Why” questions can be useful! But you’ve got to ask the right ones.

Making peace with “root cause” during anomaly response

We haven’t figured out the root cause yet.

Uttered by many an engineer while responding to an anomaly

One of the contributions of cognitive systems engineering is treating anomaly response as something worthy of study. Here’s how Woods and Hollnagel describe it:

In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity―what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

David D. Woods, Erik Hollnagel, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, Chapter 8, p71

This type of work will be instantly recognizable to anyone who has been involved in software operations work, even though the domains that cognitive systems engineering researchers initially focused on are completely different (e.g., nuclear power plants, anesthesiology, commercial aviation, space flight).

Anomaly response involves multiple people working together, coordinating on resolving a common problem. Here’s Woods and Hollnagel again, discussing an exchange between two anesthesiologists during a neurosurgery case:

The situation calls for an update to the shared model of the case and its likely trajectory in the future … The exchange is very compact and highly coded, yet it serves to update the common ground previously established at the start of the case… Interestingly, the resident and attending after the update appear to be without candidate explanations as several possibilities have been dismissed given other findings (the resident is quite explicit in this case. After describing the unexpected event, he also adds―“but no explanation”).

p93, ibid

Just like those anesthesiologists, practitioners in all domains often communicate using “compact and highly coded” jargon. I’ve seen claims that jargon is intended to obfuscate, but it’s just the opposite during anomaly response: a team that shares jargon can communicate more efficiently, because of the pre-existing shared context about the precise meaning of those terms (assuming, of course, the team members understand those terms the same way).

That brings us to the “root cause”. Let’s start with a few words about root cause analysis.

Root cause analysis (RCA) is an approach for identifying why an incident happened. It’s often linked to the Five Whys approach, which is itself associated with Toyota. Members of the resilience engineering community have been very critical of RCA. For one critical take, check out John Allspaw’s piece The Infinite Hows (or, the Dangers of The Five Whys). Allspaw makes a compelling case, and I agree with him.

What I’m arguing in this post is that the term “root cause” has a completely different connotation when used during anomaly response than when used during post-incident analysis. When, during anomaly response, an engineer says “I haven’t found the root cause”, they do not mean, “I have not yet performed a Five-Whys root cause analysis”. Instead, they mean “the signals I am observing are inconsistent with my mental model of how the system behaves”. Or, less formally, “I know something’s wrong here, but I don’t know what it is!”

When an engineer says “we don’t know the root cause yet” during anomaly response, everybody involved understands what they mean. If you were to reply “actually, there is no such thing as root cause”, the best response you could hope for is a blank stare. The engineers aren’t talking about Five-Whys in this context. Instead, they’re doing dynamic fault management. They’re trying to make sense of what’s currently happening.

Because I’m one of those folks who is critical of RCA, I used to try to encourage people to say, “we don’t understand the failure mode yet” instead of “we don’t know the root cause yet”. But I’m going to stop encouraging them, because I have come around to believing that “we don’t know the root cause” is effective jargon when coordinating during anomaly response.

The post-incident investigation context is another story entirely, and I’m still going to fight the battles against root cause during the post-incident work. But, just as I wouldn’t try to do a one-on-one interview with an engineer while they were engaged in an incident, I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Yes, it will probably make it a little harder to get an organization to shake off the notion of “root cause” if people still freely use the term during anomaly response. And I won’t use the term myself. But it’s a battle I no longer think is worth fighting.

The hard parts about making it look easy

Bill Clinton was known for projecting warmth in a way that Hillary Clinton didn’t. Yet, when journalists would speak to people who knew them both personally, the story they’d get back was the opposite: one-on-one, it was Hillary Clinton who was the warm one.

We use terms like warmth and authenticity as if they were character attributes of people. But imagine if you were asked to give a speech in front of a large audience. Do you think that if you came off as wooden or stilted, that would be an indicator of how authentic you are as a person?

The ability to project authenticity or warmth is a skill. Experts exhibiting skilled behavior often appear to do it effortlessly. When we watch a virtuoso perform in a domain we know something about, we exclaim “they make it look so easy!”, because we know how much harder it is than it looks.

The resilience engineering researcher David Woods calls this phenomenon the law of fluency, which he defines as:

“Well”-adapted work occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced.

Joint Cognitive Systems: Patterns in Cognitive Systems Engineering (p20)

This law is the source of two problems.

First of all, novices tend to mistake skilled performance that seems effortless for innate talent, rather than seeing it as a skill that was developed with practice. They don’t see the work, so they don’t know how to get there.

Second of all, skilled practitioners are at increased risk of undetected burnout because they make it look easy even when they are working too hard. This is something that’s easy to miss unless we actively probe for it.