SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google and remains strongly associated with the company. Google has published three O’Reilly books on the subject, the first in 2016. I won’t say more about SRE’s history here; the Google books and many other sources cover it for those who want to know more.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it, what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, when researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on the "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focus expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on behalf of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", which I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the safety boundary is, and it’s the line we try to stay behind.
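To make the moving-point picture concrete, here is a minimal sketch in Python. It is a toy of my own, not Rasmussen’s formulation: I collapse the space to a single axis, fold the economic and workload boundaries into one right-hand limit, and invent every number and name (Boundaries, step, simulate, cost_pressure, effort_pressure, pushback) purely for illustration. The only idea taken from the model is that management and practitioner pressures both push the operating point toward the safety boundary, and that we push back once we cross the line where we think that boundary is.

```python
from dataclasses import dataclass


@dataclass
class Boundaries:
    # All values are made-up positions on a single axis (assumption).
    actual_safety: float = 0.0     # true safety boundary; unknowable in practice
    perceived_safety: float = 0.2  # where we believe the safety boundary is
    right_limit: float = 1.0       # stands in for the economic and workload boundaries


def step(position, b, cost_pressure=0.05, effort_pressure=0.03, pushback=0.15):
    """Move the operating point one step along the axis."""
    # Pressure away from economic failure and away from excessive workload
    # both move the point left, toward the safety boundary.
    position -= cost_pressure + effort_pressure
    # Crossing the perceived boundary triggers a corrective push back to the right.
    if position < b.perceived_safety:
        position += pushback
    return min(position, b.right_limit)


def simulate(start, steps, b, **pressures):
    """Return the path of the operating point, stopping early at an incident."""
    path = [start]
    for _ in range(steps):
        nxt = step(path[-1], b, **pressures)
        path.append(nxt)
        if nxt <= b.actual_safety:
            break  # the point crossed the real boundary: an incident
    return path


if __name__ == "__main__":
    b = Boundaries()
    stable = simulate(0.6, 50, b)                # pushback outweighs the drift
    drift = simulate(0.6, 50, b, pushback=0.05)  # pushback too weak to hold the line
    print("stable run, closest approach:", round(min(stable), 2))
    print("weak pushback, steps until incident:", len(drift) - 1)
```

With the default numbers the point hovers around the perceived boundary and never reaches the real one. When the corrective push is weaker than the combined pressure, the point keeps sliding left until it crosses the safety boundary.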

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the safety boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that hold across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.

I’ve also started a CSE reading list.
