Safe by design?

I’ve been enjoying the ongoing MIT STAMP workshop. In particular, I’ve been enjoying listening to Nancy Leveson talk about system safety. Leveson is a giant in the safety research community (and, incidentally, an author of my favorite software engineering study). She’s also a critic of root cause and human error as explanations for accidents. Despite this, she has a different perspective on safety than many in the resilience engineering community. To sharpen my thinking, I’m going to capture my understanding of this difference in the post below.

From Leveson’s perspective, the engineering design should ensure that the system is safe. More specifically, the design should contain controls that eliminate or mitigate hazards. In this view, accidents are invariably attributable to design errors: a hazard in the system was not effectively controlled in the design.

By contrast, many in the resilience engineering community claim that design alone cannot ensure that the system is safe. The idea here is that the system design will always be incomplete, and the human operators must adapt their local work to make up for the gaps in the designed system. These adaptations usually contribute to safety, and sometimes contribute to incidents, and in post-incident investigations we often only notice the latter case.

These perspectives are quite different. Leveson believes that depending on human adaptation in the system is itself dangerous. If we’re depending on human adaptation to achieve system safety, then the design engineers have not done their jobs properly in controlling hazards. The resilience engineering folks believe that depending on human adaptation is inevitable, because of the messy nature of complex systems.

All we can do is find problems

I’m in the second week of the three-week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe, after the design has been completed.

Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:

  • If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
  • Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
  • The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration

This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.

As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, the best we can do is to get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach is this week while I’ve been attending the STAMP workshop. STAMP is an approach to systems safety developed by Nancy Leveson.

The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.
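To make that concrete, here is a minimal sketch of the idea in Python. The class name, field names, and the three example components are all my own inventions for illustration, not STAMP terminology or anything from Leveson's material. The point is only that one shape of model can describe very different kinds of things.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Anything that can issue control actions and receive feedback (hypothetical model)."""
    name: str
    control_actions: list[str] = field(default_factory=list)
    feedback: list[str] = field(default_factory=list)


# Three very different components described with the same model.
# All of these are made-up examples.
autoscaler = Controller(
    name="autoscaler",
    control_actions=["add instance", "remove instance"],
    feedback=["CPU utilization"],
)
on_call_engineer = Controller(
    name="on-call engineer",
    control_actions=["roll back deploy", "disable autoscaler"],
    feedback=["alerts", "dashboards"],
)
regulator = Controller(
    name="regulatory body",
    control_actions=["issue rule", "levy fine"],
    feedback=["incident reports", "audits"],
)


def summarize(controller: Controller) -> str:
    """Describe a component purely by what it can do and what it can observe."""
    actions = ", ".join(controller.control_actions)
    signals = ", ".join(controller.feedback)
    return f"{controller.name}: acts via [{actions}], observes [{signals}]"


summaries = [summarize(c) for c in (autoscaler, on_call_engineer, regulator)]
```

Nothing in the model says how the autoscaler, the engineer, or the regulator works internally, which is what makes it functionalist.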

Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.

What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I didn’t deliberately take a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

“How could they be so stupid?”

From the New York Times story on the recent Twitter hack:

Mr. O’Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. 

It’s too soon after this incident to put too much faith in the reporting, but let’s assume it’s accurate. A collective cry of “Posting credentials to a Slack channel? How could engineers at Twitter be so stupid?” rose up from the internet. It’s a natural reaction, but it’s not a constructive one.

I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?

The problem is that we on the outside can’t, because we don’t have enough information. This type of lapse is a classic example of a workaround. People in a system use workarounds (they do things the “wrong” way) when there are obstacles to doing things the “right” way.

There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.

I’m willing to bet that there are people in your organization who use workarounds. You probably use some yourself. Identifying those workarounds teaches us something about how the system works, and how people have to do things the “wrong” way to actually get their work done.

Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.

When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?” Because, unless you gain a deeper understanding of your system, the problems that motivated the workaround aren’t going to go away.

A reasonable system

Reasonable is an adjective we typically apply to humans, or something we implore of them (“Be reasonable!”). And, while I do want reasonable colleagues, what I really want is a reasonable system.

By reasonable system, I mean a system whose behavior I can reason about, both backwards and forwards in time. Given my understanding of how the system works, and the signals that are emitted by the system, I want to be able to understand its past behavior, and predict what its behavior is going to be in the future.

Who’s afraid of serializability?

Kyle Kingsbury’s Jepsen recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializability transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Here’s one of the counterexamples that Jepsen’s tool, Elle, found:

In this counterexample, there are two list objects, here named 1799 and 1798, which I’m going to call x and y. The examples use two list operations, append (denoted "a") and read (denoted "r").

Here’s my redrawing of the example. I’ve drawn all operations against x in blue and against y in red. Note that I’m using empty list ([]) instead of nil.

There are two transactions, which I’ve denoted T1 and T2, and each one involves operations on two list objects, denoted x and y. The lists are initially empty.

For transactions that use the serializability isolation model, all of the operations in all of the transactions have to be consistent with some sequential ordering of the transactions. In this particular example, that means that all of the operations have to make sense assuming either:

  • all of the operations in T1 happened before all of the operations in T2
  • all of the operations in T2 happened before all of the operations in T1

Assume order: T1, T2

If we assume T1 happened before T2, then the operations for x are:

      x = [] 
T1:   x.append(2)
T2:   x.read() → []

This history violates the contract of a list: we’ve appended an element to a list but then read an empty list. It’s as if the append didn’t happen!

Assume order: T2, T1

If we assume T2 happened before T1, then the operations for y are:

      y = []
T2:   y.append(4)
      y.append(5)
      y.read() → [4, 5]
T1:   y.read() → []

This history violates the contract of a list as well: we read [4, 5] and then [ ]: it’s as if the values disappeared!
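The same check can be done mechanically. Here is a minimal sketch in Python that tries every serial ordering of the two transactions and replays the operations against plain lists. The tuple encoding and the function names are my own, and the order of operations inside each transaction is my reconstruction from the histories above; this is not how Elle works.

```python
from itertools import permutations

# Each operation is (kind, object, value). For an append, value is the element
# appended; for a read, value is the list the transaction observed.
T1 = [("append", "x", 2), ("read", "y", [])]
T2 = [("append", "y", 4), ("append", "y", 5), ("read", "y", [4, 5]), ("read", "x", [])]


def replay(order):
    """Run the transactions one after another.

    Returns None if every read matches the replayed state, otherwise a tuple
    (transaction, object, observed, expected) for the first read that does not.
    """
    state = {"x": [], "y": []}
    for name, ops in order:
        for kind, obj, value in ops:
            if kind == "append":
                state[obj].append(value)
            elif state[obj] != value:
                return (name, obj, value, list(state[obj]))
    return None


def serial_orders(transactions):
    """Return the names of every serial ordering that explains all of the reads."""
    return [
        [name for name, _ in order]
        for order in permutations(transactions)
        if replay(order) is None
    ]


transactions = [("T1", T1), ("T2", T2)]
valid_orders = serial_orders(transactions)
```

With these two transactions, `valid_orders` comes back empty: neither ordering explains every read, which is what it means for the history to not be serializable. Brute force is fine for two transactions, but the number of orderings grows factorially, so this is only an illustration.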

Kingsbury indicates that this pair of transactions is illegal by annotating the operations with arrows that show required orderings. The "rw" arrow means that the read operation at the tail of the arrow must be ordered before the write operation at the head of the arrow. If the arrows form a cycle, then the example violates serializability: there’s no possible ordering that can satisfy all of the arrows.
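The cycle argument is easy to sketch in code. Below, I treat each transaction as a node and each "rw" arrow as a directed edge, then look for a cycle with a depth-first search. The two edges are my reading of the histories above (T1 read y as empty, so it must come before T2's appends to y; T2 read x as empty, so it must come before T1's append to x), and the function name is my own. Elle's actual analysis is considerably more sophisticated than this.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, destination) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    # Standard three-color depth-first search.
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    color = {node: UNVISITED for node in graph}

    def visit(node):
        color[node] = IN_PROGRESS
        for nxt in graph[node]:
            if color[nxt] == IN_PROGRESS:
                return True  # found a back edge, so there is a cycle
            if color[nxt] == UNVISITED and visit(nxt):
                return True
        color[node] = DONE
        return False

    return any(color[node] == UNVISITED and visit(node) for node in graph)


# "rw" edges: the reader must be ordered before the writer.
rw_edges = [
    ("T1", "T2"),  # T1 read y = [] before T2 appended 4 and 5 to y
    ("T2", "T1"),  # T2 read x = [] before T1 appended 2 to x
]

violates_serializability = has_cycle(rw_edges)
```

Because T1 must come before T2 and T2 must come before T1, the graph has a cycle and no serial order exists.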

Serializability, linearizability, locality

This example is a good illustration of how serializability differs from linearizability. Linearizability is a consistency model that also requires that operations be consistent with a sequential ordering. However, linearizability is only about individual objects, where transactions refer to collections of objects.

(Linearizability also requires that if operation A happens before operation B in time, then operation A must take effect before operation B, and serializability doesn’t require that, but let’s put that aside for now).

The counterexample above is a linearizable history: we can order the operations such that they are consistent with the contracts of x and y. Here’s an example of a valid history, which is called a linearization:

x = []
y = []
x.read() → []
x.append(2)
y.read() → []
y.append(4)
y.append(5)
y.read() → [4, 5]

Note how the operations between the two transactions are interleaved. This is forbidden by transactional isolation, but the definition of linearizability does not take into account transactions.

This example demonstrates how it’s possible to have histories that are linearizable but not serializable.

We say that linearizability is a local property where serializability is not: by the definition of linearizability, we can identify if a history is linearizable by looking at the histories of the individual objects (x, y). However, we can’t do that for serializability.
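Here is a small Python sketch of what "looking at the histories of the individual objects" means. It takes the linearization shown above, projects it onto each object, and checks each projection against the list contract on its own. The encoding and function names are mine, and like the discussion above it sets aside the real-time ordering requirement of linearizability, so it is a check of one candidate ordering rather than a full linearizability checker.

```python
# The linearization from above, as (kind, object, value) tuples.
history = [
    ("read", "x", []),
    ("append", "x", 2),
    ("read", "y", []),
    ("append", "y", 4),
    ("append", "y", 5),
    ("read", "y", [4, 5]),
]


def project(history, obj):
    """Keep only the operations that touch a single object."""
    return [(kind, value) for kind, o, value in history if o == obj]


def respects_list_contract(ops):
    """Replay operations on one list; every read must see exactly what was appended so far."""
    state = []
    for kind, value in ops:
        if kind == "append":
            state.append(value)
        elif state != value:
            return False
    return True


# Each object is checked in isolation, with no knowledge of transactions.
per_object_ok = {obj: respects_list_contract(project(history, obj)) for obj in ("x", "y")}
```

Both x and y pass when checked independently. No per-object check like this could catch the serializability violation, because that violation only shows up when you consider which operations belong to the same transaction.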

SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the background of SRE here, but there are many other sources (including the Google books) for those who want to know more about the background.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it, what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, where researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focus expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on behalf of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", what I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and where we try to stay away from.

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the safety boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that hold across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.

I’ve also started a CSE reading list.

This mess we’re in

Most real software systems feel “messy” to the engineers who work on them. I’ve found that software engineers tend to fall into one of two camps on the nature of this messiness.

Camp 1: Problems with the design

One camp believes that the messiness is primarily related to sub-optimal design decisions. These design decisions might simply be poor decisions, or they might be because we aren’t able to spend enough time getting the design right.

My favorite example of this school of thought can be found in the text of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology:

The best way to cope with complexity is to avoid it. Programs that do a lot for us are inherently complex. But instead of surrendering to complexity, we must control it. We must keep our systems as simple as possible. Having two different kinds of drag-and-drop behavior is more complicated than having just one. And that is needless complexity.

We must keep our systems simple enough so we can understand them.
And the fundamental tool for doing this is the logic of mathematics.

Leslie Lamport, The Future of Computing: Logic or Biology

Camp 2: The world is messy

Engineers in the second camp believe that reality is just inherently messy, and that mess ends up being reflected in software systems that have to model the world. Rich Hickey describes this in what he calls “situated programs” (emphasis mine)

And they [situated programs] deal with real-world irregularity. This is the other thing I think that’s super-critical, you know, in this situated programming world. It’s never as elegant as you think, the real-world.

And I talked about that scheduling problem of, you know, those linear times, somebody who listens all day, and the somebody who just listens while they’re driving in the morning and the afternoon. Eight hours apart there’s one set of people and, then an hour later there’s another set of people, another set. You know, you have to think about all that time. You come up with this elegant notion of multi-dimensional time and you’d be like, “oh, I’m totally good…except on Tuesday”. Why? Well, in the U.S. on certain kinds of genres of radio, there’s a thing called “two for Tuesday”. Right? So you built this scheduling system, and the main purpose of the system is to never play the same song twice in a row, or even pretty near when you played it last. And not even play the same artist near when you played the artist, or else somebody’s going to say, “all you do is play Elton John, I hate this station”.

But on Tuesday, it’s a gimmick. “Two for Tuesday” means, every spot where we play a song, we’re going to play two songs by that artist. Violating every precious elegant rule you put in the system. And I’ve never had a real-world system that didn’t have these kinds of irregularities. And where they weren’t important.

Rich Hickey, Effective Programs: 10 Years of Clojure

It will come as no surprise to readers of this blog that I fall into the second camp. I do think that sub-optimal design decisions also contribute to messiness in software systems, but I think those are inevitable because unexpected changes and time pressures are inescapable. This is the mess we’re in.

Why you can’t just ask “why”

Today, most AI work is based on neural networks, but back in the 1980s, AI researchers were using a different approach: they built rule-based systems using mathematical logic. This was the heyday of Lisp and Prolog, which were well suited to implementing these systems.

One approach AI researchers used was to sit down with an expert and elicit the rules they used to perform a task. For example, an AI researcher might conduct a series of interviews with a doctor in order to determine how the doctor diagnosed illnesses based on symptoms. The researcher would then encode those rules to build an expert system: a software package that would, ideally, perform tasks as well as an expert.

Alas, the results were disappointing: these expert systems never measured up to the performance of those human experts. Two brothers, Stuart Dreyfus (a professor of industrial engineering and operations research) and Hubert Dreyfus (a professor of philosophy), published a book in 1986 titled Mind Over Machine that described why this approach to building expert systems by eliciting and encoding rules from experts could never really work. It turns out that experts don’t actually solve problems by following a set of rules. Instead, they rely more on intuition and pattern-matching based on a repertoire of cases they’ve built up from their experience¹.

Yet, even though those experts didn’t solve problems by following rules, they were still able to articulate a set of rules that they claim to follow when asked. And they weren’t trying to deceive the AI researchers. Instead, something else was going on. The experts were inventing explanations without even being aware that they were doing so. Philosophers of mind use the term confabulation (technically broad confabulation) to refer to this phenomenon: how people will unknowingly fabricate explanations for their actions.

And therein lies the problem of asking “why”.

In the wake of an incident, we often want to understand why it is people did certain things: both for the people whose actions contributed to the incident (why did they make a global configuration change?) and for people whose actions mitigated the incident (why did they suspect a retry storm rather than a DDoS attack?)

The problem is, you can’t just ask people why, because people confabulate. You can, of course, simply ask people why they took the actions they did. Heck, you might even get a confident, articulated explanation. But you shouldn’t believe that the explanation they give corresponds to reality.

Yet, getting at the why is important. This is not a case of “‘Why?’ is the wrong question” the way that Five Whys style questions are. There is real value in understanding how people came to the decisions they did, by learning about the signals they received at the time, and how their previous experiences shaped their perspectives. That’s where having a skilled interviewer comes in.

A skilled interviewer will increase the chances of getting an accurate response by asking questions to bring the interviewee back into the frame of mind that they were in during the incident. Instead of asking for an engineer to explain their actions (Why did you do X?), they’ll ask questions to try to jog their memory of what they were experiencing during the incident: What were you doing when the page went off? Where did you look first? What did you see? And then what did you do? Because we know that experts do pattern-matching, they’ll also ask questions like, have you ever seen this symptom before? These questions can elicit responses about previous experiences they’ve had in similar situations, which can provide context on how they made their decisions in this case.

Eliciting this sort of information from an interview is hard, and it takes real skill. We should take this sort of work seriously.

¹ The field of research known as naturalistic decision making studies how experts make decisions.

Asking the right “why” questions

In the [Cognitive Systems Engineering] terminology, it is more important to understand what a joint cognitive system (JCS) does and why it does it, than to explain how it does it. [emphasis in the original]

Erik Hollnagel & David D. Woods, Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, p22

In my previous post, I linked to a famous essay by John Allspaw: The Infinite Hows (or, the Dangers Of The Five Whys). The main thrust of Allspaw’s essay can be summed up in this five-word excerpt:

“Why?” is the wrong question.

As illustrated by the quote from Hollnagel & Woods at the top of this post, it turns out that cognitive systems engineering (CSE) is very big on answering “why” questions. Allspaw’s perspective on incident analysis is deeply influenced by research from cognitive systems engineering. So what’s going on here?

It turns out that the CSE folks are asking different kinds of “why” questions than the root cause analysis (RCA) folks. The RCA folks ask why did this incident happen? The CSE folks ask why did the system adopt the sorts of behaviors that contributed to the incident?

Those questions may sound similar, but they start from opposite assumptions. The RCA folks start with the assumption that there’s some sort of flaw in the system, a vulnerability that was previously unknown, and then base their analysis on identifying what that vulnerability was.

The CSE folks, on the other hand, start with the assumption that behaviors exhibited by the system developed through adaptation to existing constraints. The “why” question here is “why is this behavior adaptive? What purpose does it serve in the system?” Then they base the analysis on identifying attributes of the system such as constraints and goal conflicts that would explain why this behavior is adaptive.

This is one of the reasons why the CSE folks are so interested in incidents to begin with: because they can expose these kinds of constraints and conflicts that are part of the context of a system. It’s similar to how psychologists use optical illusions to study the heuristics that the human visual system employs: you look at the circumstances under which a system fails to get some insight into how it normally functions as well as it does.

“Why” questions can be useful! But you’ve got to ask the right ones.