Thoughts on STAMP

STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs an annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering a Safer World, but I’d never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.

This post captures my impressions of STAMP after attending the workshop.

But before I can talk about STAMP, I have to explain the safety term hazards.

Hazards

Safety engineers work to prevent unacceptable losses. The word safety typically evokes losses such as human death or injury, but a loss can be anything deemed unacceptable (e.g., property damage, mission loss).

Engineers have control over the systems that they are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on the states of the system that could lead to a loss.

As an example, consider the scenario where an autonomous vehicle is driving very close to another vehicle immediately in front of it. An accident may or may not happen, depending on whether the vehicle in front slams on the brakes. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part of the environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that the minimum separation between the two cars is violated. Safety engineers work to identify, prevent, and mitigate hazards.

As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property in the formal methods sense says that some set of system states is guaranteed to be unreachable. (An example of a safety property would be: no two threads can be in the same critical section at the same time).
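To make that concrete, here’s a minimal sketch in Python (my own illustration, not something from the workshop) of the mutual exclusion property expressed as an invariant checked against a hypothetical trace of system states:

def mutual_exclusion(state):
    # Safety property: at most one thread is in the critical section.
    return sum(1 for location in state.values() if location == "critical") <= 1

# A hypothetical trace of states, mapping thread name to location.
trace = [
    {"t1": "idle", "t2": "idle"},
    {"t1": "waiting", "t2": "critical"},
    {"t1": "critical", "t2": "critical"},  # this state violates the property
]

for step, state in enumerate(trace):
    if not mutual_exclusion(state):
        print(f"safety violation at step {step}: {state}")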

The difference between hazards in the safety sense and safety properties in the software formal methods sense is:

  • a software formal methods safety property is always a software state
  • a safety hazard is never a software state

Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions, and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation of how the system got into a hazardous state; it’s not the hazard itself).

The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.


If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.

Control

In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers that interact with each other through control actions and feedback. Because the model is functional, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, or even an entire organization would each be modeled as a controller in the system description used for the hazard analysis.
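As a concrete (and deliberately toy) sketch of that functional view, here’s how I might model a controller in Python, reusing the separation-distance example from earlier; the class shape, the feedback field, and the 10-meter threshold are all made up for illustration rather than taken from any STAMP material:

class Controller:
    # A controller in the functional sense: it keeps a process model (its
    # beliefs about the controlled process), receives feedback, and issues
    # control actions. The same shape could represent a software subsystem,
    # a human operator, a team, or an entire organization.

    def __init__(self, name):
        self.name = name
        self.process_model = {}  # beliefs about the controlled process

    def receive_feedback(self, feedback):
        # Update beliefs based on feedback (e.g., sensor readings).
        self.process_model.update(feedback)

    def control_action(self):
        # Choose an action based on the (possibly stale or wrong) process
        # model; hazards often arise when that model diverges from reality.
        if self.process_model.get("separation_m", float("inf")) < 10.0:
            return "brake"
        return "maintain_speed"

av = Controller("autonomous-vehicle-controller")
av.receive_feedback({"separation_m": 5.0})
print(av.control_action())  # brake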

In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.

Design

STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP influences the entire lifecycle, the sense I got from Nancy Leveson’s talks was that accidents can usually be attributed to hazards that were not controlled for by the system designer.


Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.

Where I’m positive

I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.

I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.

Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.

I was surprised by how influenced Leveson was by some of Sidney Dekker’s work, given that she has a very different perspective than he does. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of the blame connotations.

I enjoyed Leveson’s perspective on probabilistic risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them. That’s very much in tune with my own thoughts on using metrics to identify problems rather than to assess how you’re doing. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.

I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.

Where I’m skeptical

STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.

I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates that the designer missed a hazard during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around operators’ uncertainty and judgment, because those likewise indicate flaws in the design’s control of hazards.

STAMP generates recommendations and requirements, and the resulting changes to the system can reverberate through it, introducing new hazards. While the workshop organizers were aware of this, it wasn’t as much of a first-class concept as I would have liked: the discussions tended to end at the recommendation level, with only a little attention paid to how the changes STAMP motivates can themselves introduce new hazards. It also wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change how people work?

Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.

Leveson is also very critical of Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I unrecognizable in her own experience of doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my ears.

While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST was too big on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).

During the workshop, I asked the organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident. They did not know of an example where that had happened, and that took me aback. They both insisted that STPA wasn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.

Outro

I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.

However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.

In praise of the Wild West engineer

If you put software engineers on a continuum with slow and cautious at one end and Wild West at the other end, I’ve always been on the slow and cautious side. But I’ve come to appreciate the value of the Wild West engineer.

Here’s an example of Wild West behavior: during some sort of important operational task (let’s say a failover), the engineer carrying it out sees a Python stack trace appear in the UI. Whoops, there’s a bug in the operational code! They ssh to the box, fix the code, and then run it again. It works!

This sort of behavior used to horrify me. How could you just hotfix the code by ssh’ing to the box and updating the code by hand like that? Do you know how dangerous that is? You’re supposed to PR, run tests, and deploy a new image!

But here’s the thing. During an actual incident, the engineers involved will have to take risky actions in order to remediate. You’ve got to poke and prod at the system in a way that’s potentially unsafe in order to (hopefully!) make things better. As Richard Cook wrote, all practitioner actions are gambles. The Wild West engineers are the ones with the most experience making these sorts of production changes, so they’re more likely to be successful at them in these circumstances.

I also don’t think that a Wild West engineer is someone who simply takes unnecessary risks. Rather, they have a well-calibrated sense of what’s risky and what isn’t. In particular, if something breaks, they know how to fix it. Once, years ago, during an incident, I made a global change to a dynamic property in order to speed up a remediation, and a Wild West engineer I knew clucked with disapproval. That was a stupid risk I took. You always stage your changes across regions!

Now, simply because the Wild West engineers have a well-calibrated sense of risk doesn’t mean their sense is always correct! David Woods notes that all systems are simultaneously well-adapted, under-adapted, and over-adapted. The Wild West engineer might miscalculate a risk. But I think it’s a mistake to dismiss Wild West engineers as simply irresponsible. While I’m still firmly the slow-and-cautious type, when everything is on fire, I’m happy to have the Wild West engineers around to take those dangerous remediation actions. Because if it ends up making things worse, they’ll know how to handle it. They’ve been there before.

Safe by design?

I’ve been enjoying the ongoing MIT STAMP workshop. In particular, I’ve been enjoying listening to Nancy Leveson talk about system safety. Leveson is a giant in the safety research community (and, incidentally, an author of my favorite software engineering study). She’s also a critic of root cause and human error as explanations for accidents. Despite this, she has a different perspective on safety than many in the resilience engineering community. To sharpen my thinking, I’m going to capture my understanding of this difference below.

From Leveson’s perspective, the engineering design should ensure that the system is safe. More specifically, the design should contain controls that eliminate or mitigate hazards. In this view, accidents are invariably attributable to design errors: a hazard in the system was not effectively controlled in the design.

By contrast, many in the resilience engineering community claim that design alone cannot ensure that the system is safe. The idea here is that the system design will always be incomplete, and the human operators must adapt their local work to make up for the gaps in the designed system. These adaptations usually contribute to safety, and sometimes contribute to incidents, and in post-incident investigations we often only notice the latter case.

These perspectives are quite different. Leveson believes that depending on human adaptation in the system is itself dangerous. If we’re depending on human adaptation to achieve system safety, then the design engineers have not done their jobs properly in controlling hazards. The resilience engineering folks believe that depending on human adaptation is inevitable, because of the messy nature of complex systems.

All we can do is find problems

I’m in the second week of the three week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe, after the design has been completed.

Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:

  • If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
  • Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
  • The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration

This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.

As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, the best we can do is to get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach can be this week while attending the STAMP workshop. STAMP is an approach to systems safety developed by Nancy Leveson.

The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.

Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.

What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I hadn’t deliberately taken a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

“How could they be so stupid?”

From the New York Times story on the recent Twitter hack:

Mr. O’Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. 

It’s too soon after this incident to put too much faith in the reporting, but let’s assume it’s accurate. A collective cry of “Posting credentials to a Slack channel? How could engineers at Twitter be so stupid?” rose up from the internet. It’s a natural reaction, but it’s not a constructive one.

I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?

The problem is that we on the outside can’t, because we don’t have enough information. This type of lapse is a classic example of a workaround. People in a system use workarounds (they do things the “wrong” way) when there are obstacles to doing things the “right” way.

There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.

I’m willing to bet that there are people in your organization that use workarounds. You probably use some yourself. Identifying those workarounds teaches us something about how the system works, and how people have to do things the “wrong” way to actually get their work done.

Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.

When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?” Because, unless you gain a deeper understanding of your system, the problems that motivated the workaround aren’t going to go away.

A reasonable system

Reasonable is an adjective we typically apply to humans, or something we implore of them (“Be reasonable!”). And, while I do want reasonable colleagues, what I really want is a reasonable system.

By reasonable system, I mean a system whose behavior I can reason about, both backwards and forwards in time. Given my understanding of how the system works, and the signals that are emitted by the system, I want to be able to understand its past behavior, and predict what its behavior is going to be in the future.

Who’s afraid of serializability?

Kyle Kingsbury’s Jepsen recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializable transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Here’s one of the counterexamples that Jepsen’s tool, Elle, found:

In this counterexample, there are two list objects, here named 1799 and 1798, which I’m going to call x and y. The examples use two list operations, append (denoted "a") and read (denoted "r").

Here’s my redrawing of the example. I’ve drawn all operations against x in blue and against y in red. Note that I’m using empty list ([]) instead of nil.

There are two transactions, which I’ve denoted T1 and T2, and each one involves operations on two list objects, denoted x and y. The lists are initially empty.
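In the same notation as the histories below, my reading of the diagram is that the two transactions contain roughly these operations (reconstructed from the orderings discussed next; the interleaving of the x and y operations within each transaction doesn’t matter for the argument):

T1:   x.append(2)      y.read() → []
T2:   x.read() → []    y.append(4)    y.append(5)    y.read() → [4, 5]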

For transactions run at the serializable isolation level, all of the operations in all of the transactions have to be consistent with some sequential ordering of the transactions. In this particular example, that means that all of the operations have to make sense assuming either:

  • all of the operations in T1 happened before all of the operations in T2
  • all of the operations in T2 happened before all of the operations in T1

Assume order: T1, T2

If we assume T1 happened before T2, then the operations for x are:

      x = [] 
T1:   x.append(2)
T2:   x.read() → []

This history violates the contract of a list: we’ve appended an element to a list but then read an empty list. It’s as if the append didn’t happen!

Assume order: T2, T1

If we assume T2 happened before T1, then the operations for y are:

      y = []
T2:   y.append(4)
      y.append(5)
      y.read() → [4, 5]
T1:   y.read() → []

This history violates the contract of a list as well: we read [4, 5] and then []. It’s as if the values disappeared!
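To double-check this, here’s a small Python sketch (mine, not part of the Jepsen analysis; the transaction contents are reconstructed from the orderings above) that replays the two transactions in both serial orders against ordinary Python lists and confirms that neither order is consistent with the observed reads:

# Each transaction is a list of (object, operation, value) tuples, where
# value is the argument for an append and the observed result for a read.
T1 = [("x", "append", 2), ("y", "read", [])]
T2 = [("x", "read", []), ("y", "append", 4), ("y", "append", 5), ("y", "read", [4, 5])]

def consistent(serial_order):
    state = {"x": [], "y": []}  # both lists start out empty
    for transaction in serial_order:
        for obj, op, value in transaction:
            if op == "append":
                state[obj].append(value)
            elif state[obj] != value:  # a read must observe the current state
                return False
    return True

print(consistent([T1, T2]))  # False: T2 reads x as [] after T1 appended 2
print(consistent([T2, T1]))  # False: T1 reads y as [] after T2 observed [4, 5]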

Kingsbury indicates that this pair of transactions is illegal by annotating the operations with arrows that show required orderings. The "rw" arrow means that the read operation at the tail of the arrow must be ordered before the write operation at the head of the arrow. If the arrows form a cycle, then the example violates serializability: there’s no possible ordering that can satisfy all of the arrows.

Serializability, linearizability, locality

This example is a good illustration of how serializability differs from linearizability. Linearizability is a consistency model that also requires operations to be consistent with some sequential ordering. However, linearizability is only about individual objects, whereas transactions refer to collections of objects.

(Linearizability also requires that if operation A happens before operation B in time, then operation A must take effect before operation B, and serializability doesn’t require that, but let’s put that aside for now).

The counterexample above is a linearizable history: we can order the operations such that they are consistent with the contracts of x and y. Here’s an example of a valid history, which is called a linearization:

x = []
y = []
x.read() → []
x.append(2)
y.read() → []
y.append(4)
y.append(5)
y.read() → [4, 5]

Note how the operations from the two transactions are interleaved. This interleaving is forbidden by transactional isolation, but the definition of linearizability does not take transactions into account.

This example demonstrates how it’s possible to have histories that are linearizable but not serializable.

We say that linearizability is a local property whereas serializability is not: by the definition of linearizability, we can determine whether a history is linearizable by looking at the histories of the individual objects (x, y). However, we can’t do that for serializability.

This mess we’re in

Most real software systems feel “messy” to the engineers who work on them. I’ve found that software engineers tend to fall into one of two camps on the nature of this messiness.

Camp 1: Problems with the design

One camp believes that the messiness is primarily related to sub-optimal design decisions. These design decisions might simply be poor decisions, or they might be because we aren’t able to spend enough time getting the design right.

My favorite example of this school of thought can be found in the text of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology:

The best way to cope with complexity is to avoid it. Programs that do a lot for us are inherently complex. But instead of surrendering to complexity, we must control it. We must keep our systems as simple as possible. Having two different kinds of drag-and-drop behavior is more complicated than having just one. And that is needless complexity.

We must keep our systems simple enough so we can understand them.
And the fundamental tool for doing this is the logic of mathematics.

Leslie Lamport, The Future of Computing: Logic or Biology

Camp 2: The world is messy

Engineers in the second camp believe that reality is just inherently messy, and that mess ends up being reflected in software systems that have to model the world. Rich Hickey describes this in what he calls “situated programs” (emphasis mine):

And they [situated programs] deal with real-world irregularity. This is the other thing I think that’s super-critical, you know, in this situated programming world. It’s never as elegant as you think, the real-world.

And I talked about that scheduling problem of, you know, those linear times, somebody who listens all day, and the somebody who just listens while they’re driving in the morning and the afternoon. Eight hours apart there’s one set of people and, then an hour later there’s another set of people, another set. You know, you have to think about all that time. You come up with this elegant notion of multi-dimensional time and you’d be like, “oh, I’m totally good…except on Tuesday”. Why? Well, in the U.S. on certain kinds of genres of radio, there’s a thing called “two for Tuesday”. Right? So you built this scheduling system, and the main purpose of the system is to never play the same song twice in a row, or even pretty near when you played it last. And not even play the same artist near when you played the artist, or else somebody’s going to say, “all you do is play Elton John, I hate this station”.

But on Tuesday, it’s a gimmick. “Two for Tuesday” means, every spot where we play a song, we’re going to play two songs by that artist. Violating every precious elegant rule you put in the system. And I’ve never had a real-world system that didn’t have these kinds of irregularities. And where they weren’t important.

Rich Hickey, Effective Programs: 10 Years of Clojure

It will come as no surprise to readers of this blog that I fall into the second camp. I do think that sub-optimal design decisions also contribute to messiness in software systems, but I think those are inevitable because unexpected changes and time pressures are inescapable. This is the mess we’re in.

Asking the right “why” questions

In the [Cognitive Systems Engineering] terminology, it is more important to understand what a joint cognitive system (JCS) does and why it does it, than to explain how it does it. [emphasis in the original]

Erik Hollnagel & David D. Woods, Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, p22

In my previous post, I linked to a famous essay by John Allspaw: The Infinite Hows (or, the Dangers Of The Five Whys). The main thrust of Allspaw’s essay can be summed up in this five word excerpt:

“Why?” is the wrong question.

As illustrated by the quote from Hollnagel & Woods at the top of this post, it turns out that cognitive systems engineering (CSE) is very big on answering “why” questions. Allspaw’s perspective on incident analysis is deeply influenced by research from cognitive systems engineering. So what’s going on here?

It turns out that the CSE folks are asking different kinds of "why" questions than the root cause analysis (RCA) folks. The RCA folks ask why did this incident happen? The CSE folks ask why did the system develop the sorts of behaviors that contributed to the incident?

Those questions may sound similar, but they start from opposite assumptions. The RCA folks start with the assumption that there’s some sort of flaw in the system, a vulnerability that was previously unknown, and then base their analysis on identifying what that vulnerability was.

The CSE folks, on the other hand, start with the assumption that behaviors exhibited by the system developed through adaptation to existing constraints. The “why” question here is “why is this behavior adaptive? What purpose does it serve in the system?” Then they base the analysis on identifying attributes of the system such as constraints and goal conflicts that would explain why this behavior is adaptive.

This is one of the reasons why the CSE folks are so interested in incidents to begin with: because it can expose these kinds of constraints and conflicts that are part of the context of a system. It’s similar to how psychologists use optical illusions to study the heuristics that the human visual system employs: you look at the circumstances under which a system fails to get some insight into how it normally functions as well as it does.

“Why” questions can be useful! But you’ve got to ask the right ones.