STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering A Safer World, but I’ve never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.
This post captures my impressions of STAMP after attending the workshop.
But before I can talk about STAMP, I have to explain the safety term hazards.
Safety engineers work to prevent unacceptable losses. The word safety typically invokes losses such as human death or injury, but it can be anything (e.g., property damage, mission loss).
Engineers have control over the systems that are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on states that the system could be in that could lead to a loss.
As an example, consider the scenario where an autonomous vehicle is is driving very close to another vehicle immediately in front of it. An accident may or may not happen depending on whether the vehicle in front slams on the breaks. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part the an environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that minimum separation between the two cars is violated. Safety engineers work to identify, prevent, and mitigate hazards.
As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property in the formal methods sense is a system state that we want to guarantee is unreachable. (An example of a safety property would be: no two threads can be in the same critical section at the same time).
The difference between hazards in the safety sense and safety properties in the software formal methods sense is:
- a software formal methods safety property is always a software state
- a safety hazard is never a software state
Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions: and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation for how the system got into a hazardous state, it’s not the hazard).
The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.
If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.
In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers which interact with each other through control actions and feedback. Because it’s a functional model, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, even an entire organization, each would be modeled as a controller in a description of the system used for doing the hazard analysis.
In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.
STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP influences the entire lifecycle, the sense I got from Nancy Leveon’s talks was that accidents can usually be attributed to hazards that were not controlled for by the system designer.
Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.
Where I’m positive
I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.
I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.
Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.
I was surprised how influenced Leveson was by some of Sidney Dekker’s work, given that she has a very different perspective than him. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of the blame connotations.
I enjoyed Leveson’s perspective on probability risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them.Very much in tune with my own thoughts on using metrics to assess how you’re doing versus identifying problems. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.
I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.
Where I’m skeptical
STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.
I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates that there was a hazard missed by the designer during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around uncertainty and human judgment of operators, because those likewise indicates flaws in the design to control the hazards.
STAMP generates recommendations and requirements, and these will change the system, and these can reverberate through the system, introducing new hazards. While the conference organizers were aware of this, it wasn’t as much of a first class concept as I would have liked: the discussions in the workshop tended to end at the recommendation level. There was a little bit of discussion about how to deal with the fact that STAMP can introduce new hazards, but I would have liked to have seen more. It wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change how people work?
Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.
Leveson is also very critical on Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I as unrecognizable in her own experiences doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my areas.
While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST was too big on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).
During the workshop, I asked if the workshop organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident, and they did not know of an example where that had happened, and that took me aback. They both insisted that STPA wasn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.
I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.
However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.