A couple of threads got me thinking about the limits of STAMP.
The first thread was sparked by a link to a Hacker News comment, sent to me by a colleague of mine, Danny Thomas. This introduced me to a concept I hadn’t heard of before: a battleshort. There’s even an official definition in a NATO document:
The capability to bypass certain safety features in a system to ensure completion of the mission without interruption due to the safety feature
AOP-38, Allied Ordnance Publication 38, Edition 3, Glossary of terms and definitions concerning the safety and suitability for service of munitions, explosives and related products, April 2002.
The second thread was sparked by a Twitter exchange between a UK Southern Railway train driver and the official UK Southern Railway Twitter account:
This is a great example of exapting, a concept introduced by the paleontologists Stephen Jay Gould and Elisabeth Vrba. Exaptation is a solution to the following problem in evolutionary biology: what good is a partially functional wing? Either an animal can fly or it can’t, and a fully functional wing can’t evolve in a single generation, so how do the initial evolutionary stages of a wing confer advantage on the organism?
The answer is that while a partially functional wing might be useless for flight, it might still be useful as a fin. And so, if wings evolved from fins, then the appendage could confer an advantage at every evolutionary stage. The fin is exapted into a wing; it is repurposed to serve a new function. In the Twitter example above, the train driver repurposed a social media service for communicating with his own organization.
Which brings us back to STAMP. One of the central assumptions of STAMP is that it is possible, at the design stage, to construct a control model of the system that is accurate enough to identify all of the hazards and unsafe control actions. You can see this assumption in action in the CAST handbook (CAST is STAMP’s accident analysis process), in the example questions from page 40 of the handbook (emphasis mine), which use counterfactual reasoning to try to identify flaws in the original hazard analysis.
Did the design account for the possibility of this increased pressure? If not, why not? Was this risk assessed at the design stage?
This seems like a predictable design flaw. Was the unsafe interaction between the two requirements (preventing liquid from entering the flare and the need to discharge gases to the flare) identified in the design or hazard analysis efforts? If so, why was it not handled in the design or in operational procedures? If it was not identified, why not?
Why wasn’t the increasing pressure detected and handled? If there were alerts, why did they not result in effective action to handle the increasing pressure? If there were automatic overpressurization control devices (e.g., relief valves), why were they not effective? If there were not automatic devices, then why not? Was it not feasible to provide them?
Was this type of pressure increase anticipated? If it was anticipated, then why was it not handled in the design or operational procedures? If it was not anticipated, why not?
Was there any way to contain the contents within some controlled area (barrier), at least the catalyst pellets?
Why was the area around the reactor not isolated during a potentially hazardous operation? Why was there no protection against catalyst pellets flying around?
This line of reasoning assumes that all hazards are, in principle, identifiable at the design stage. I think that phenomena like battleshorts and exaptations make this goal unattainable.
Now, in principle, nothing prevents an engineer using STPA (STAMP’s hazard analysis technique) from identifying scenarios that involve battleshorts and exaptations. After all, STPA is an exploratory technique. But I suspect that many of these kinds of adaptations are literally unimaginable to the designers.
Missing here is the idea of normalization of deviance: using the battleshorts during ordinary operations.
Unfortunately, Vaughan’s label encourages a basic misconception: deviance relative to what? The sharp end is not deviating; they are adapting to make a degraded system work under pressure. This leaves management, and even operators, open to keeping going under degraded conditions rather than saying enough, slowing or stopping operations until the degraded condition is resolved. In other words: we cut corners, nothing happened, so we can cut more corners. The textbook case is the BP Texas City disaster. Feynman’s dissent in the Challenger report famously contains the critique that just because the system didn’t fail under degraded conditions does not mean the system is safe; one has to understand the underlying system and mechanisms at work.

The deviation is from the organization’s model of safe operations, which continues to deviate from reality: work-as-imagined (WAI) versus work-as-done (WAD). The correction is the work of continuously seeking feedback and updating models rather than discounting signs of anomalies. The label “deviance” only applies from an outside-in perspective, with the benefit of hindsight. The mechanisms include the sacrifice judgment/trade-off, the law of fluency, operators as an ad hoc source of adaptive capacity for messy systems, the gap between WAI and WAD, recalibrating and updating models of how the system works, and other regularities and mechanisms uncovered in Resilience Engineering. Normalization of deviance points in a direction, but much more is needed to see these mechanisms and how they combine.