I’m in the second week of the three week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe, after the design has been completed.
Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:
- If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
- Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
- The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration
This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.
As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, the best we can do is to get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.