Safe by design?

I’ve been enjoying the ongoing MIT STAMP workshop, and in particular listening to Nancy Leveson talk about system safety. Leveson is a giant in the safety research community (and, incidentally, an author of my favorite software engineering study). She’s also a critic of root cause and human error as explanations for accidents. Despite this, she has a different perspective on safety than many in the resilience engineering community. To sharpen my thinking, I’m going to capture my understanding of that difference in this post.

From Leveson’s perspective, the engineering design should ensure that the system is safe. More specifically, the design should contain controls that eliminate or mitigate hazards. In this view, accidents are invariably attributable to design errors: a hazard in the system was not effectively controlled in the design.

By contrast, many in the resilience engineering community claim that design alone cannot ensure that the system is safe. The idea here is that the system design will always be incomplete, and the human operators must adapt their local work to make up for the gaps in the designed system. These adaptations usually contribute to safety, but sometimes they contribute to incidents, and in post-incident investigations we often notice only the latter.

These perspectives are quite different. Leveson believes that depending on human adaptation in the system is itself dangerous. If we’re depending on human adaptation to achieve system safety, then the design engineers have not done their jobs properly in controlling hazards. The resilience engineering folks believe that depending on human adaptation is inevitable, because of the messy nature of complex systems.

All we can do is find problems

I’m in the second week of the three week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe, after the design has been completed.

Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:

  • If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
  • Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
  • The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration

This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to tell you whether your system is reliable enough, because they can only reveal that you have a problem, not that you don’t. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.

As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is to study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, we should instead get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.
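As an aside, for anyone who hasn’t internalized what those nines amount to, here’s a rough, purely illustrative calculation (the numbers below don’t refer to any particular system) of the downtime budget that each availability target permits over a quarter:

```python
# Back-of-the-envelope arithmetic: how much downtime each availability
# target permits over a ~90-day quarter. Purely illustrative numbers.

QUARTER_SECONDS = 90 * 24 * 60 * 60

for nines in range(2, 6):
    availability = 1 - 10 ** -nines              # e.g. 3 nines -> 0.999
    downtime_minutes = QUARTER_SECONDS * (1 - availability) / 60
    print(f"{availability:.5%} available -> "
          f"{downtime_minutes:7.1f} minutes of downtime per quarter")
```

Whichever row describes your last quarter, it still tells you nothing about where the next problem is lurking.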

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach can be this week while attending the STAMP workshop. STAMP is an approach to system safety developed by Nancy Leveson.

The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.
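To make the control-system framing concrete, here’s a minimal sketch of my own (not something from the STAMP materials, and the component names are made up) showing how an agent can be described purely by the control actions it can issue and the feedback it can receive:

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """A functional view of an agent: what it can do and what it can see.

    The internals of the agent are deliberately ignored, so the same shape
    fits a physical device, a software service, a person, or an organization.
    """
    name: str
    control_actions: set[str] = field(default_factory=set)  # commands it can issue
    feedback: set[str] = field(default_factory=set)          # signals it receives


# Two levels of a hypothetical control structure: an engineer controlling a
# deployment pipeline.
deploy_pipeline = Controller(
    name="deployment pipeline",
    control_actions={"start rollout", "halt rollout", "roll back"},
    feedback={"canary error rate", "rollout status"},
)

on_call_engineer = Controller(
    name="on-call engineer",
    control_actions={"approve deploy", "trigger rollback"},
    feedback={"alerts", "dashboards"},
)
```

The point isn’t the code itself: it’s that the engineer and the pipeline are described with the same vocabulary, which is what lets the control-system metaphor treat hardware, software, humans, and organizations uniformly.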

Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.

What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I didn’t deliberately take a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

“How could they be so stupid?”

From the New York Times story on the recent Twitter hack:

Mr. O’Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. 

It’s too soon after this incident to put too much faith in the reporting, but let’s assume it’s accurate. A collective cry of “Posting credentials to a Slack channel? How could engineers at Twitter be so stupid?” rose up from the internet. It’s a natural reaction, but it’s not a constructive one.

I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?

The problem is that we on the outside can’t, because we don’t have enough information. This type of lapse is a classic example of a workaround. People in a system use workarounds (they do things the “wrong” way) when there are obstacles to doing things the “right” way.

There are countless reasons why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.

I’m willing to bet that there are people in your organization who use workarounds. You probably use some yourself. Identifying those workarounds teaches us something about how the system works, and how people have to do things the “wrong” way to actually get their work done.

Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.

When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?” Because, unless you gain a deeper understanding of your system, the problems that motivated the workaround aren’t going to go away.

A reasonable system

Reasonable is an adjective we typically apply to humans, or a quality we demand of them (“Be reasonable!”). And, while I do want reasonable colleagues, what I really want is a reasonable system.

By a reasonable system, I mean a system whose behavior I can reason about, both backwards and forwards in time. Given my understanding of how the system works, and the signals it emits, I want to be able to understand its past behavior and predict its future behavior.
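One way to picture that property is with a toy sketch (my own illustration, with hypothetical names and events): if the system’s state is a pure function of the signals it has emitted, then replaying the log lets me reason backwards about past behavior, and feeding in hypothetical signals lets me reason forwards about future behavior.

```python
from functools import reduce

def step(tokens: int, event: str) -> int:
    """Pure transition function: next state from current state and one signal."""
    if event == "request":
        return max(tokens - 1, 0)
    if event == "refill":
        return min(tokens + 10, 100)
    return tokens

# Signals the (toy) rate limiter emitted while running.
emitted = ["request", "request", "refill", "request"]

# Reasoning backwards: replay the emitted signals to recover past states.
history = [100]
for event in emitted:
    history.append(step(history[-1], event))

# Reasoning forwards: apply hypothetical future signals to predict behavior.
predicted = reduce(step, ["request", "refill"], history[-1])

print(history)    # [100, 99, 98, 100, 99]
print(predicted)  # 100
```

A system built out of opaque, stateful pieces whose behavior depends on details that never show up in the emitted signals is, in this sense, an unreasonable one.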