Abstractions and implicit preconditions

One of my favorite essays by Joel Spolsky is The Law of Leaky Abstractions. He phrases the law as:

All non-trivial abstractions, to some degree, are leaky.

One of the challenges with abstractions is that they depend upon preconditions: the world has to be in a certain state for the abstraction to hold. Sometimes the consumer of the abstraction is explicitly aware of the precondition, but sometimes they aren’t. After all, the advantage of an abstraction is that it hides information. NFS allows a user to access files stored remotely without having to know networking details. Except when there’s some sort of networking problem, and the user is completely flummoxed. The advantage of not having to know how NFS works has become a liability.

The problem of implicit preconditions is everywhere in complex systems. We are forever consuming abstractions that have a set of preconditions that must be true for the abstraction to work correctly. Poke at an incident, and you’ll almost always find an implicit precondition. Something we didn’t even know about, that always has to be true, that was always true, until now.

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

A nudge in the right direction

I was involved in an operational surprise a few weeks ago where some of my colleagues, while not directly involved in handling the incident, nudged us in directions that helped with quick remediation.

In one case, a colleague suggested moving the discussion into a different Slack channel, and in another case, a colleague focused the attention on a potential trigger: some newly inserted database records.

I also remember another operational surprise where an experienced engineer asked someone in Slack, “Hey, there’s a new person on our team, can you explain what X means?”, and the response kicked off a series of events that brought in someone else who had more context, which led to the surprise being remediated much more quickly.

These sorts of nudges fly under our radar, and so they’re easy to miss. But they can make the difference between an operational surprise with no customer impact and a multi-hour outage, and they can be contingent on the right person happening to be in the right Slack channel at the right time and seeing the right message.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Thoughts on STAMP

STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs an annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering A Safer World, but I’ve never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.

This post captures my impressions of STAMP after attending the workshop.

But before I can talk about STAMP, I have to explain the safety term hazards.

Hazards

Safety engineers work to prevent unacceptable losses. The word safety typically evokes losses such as human death or injury, but a loss can be anything deemed unacceptable (e.g., property damage, mission loss).

Engineers have control over the systems they are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on states the system could get into that could lead to a loss.

As an example, consider the scenario where an autonomous vehicle is driving very close to another vehicle immediately in front of it. An accident may or may not happen, depending on whether the vehicle in front slams on the brakes. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part of the environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that the minimum separation between the two vehicles is violated. Safety engineers work to identify, prevent, and mitigate hazards.
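
To make the state-based framing concrete, here’s a minimal sketch (in Python, with invented names and a made-up threshold, not taken from any safety standard) of the hazard expressed as a predicate over the system’s own state, which the designer can monitor and control regardless of what the other driver does:

```python
# Hypothetical sketch: a hazard is a condition on the system's own state.
MIN_SEPARATION_M = 10.0  # made-up minimum safe following distance

def separation_hazard(separation_m: float) -> bool:
    """True when the minimum-separation hazard is present.

    The hazard depends only on the system state. Whether it becomes an
    accident depends on the environment (e.g., whether the lead vehicle
    brakes), which the designer cannot control.
    """
    return separation_m < MIN_SEPARATION_M

def commanded_speed(separation_m: float, speed_mps: float) -> float:
    """A control the designer *can* build: slow down to exit the hazardous state."""
    if separation_hazard(separation_m):
        return max(0.0, speed_mps - 1.0)  # back off until separation is restored
    return speed_mps
```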

As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property, in the formal methods sense, asserts that a particular bad system state can never be reached. (An example of a safety property would be: no two threads can be in the same critical section at the same time.)
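
As a rough illustration of that formal-methods framing, here’s a small, self-contained sketch that exhaustively explores the states of a toy two-thread lock model and checks that the bad state (both threads in the critical section) is unreachable. The model and names are mine, invented for illustration; real tools work on much richer specifications:

```python
from collections import deque

# Toy illustration of a formal-methods safety property: explore every
# reachable state of a two-thread lock model and assert that no state
# has both threads in the critical section.

IDLE, TRYING, CRITICAL = "idle", "trying", "critical"

def replace(pcs, tid, value):
    new = list(pcs)
    new[tid] = value
    return tuple(new)

def successors(state):
    """Yield the states reachable in one step from `state`."""
    pcs, lock_holder = state
    for tid in (0, 1):
        if pcs[tid] == IDLE:
            yield (replace(pcs, tid, TRYING), lock_holder)
        elif pcs[tid] == TRYING and lock_holder is None:
            yield (replace(pcs, tid, CRITICAL), tid)   # acquire the lock
        elif pcs[tid] == CRITICAL:
            yield (replace(pcs, tid, IDLE), None)      # release the lock

def check_mutual_exclusion():
    initial = ((IDLE, IDLE), None)
    seen, frontier = {initial}, deque([initial])
    while frontier:
        pcs, holder = frontier.popleft()
        # Safety property: the "both critical" state is unreachable.
        assert not all(pc == CRITICAL for pc in pcs), (pcs, holder)
        for nxt in successors((pcs, holder)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    print(f"explored {len(seen)} states; mutual exclusion holds")

check_mutual_exclusion()
```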

The difference between hazards in the safety sense and safety properties in the software formal methods sense is:

  • a software formal methods safety property is always a software state
  • a safety hazard is never a software state

Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions, and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation for how the system got into a hazardous state; it’s not the hazard itself.)

The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.


If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.

Control

In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers that interact with each other through control actions and feedback. Because it’s a functional model, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, even an entire organization would each be modeled as a controller in the description of the system used for the hazard analysis.
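
Here’s a rough sketch, using invented names rather than actual STAMP/STPA notation, of what writing down such a functional model might look like: each agent, whether software or human, is described only by the control actions it can issue and the feedback it receives:

```python
from dataclasses import dataclass

# Hypothetical sketch of a control-structure description. Every agent --
# software subsystem, human operator, team, organization -- is modeled
# the same way: by its control actions and the feedback it receives.

@dataclass
class Controller:
    name: str
    control_actions: list[str]
    feedback: list[str]

control_structure = [
    Controller(
        name="on-call engineer",                   # a human modeled as a controller
        control_actions=["approve deploy", "roll back", "disable feature flag"],
        feedback=["dashboards", "alerts", "customer reports"],
    ),
    Controller(
        name="deployment service",                 # a software subsystem
        control_actions=["push new version", "halt rollout"],
        feedback=["health checks", "error rates"],
    ),
]

# With the structure written down, the hazard analysis walks each control
# action and asks how it could contribute to a hazard (e.g., issued when
# it shouldn't be, not issued when needed, issued too early or too late).
for controller in control_structure:
    for action in controller.control_actions:
        print(f"{controller.name}: examine unsafe variants of '{action}'")
```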

In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.

Design

STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP influences the entire lifecycle, the sense I got from Nancy Leveson’s talks was that accidents can usually be attributed to hazards that the system designer did not control for.


Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.

Where I’m positive

I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.

I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.

Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.

I was surprised by how much Leveson has been influenced by some of Sidney Dekker’s work, given that her perspective is very different from his. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of its blame connotations.

I enjoyed Leveson’s perspective on probabilistic risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them. This is very much in tune with my own thoughts on using metrics to assess how you’re doing versus identifying problems. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.

I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.

Where I’m skeptical

STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.

I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates a hazard the designer missed during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around operator uncertainty and human judgment, because those likewise indicate flaws in the design for controlling the hazards.

STAMP generates recommendations and requirements, and these change the system; those changes can reverberate through the system, introducing new hazards. While the conference organizers were aware of this, it wasn’t as much of a first-class concept as I would have liked: the discussions in the workshop tended to end at the recommendation level. There was a little bit of discussion about how to deal with the fact that STAMP can introduce new hazards, but I would have liked to have seen more. It also wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change the way people work?

Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.

Leveson is also very critical of Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I unrecognizable from her own experience doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my ears.

While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST was too big on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).

During the workshop, I asked the organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident. They did not know of an example where that had happened, which took me aback. They both insisted that STPA wasn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.

Outro

I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.

However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.

Soak time

Over the past few weeks, I’ve had the experience multiple times where I’m communicating with someone in a semi-synchronous way (e.g., Slack, Twitter), and I respond to them without having properly understood what they were trying to communicate.

In one instance, I figured out my mistake during the conversation, and in another instance, I didn’t fully get it until after the conversation had completed, and I was out for a walk.

In these circumstances, I find that I’m primed to respond based on my expectations, which makes me likely to misinterpret. The other person is rarely trying to communicate what I’m expecting them to. Too often, I’m simply waiting for my turn to talk instead of really listening to what they are trying to say.

It’s tempting to blame this on Slack or Twitter, but I think this principle applies to all synchronous or semi-synchronous communication, including face-to-face conversations. I’ve certainly experienced it in technical interviews, where my brain is always primed to think, “What answer is the interviewer looking for?”

John Allspaw uses the term soak time to refer to the additional time it takes us to process the information we’ve received in a post-incident review meeting, so we can make better decisions about what the next steps are. I think it describes this phenomenon well.

Whether you call it soak time, l’esprit de l’escalier, or hammock-driven development, keep in mind that it takes time for your brain to process information. Give yourself permission to take that time. Insist on it.

In praise of the Wild West engineer

If you put software engineers on a continuum with slow and cautious at one end and Wild West at the other end, I’ve always been on the slow and cautious side. But I’ve come to appreciate the value of the Wild West engineer.

Here’s an example of Wild West behavior: during some sort of important operational task (let’s say a failover), the engineer carrying it out sees a Python stack trace appear in the UI. Whoops, there’s a bug in the operational code! They ssh to the box, fix the code, and then run it again. It works!

This sort of behavior used to horrify me. How could you just hotfix the code by ssh’ing to the box and updating the code by hand like that? Do you know how dangerous that is? You’re supposed to PR, run tests, and deploy a new image!

But here’s the thing. During an actual incident, the engineers involved will have to take risky actions in order to remediate. You’ve got to poke and prod at the system in a way that’s potentially unsafe in order to (hopefully!) make things better. As Richard Cook wrote, all practitioner actions are gambles. The Wild West engineers are the ones with the most experience making these sorts of production changes, so they’re more likely to be successful at them in these circumstances.

I also don’t think that a Wild West engineer is someone who simply takes unnecessary risks. Rather, they have a well-calibrated sense of what’s risky and what isn’t. In particular, if something breaks, they know how to fix it. Once, years ago, during an incident, I made a global change to a dynamic property in order to speed up a remediation, and a Wild West engineer I knew clucked with disapproval. That was a stupid risk I took. You always stage your changes across regions!

Now, simply because the Wild West engineers have a well-calibrated sense of risk doesn’t mean their sense is always correct! David Woods notes that all systems are simultaneously well-adapted, under-adapted, and over-adapted. The Wild West engineer might miscalculate a risk. But I think it’s a mistake to dismiss Wild West engineers as simply irresponsible. While I’m still firmly the slow-and-cautious type, when everything is on fire, I’m happy to have the Wild West engineers around to take those dangerous remediation actions. Because if one of those actions ends up making things worse, they’ll know how to handle it. They’ve been there before.

Safe by design?

I’ve been enjoying the ongoing MIT STAMP workshop. In particular, I’ve been enjoying listening to Nancy Leveson talk about system safety. Leveson is a giant in the safety research community (and, incidentally, an author of my favorite software engineering study). She’s also a critic of root cause and human error as explanations for accidents. Despite this, she has a different perspective on safety than many in the resilience engineering community. To sharpen my thinking, I’m going to capture my understanding of this difference below.

From Leveson’s perspective, the engineering design should ensure that the system is safe. More specifically, the design should contain controls that eliminate or mitigate hazards. In this view, accidents are invariably attributable to design errors: a hazard in the system was not effectively controlled in the design.

By contrast, many in the resilience engineering community claim that design alone cannot ensure that the system is safe. The idea here is that the system design will always be incomplete, and the human operators must adapt their local work to make up for the gaps in the designed system. These adaptations usually contribute to safety, and sometimes contribute to incidents, and in post-incident investigations we often only notice the latter case.

These perspectives are quite different. Leveson believes that depending on human adaptation in the system is itself dangerous. If we’re depending on human adaptation to achieve system safety, then the design engineers have not done their jobs properly in controlling hazards. The resilience engineering folks believe that depending on human adaptation is inevitable, because of the messy nature of complex systems.

All we can do is find problems

I’m in the second week of the three-week virtual MIT STAMP workshop. Today, Prof. Nancy Leveson gave a talk titled Safety Assurance (Safety Case): Is it Possible? Feasible? Safety assurance refers to the act of assuring that a system is safe after the design has been completed.

Leveson is a skeptic of evaluating the safety of a system. Instead, she argues for focusing on generating safety requirements at the design stage so that safety can be designed in, rather than doing an evaluation post-design. (You can read her white paper for more details on her perspective). Here are the last three bullets from her final slide:

  • If you are using hazard analysis to prove your system is safe, then you are using it wrong and your goal is futile
  • Hazard analysis (using any method) can only help you find problems, it cannot prove that no problems exist
  • The general problem is in setting the right psychological goal. It should not be “confirmation,” but exploration

This perspective resonated with me, because it matches how I think about availability metrics. You can’t use availability metrics to inform you about whether your system is reliable enough, because they can only tell you if you have a problem. If your availability metrics look good, that doesn’t tell you anything about how to spend your engineering resources on reliability.

As Leveson remarked about safety, I think the best we can do in our non-safety-critical domains is study our systems to identify where the potential problems are, so that we can address them. Since we can’t actually quantify risk, the best we can do is to get better at identifying systemic issues. We need to always be looking for problems in the system, regardless of how many nines of availability we achieved last quarter. After all, that next major outage is always just around the corner.

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach is this week while attending the STAMP workshop. STAMP is an approach to systems safety developed by Nancy Leveson.

The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.

Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.

What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I hadn’t deliberately taken a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

“How could they be so stupid?”

From the New York Times story on the recent Twitter hack:

Mr. O’Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. 

It’s too soon after this incident to put too much faith in the reporting, but let’s assume it’s accurate. A collective cry of “Posting credentials to a Slack channel? How could engineers at Twitter be so stupid?” rose up from the internet. It’s a natural reaction, but it’s not a constructive one.

I don’t personally know any engineers at Twitter, but I have confidence that they have excellent engineers over there, including excellent security folks. So, how do we explain this seemingly obvious security lapse?

The problem is that we on the outside can’t, because we don’t have enough information. This type of lapse is a classic example of a workaround. People in a system use workarounds (they do things the “wrong” way) when there are obstacles to doing things the “right” way.

There are countless possibilities for why people employ workarounds. Maybe some system that’s required for doing it the “right” way is down for some reason, or maybe it simply takes too long or is too hard to do things the “right” way. Combine that with production pressures, and a workaround is born.

I’m willing to bet that there are people in your organization who use workarounds. You probably use some yourself. Identifying those workarounds teaches us something about how the system works, and how people have to do things the “wrong” way to actually get their work done.

Some workarounds, like the Twitter example, are dangerous. But simply observing “they shouldn’t have done that” does nothing to address the problems in the system that motivated the workaround in the first place.

When you see a workaround, don’t ask “how could they be so stupid to do things the obviously wrong way?” Instead, ask “what are the properties of our system that contributed to the development of this workaround?” Because, unless you gain a deeper understanding of your system, the problems that motivated the workaround aren’t going to go away.

A reasonable system

Reasonable is an adjective we typically apply to humans, or something we implore of them (“Be reasonable!”). And, while I do want reasonable colleagues, what I really want is a reasonable system.

By reasonable system, I mean a system whose behavior I can reason about, both backwards and forwards in time. Given my understanding of how the system works, and the signals that are emitted by the system, I want to be able to understand its past behavior, and predict what its behavior is going to be in the future.