Taming complexity: from contract to compact

The software contract

We software engineers love the metaphor of the contract when describing software behavior: If I give you X, you promise to give me Y in return. One example of a contract is the signature of a function in a statically typed language. Here’s a function signature in the Kotlin programming language:

fun exportArtifact(exportable: Exportable): DeliveryArtifact

This signature promises that if you call the exportArtifact function with an argument of type Exportable, the return value will be an object of type DeliveryArtifact.

Function signatures are a special case of software contracts, in that they can be enforced mechanically: the compiler guarantees that the contract will hold for any program that compiles successfully. In general, though, the software contracts that we care about can’t be mechanically checked. For example, we might talk about a contract that a particular service provides, but we don’t have tools that can guarantee that our service conforms to the contract. That’s why we have to test it.

Contracts are a type of specification: they tell us that if certain preconditions are met, the system described by the contract guarantees that certain postconditions will be met in return. The idea of reasoning about the behavior of a program using preconditions and postconditions was popularized by C.A.R. Hoare in his legendary paper An Axiomatic Basis for Computer Programming, and is known today as Hoare logic. The language of contract in the software engineering sense was popularized by Bertrand Meyer (specifically, design by contract) in his language Eiffel and his book Object-Oriented Software Construction.
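
To make the precondition/postcondition idea concrete, here is a minimal Kotlin sketch in the spirit of design by contract. The field names, the dependency limit, and the function body are invented for illustration; Kotlin’s require and check simply raise exceptions at runtime rather than being verified by a compiler or prover.

// Hypothetical types, loosely echoing the signature above.
data class Exportable(val name: String, val dependencies: List<String>)
data class DeliveryArtifact(val name: String, val contents: String)

fun exportArtifact(exportable: Exportable): DeliveryArtifact {
    // Precondition: the caller must not exceed the (invented) dependency limit.
    require(exportable.dependencies.size <= 10) {
        "precondition violated: too many dependencies"
    }
    val artifact = DeliveryArtifact(exportable.name, "serialized form of ${exportable.name}")
    // Postcondition: the function promises a non-empty artifact in return.
    check(artifact.contents.isNotEmpty()) {
        "postcondition violated: artifact is empty"
    }
    return artifact
}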

We software engineers like contracts because they help us reason about the behavior of a system. Instead of requiring us to understand the complete details of a system that we interact with, all we need to do is understand the contract.

For a given system, it’s easier to reason about its behavior from a contract than from its implementation details.

Contracts, therefore, are a form of abstraction. In addition, contracts are composable: we can feed the outputs of system X into system Y if the postconditions of X are consistent with the preconditions of Y. Because we can compose contracts, we can use them to help us build systems out of parts that are described by contracts. Contracts are a tool that enables us humans to work together to build software systems that are too complex for any individual human to understand.
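
As a tiny illustration of composition, with hypothetical systems modeled as Kotlin functions, the pipeline below compiles only because the output type of systemX matches the input type of systemY; the type checker is mechanically verifying that one contract's postcondition lines up with the next contract's precondition.

// Hypothetical systems X and Y, modeled as functions with typed contracts.
fun systemX(input: Int): String = input.toString()       // promises to return a String
fun systemY(input: String): Boolean = input.isNotEmpty() // requires a String as input

// Composition type-checks: systemX's postcondition (a String) satisfies
// systemY's precondition (needs a String).
fun pipeline(input: Int): Boolean = systemY(systemX(input))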

When contracts aren’t useful

Alas, contracts aren’t much use for reasoning about system behavior when either of the following two conditions holds:

  1. A system’s implementation doesn’t fully conform to its contract.
  2. The precondition of a system’s contract is violated by a client.

An example of a contract where a precondition (number of allowed dependencies) was violated

Whether a problem falls into the first or second condition is a judgment call. Either way, your system is now in a bad state.

A contract is of no use for a system that has gotten into a bad state.

A system that has gotten into a bad state is violating its contract, pretty much by definition. This means we must now deal with the implementation details of the system in order to get it back into a good state. Since no one person understands the entire system, we often need the help of multiple people to get the system back into a good state.

Operational surprises often require that multiple engineers work together to get the system back into a good state

Since contracts can’t help us here, we deal with the complexity by leveraging the fact that different engineers have expertise in different parts of the system. By working together, we are pooling the expertise of the engineers. To pull this off, the engineers need to coordinate effectively. Enter the Basic Compact.

The Basic Compact and requirements of coordination

Gary Klein, Paul Feltovich and David Woods defined the Basic Compact in their paper Common Ground and Coordination in Joint Activity:

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment
for all parties to support the process of coordination. The Basic Compact is an agreement
(usually tacit) to participate in the joint activity and to carry out the required coordination
responsibilities.

One example of a joint activity is… when engineers assemble to resolve an incident! In doing so, they enter a Basic Compact: to work together to get the system back into a stable state. Working together on a task requires coordination, and the paper authors list three primary requirements for coordinating effectively on a joint activity: interpredictability, common ground, and directability.

The Basic Compact is also a commitment to ensure a reasonable level of interpredictability. Moreover, the Basic Compact requires that if one party intends to drop out of the joint activity, he or she must inform the other parties.

Interpredictability is about being able to reason about the behavior of other people, and behaving in such a way that your behavior is reasonable to others. As with the world of software contracts, being able to reason about behavior is critical. Unlike software contracts, here we are reasoning about agents rather than artifacts, and those agents are also reasoning about us.

The Basic Compact includes an expectation that the parties will repair faulty knowledge, beliefs and assumptions when these are detected.

Each engineer involved in resolving an incident holds beliefs about both the system state and the beliefs of other engineers involved. Keeping mutual beliefs up to date requires coordination work.

During an incident, the responders need to maintain a shared understanding about information such as the known state of the system and what mitigations people are about to attempt. The authors use the term common ground to describe this shared understanding. Anyone who has been in an on-call rotation will find the following description familiar:

All parties have to be reasonably confident that they and the others will carry out their
responsibilities in the Basic Compact. In addition to repairing common ground, these
responsibilities include such elements as acknowledging the receipt of signals, transmitting some
construal of the meaning of the signal back to the sender, and indicating preparation for
consequent acts.

Maintaining common ground during an incident takes active effort on the part of the participants, especially when we’re physically distributed and the situation is dynamic: the system is not only in a bad state, but in a bad state that’s changing over time. Misunderstandings can creep in, which the authors describe as a common ground breakdown that requires repair to make progress.

A common ground breakdown can mean the difference between a resolution time of minutes and hours. I recall an incident I was involved with, where an engineer made a relevant comment in Slack early on during the incident, and I missed its significance in the moment. In retrospect, I don’t know if the engineer who sent the message realized that I hadn’t properly processed its implications at the time.

Directability refers to deliberate attempts to modify the actions of the other partners as conditions and priorities change.

Imagine a software system has gone unhealthy in one geographical region, and engineer X begins to execute a failover to remediate. Engineer Y notices customer impact in the new region, and types into Slack, “We’re now seeing a problem in the region we’re failing into! Abort the failover!” This is an example of directability, which describes the ability of one agent to affect the behavior of another agent through signaling.

Making contracts and compacts first class

Both contracts and compacts are tools to help deal with complexity. People use contracts to help reason about the behavior of software artifacts. People use the Basic Compact to help reason about each other’s behavior when working together to resolve an incident.

I’d like to see both contracts and compacts get better treatment as first-class concerns. For contracts, there still isn’t a mainstream language with first-class support for preconditions and postconditions, although some non-Eiffel languages do support them (Clojure and D, for example). There’s also Pact, which bills itself as a contract testing tool; it sounds interesting, but I haven’t had a chance to play with it.

For coordination (compacts), I’d like to see explicit recognition of the difficulty of coordination and the significant role it plays during incidents. One of the positive outcomes of the growing popularity of resilience engineering and the Learning from Incidents in Software movement is the recognition that coordination is a critical activity that we should spend more time learning about.

Further reading and watching

Common Ground and Coordination in Joint Activity is worth reading in its entirety. I only scratched the surface of the paper in this post. John Allspaw gave a great Papers We Love talk on this paper.

Laura Maguire has done some recent PhD work on managing the hidden costs of coordination. She also gave a talk at QCon on the subject.

Ten challenges for making automation a “team player” in joint human-agent activity is a paper that explores the implications of building software agents that are capable of coordinating effectively with humans.

An Axiomatic Basis for Computer Programming is worth reading to get a sense of the history of preconditions and postconditions. Check out Jean Yang’s Papers We Love talk on it.


Thoughts on STAMP

STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs an annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering A Safer World, but I’ve never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.

This post captures my impressions of STAMP after attending the workshop.

But before I can talk about STAMP, I have to explain the safety term hazards.

Hazards

Safety engineers work to prevent unacceptable losses. The word safety typically evokes losses such as human death or injury, but a loss can be anything deemed unacceptable (e.g., property damage, mission loss).

Engineers have control over the systems that they are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on states that the system could be in that could lead to a loss.

As an example, consider the scenario where an autonomous vehicle is driving very close to another vehicle immediately in front of it. An accident may or may not happen depending on whether the vehicle in front slams on the brakes. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part of the environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that the minimum separation between the two cars is violated. Safety engineers work to identify, prevent, and mitigate hazards.

As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property in the formal methods sense is a system state that we want to guarantee is unreachable. (An example of a safety property would be: no two threads can be in the same critical section at the same time).
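
As a toy illustration (everything here is invented, not taken from any real verification tool), here is a Kotlin sketch where the safety property “at most one thread is ever inside the critical section” is stated as a predicate over program state. A formal methods tool would aim to prove the bad state unreachable; this sketch merely asserts it at runtime.

import java.util.concurrent.atomic.AtomicInteger

val inCriticalSection = AtomicInteger(0)
val lock = Any()

fun criticalSection() {
    synchronized(lock) {
        val occupants = inCriticalSection.incrementAndGet()
        // The safety property: never more than one thread in the critical section.
        check(occupants == 1) { "safety property violated: $occupants threads inside" }
        // ... do the protected work ...
        inCriticalSection.decrementAndGet()
    }
}

fun main() {
    val threads = (1..4).map { Thread { repeat(10_000) { criticalSection() } } }
    threads.forEach { it.start() }
    threads.forEach { it.join() }
    println("no violation observed in this run")
}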

The difference between hazards in the safety sense and safety properties in the software formal methods sense is:

  • a software formal methods safety property is always a software state
  • a safety hazard is never a software state

Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions, and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation for how the system got into a hazardous state; it’s not the hazard.)

The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.


If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.

Control

In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers which interact with each other through control actions and feedback. Because it’s a functional model, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, even an entire organization, each would be modeled as a controller in a description of the system used for doing the hazard analysis.

In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.
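
To make the control-structure idea a bit more concrete for software readers, here is a rough Kotlin sketch of the kind of functional model involved. The types and the example controller are my own invention; STPA itself is a diagram-and-analysis exercise, not a code artifact. An STPA-style analysis would then ask when a control action like “initiate failover” becomes unsafe in a particular context.

// A STAMP-style control structure: controllers issue control actions to the
// controlled process (or to other controllers) and receive feedback back.
data class ControlAction(val description: String)
data class Feedback(val observation: String)

interface Controller {
    val name: String
    fun observe(feedback: Feedback)   // update the controller's process model
    fun decide(): ControlAction?      // the control algorithm: choose the next action
}

// A controller can be software, a human operator, a team, or an organization.
// Here, a hypothetical on-call engineer acting as a controller.
class OnCallEngineer : Controller {
    override val name = "on-call engineer"
    private var regionLooksHealthy = true   // the engineer's process model

    override fun observe(feedback: Feedback) {
        if (feedback.observation == "error rate spike") regionLooksHealthy = false
    }

    override fun decide(): ControlAction? =
        if (!regionLooksHealthy) ControlAction("initiate failover") else null
}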

Design

STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP influences the entire lifecycle, the sense I got from Nancy Leveson’s talks was that accidents can usually be attributed to hazards that were not controlled for by the system designer.


Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.

Where I’m positive

I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.

I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.

Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.

I was surprised how influenced Leveson was by some of Sidney Dekker’s work, given that she has a very different perspective than him. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of the blame connotations.

I enjoyed Leveson’s perspective on probabilistic risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them. Very much in tune with my own thoughts on using metrics to assess how you’re doing versus identifying problems. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.

I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.

Where I’m skeptical

STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.

I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates that there was a hazard missed by the designer during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around uncertainty and human judgment on the part of operators, because those likewise indicate flaws in the design to control the hazards.

STAMP generates recommendations and requirements, and these change the system; those changes can reverberate through the system, introducing new hazards. While the conference organizers were aware of this, it wasn’t as much of a first-class concept as I would have liked: the discussions in the workshop tended to end at the recommendation level. There was a little bit of discussion about how to deal with the fact that STAMP can introduce new hazards, but I would have liked to have seen more. It wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change how people work?

Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.

Leveson is also very critical of Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I unrecognizable in her own experiences doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my ears.

While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST was too big on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).

During the workshop, I asked the workshop organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident, and they did not know of an example where that had happened, which took me aback. They both insisted that STPA wasn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.

Outro

I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.

However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.

Why you can’t just ask “why”

Today, most AI work is based on neural networks, but back in the 1980s, AI researchers were using a different approach: they built rule-based systems using mathematical logic. This was the heyday of Lisp and Prolog, which were well-suited towards implementing these systems.

One approach AI researchers used was to sit down with an expert and elicit the rules they used to perform a task. For example, an AI researcher might conduct a series of interviews with a doctor in order to determine how the doctor diagnosed illnesses based on symptoms. The researcher would then encode those rules to build an expert system: a software package that would, ideally, perform tasks as well as an expert.

Alas, the results were disappointing: these expert systems never measured up to the performance of the human experts. Two brothers, Stuart Dreyfus (a professor of industrial engineering and operations research) and Hubert Dreyfus (a professor of philosophy), published a book in 1986 titled Mind Over Machine that described why this approach of building expert systems by eliciting and encoding rules from experts could never really work. It turns out that experts don’t actually solve problems by following a set of rules. Instead, they rely more on intuition and pattern-matching based on a repertoire of cases they’ve built up from their experience1.

Yet, even though those experts didn’t solve problems by following rules, they were still able to articulate a set of rules that they claim to follow when asked. And they weren’t trying to deceive the AI researchers. Instead, something else was going on. The experts were inventing explanations without even being aware that they were doing so. Philosophers of mind use the term confabulation (technically broad confabulation) to refer to this phenomenon: how people will unknowingly fabricate explanations for their actions.

And therein lies the problem of asking “why”.

In the wake of an incident, we often want to understand why it is that people did certain things: both for the people whose actions contributed to the incident (why did they make a global configuration change?) and for the people whose actions mitigated the incident (why did they suspect a retry storm rather than a DDoS attack?).

The problem is, you can’t just ask people why, because people confabulate. You can, of course, simply ask people why they took the actions they did. Heck, you might even get a confident, articulated explanation. But you shouldn’t believe that the explanation they give corresponds to reality.

Yet, getting at the why is important. This is not a case of “why?” being the wrong question in the way that Five Whys-style questions are. There is real value in understanding how people came to the decisions they did, by learning about the signals they received at the time and how their previous experiences shaped their perspectives. That’s where having a skilled interviewer comes in.

A skilled interviewer will increase the chances of getting an accurate response by asking questions that bring the interviewee back into the frame of mind they were in during the incident. Instead of asking an engineer to explain their actions (Why did you do X?), they’ll ask questions to try to jog the engineer’s memory of what they were experiencing during the incident: What were you doing when the page went off? Where did you look first? What did you see? And then what did you do? Because we know that experts do pattern-matching, they’ll also ask questions like: Have you ever seen this symptom before? These questions can elicit responses about previous experiences in similar situations, which can provide context on how the interviewee made their decisions in this case.

Eliciting this sort of information from an interview is hard, and it takes real skill. We should take this sort of work seriously.

1The field of research known as naturalistic decision making studies how experts make decisions.

Making peace with “root cause” during anomaly response

We haven’t figured out the root cause yet.

Uttered by many an engineer while responding to an anomaly

One of the contributions of cognitive systems engineering is treating anomaly response as something worthy of study. Here’s how Woods and Hollnagel describe it:

In anomaly response, there is some underlying process, an engineered or physiological process which will be referred to as the monitored process, whose state changes over time. Faults disturb the functions that go on in the monitored process and generate the demand for practitioners to act to compensate for these disturbances in order to maintain process integrity―what is sometimes referred to as “safing” activities. In parallel, practitioners carry out diagnostic activities to determine the source of the disturbances in order to correct the underlying problem.

David D. Woods, Erik Hollnagel, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, Chapter 8, p71

This type of work will be instantly recognizable to anyone who has been involved in software operations work, even though the domains that cognitive systems engineering researchers initially focused on are completely different (e.g., nuclear power plants, anesthesiology, commercial aviation, space flight).

Anomaly response involves multiple people working together, coordinating on resolving a common problem. Here’s Woods and Hollnagel again, discussing an exchange between two anesthesiologists during a neurosurgery case:

The situation calls for an update to the shared model of the case and its likely trajectory in the future … The exchange is very compact and highly coded, yet it serves to update the common ground previously established at the start of the case… Interestingly, the resident and attending after the update appear to be without candidate explanations as several possibilities have been dismissed given other findings (the resident is quite explicit in this case). After describing the unexpected event, he also adds―”but no explanation”.

p93, ibid

Just like those anesthesiologists, practitioners in all domains often communicate using “compact and highly coded” jargon. I’ve seen claims that jargon is intended to obfuscate, but it’s just the opposite during anomaly response: a team that shares jargon can communicate more efficiently, because of the pre-existing shared context about the precise meaning of those terms (assuming, of course, the team members understand those terms the same way).

That brings us to the “root cause”. Let’s start with a few words about root cause analysis.

Root cause analysis (RCA) is an approach for identifying why an incident happened. It’s often associated with the Five Whys approach that originated at Toyota. Members of the resilience engineering community have been very critical of RCA. For one critical take, check out John Allspaw’s piece The Infinite Hows (or, the Dangers of The Five Whys). Allspaw makes a compelling case, and I agree with him.

What I’m arguing in this post is that the term “root cause” has a completely different connotation when used during anomaly response than when used during post-incident analysis. When, during anomaly response, an engineer says “I haven’t found the root cause”, they do not mean, “I have not yet performed a Five-Whys root cause analysis”. Instead, they mean “the signals I am observing are inconsistent with my mental model of how the system behaves”. Or, less formally, “I know something’s wrong here, but I don’t know what it is!”

When an engineer says “we don’t know the root cause yet” during anomaly response, everybody involved understands what they mean. If you were to reply “actually, there is no such thing as root cause”, the best response you could hope for is a blank stare. The engineers aren’t talking about Five-Whys in this context. Instead, they’re doing dynamic fault management. They’re trying to make sense of what’s currently happening.

Because I’m one of those folks who is critical of RCA, I used to try to encourage people to say, “we don’t understand the failure mode yet” instead of “we don’t know the root cause yet”. But I’m going to stop encouraging them, because I have come around to believing that “we don’t know the root cause” is effective jargon when coordinating during anomaly response.

The post-incident investigation context is another story entirely, and I’m still going to fight the battles against root cause during the post-incident work. But, just as I wouldn’t try to do a one-on-one interview with an engineer while they were engaged in an incident, I’m no longer going to try to get engineers to stop saying root cause while they are engaged in an incident. If the experts at anomaly response find it a useful phrase while they are doing their work, we should recognize this as a part of their expertise.

Yes, it will probably make it a little harder to get an organization to shake off the notion of “root cause” if people still freely use the term during anomaly response. And I won’t use the term myself. But it’s a battle I no longer think is worth fighting.

The hard parts about making it look easy

Bill Clinton was known for projecting warmth in a way that Hillary Clinton didn’t. Yet, when journalists would speak to people who knew them both personally, the story they’d get back was the opposite: one-on-one, it was Hillary Clinton who was the warm one.

We use terms like warmth and authenticity as if they were character attributes of people. But imagine if you were asked to give a speech in front of a large audience. Do you think that if you came off as wooden or stilted, that would be an indicator of how authentic you are as a person?

The ability to project authenticity or warmth is a skill. Experts exhibiting skilled behavior often appear to do it effortlessly. When we watch a virtuoso perform in a domain we know something about, we exclaim “they make it look so easy!”, because we know how much harder it is than it looks.

The resilience engineering researcher David Woods calls this phenomenon the law of fluency, which he defines as:

“Well”-adapted work occurs with a facility that belies the difficulty of the demands resolved and the dilemmas balanced.

Joint Cognitive Systems: Patterns in Cognitive Systems Engineering (p20)

This law is the source of two problems.

First of all, novices tend to mistake skilled performance that seems effortless as innate, rather than a skill that was developed with practice. They don’t see the work, so they don’t know how to get there.

Second of all, skilled practitioners are at increased risk of undetected burnout because they make it look easy even when they are working too hard. This is something that’s easy to miss unless we actively probe for it.

On the writing styles of some resilience engineering researchers

This post is a brief meditation on the writing styles of four luminaries in the field of resilience engineering: Drs. Erik Hollnagel, David Woods, Sidney Dekker, and Richard Cook.

This post was inspired by a conversation I had with my colleague J. Paul Reed. You can find links to papers by these authors at resiliencepapers.club.

Erik Hollnagel – the framework builder

Hollnagel often writes about frameworks or models. A framework is the sort of thing that you would illustrate with a box and arrow diagram, or a table with two or three columns. Here are some examples of Hollnagellian frameworks:

  • Safety-I vs. Safety-II
  • Functional Resonance Analysis Method (FRAM)
  • Resilience Analysis Grid (RAG)
  • Contextual Control Model (COCOM)
  • Cognitive Reliability and Error Analysis Method (CREAM)

Of the four researchers, Hollnagel writes the most like a traditional academic. Even his book Joint Cognitive Systems: Foundations of Cognitive Systems Engineering feels like something out of an academic journal. Of the four authors, he is the one I struggle with the most to gain insight from. Ironically, one of my favorite concepts I learned from him, the ETTO principle, is presented more as a pattern in the style of Woods, as described below.

David Woods – the pattern oracle

I believe that a primary goal of academic research is to identify patterns in the world that had not been recognized before. By this measure, David Woods is the most productive researcher I have encountered, in any field! Again and again, Woods identifies patterns inherent in the nature of how humans work and interact with technology, by looking across an extremely broad range of human activity, from power plant controllers to astronauts to medical doctors. Gaining insight from his work is like discovering there’s a white arrow in the FedEx logo: you never imagined it was there before it was pointed out, and now that you know, it’s impossible not to see it.

These patterns are necessarily high-level, and Woods invents new vocabulary out of whole cloth to capture these new concepts. His writing contains terms like anomaly response, joint cognitive systems, graceful extensibility, units of adaptive behavior, net adaptive value, crunches, competence envelopes, dynamic fault management, adaptive stalls, and veils of fluency.

In Woods’s writing, he often introduces or references many new concepts, and writes about how they interact with each other. This style of writing tends to be very abstract. I’ve found that if I can map the concepts back into my own experiences in the software world, then I’m able to internalize them and they become powerful tools in my conceptual toolbox. But if I can’t make the connection, then I find it hard to get a handhold for scaling his writing. It wasn’t until I watched his video lectures, where he discussed many concrete examples, that I was really able to understand many of his concepts.

Sidney Dekker – the public intellectual

Of the four researchers, Dekker produces the most work written for a lay audience. My entrance into the world of resilience engineering was through his book Drift into Failure. Dekker’s writings in these books tend toward the philosophical, but they don’t read like academic philosophy papers. Rather, it’s more of the “big idea” kind of writing, similar in spirit (although not in tone) to the kinds of books that Nassim Taleb writes. In that sense, Dekker’s writing can go even broader than Woods’s, as Dekker muses on the perception of reality. He is the only one I can imagine writing books with titles such as Just Culture, The Safety Anarchist, or The End of Heaven.

Dekker often writes about how different worldviews shape our understanding of safety. For example, one of his more well-known papers contrasts “new” and “old” views on the nature of human error. In Drift Into Failure, he writes about the Newtonian-Cartesian worldview and contrasts it with a systems perspective. But he doesn’t present these worldviews as frameworks in the way that Hollnagel would. They are less structured, more qualitatively elaborated.

I’m a fan of the “big idea” style of non-fiction writing, and I was enormously influenced by Drift into Failure, which I found extremely readable. However, I’m particularly receptive to this style of writing, and most of my colleagues tend to prefer his Field Guide to Understanding ‘Human Error’, which is more practical.

Richard Cook – the raconteur

Cook’s most famous paper is likely How Complex Systems Fail, but that style of writing isn’t what comes to mind when I think of Cook (that paper is more of a Woods-ian identification of patterns).

Cook is the anti-Hollnagel: where Hollnagel constructs general frameworks, Cook elaborates the details of specific cases. He’s a storyteller, who is able to use stories to teach the reader about larger truths.

Many of Cook’s papers examine work in the domain of medicine. Because Cook has a medical background (he was a practicing anesthesiologist before he was a researcher), he has deep knowledge of that domain and is able to use it to great effect in his analysis of the interactions between humans, technology, and work. A great example of this is his paper on the allocation of ICU beds, Being Bumpable. His Re-Deploy talk entitled The Resilience of Bone and Resilience Engineering is another example of leveraging the details of a specific case to illustrate broader concepts.

Of the four authors, I think that Cook is the one who is most effective at using specific cases to explain complex concepts. He functions almost as an interpreter, grounding Woods-ian concepts in concrete practice. It’s a style of writing that I aspire to. After all, there’s no more effective way to communicate than to tell a good story.

Chasing down the blipperdoodles

To a first approximation, there are two classes of automated alerts:

  1. A human needs to look at this as soon as possible (page the on-call!)
  2. A human should eventually investigate, but it isn’t urgent (email-only alert)

This post is about the second category. These are events like the error spikes that happened at 2am that can wait until business hours to look into.

When I was on the CORE1 team, one of the responsibilities of team members was to investigate these non-urgent alert emails. The team colorfully referred to them as blipperdoodles2, presumably because they look like blips on the dashboard.

I didn’t enjoy this part of the work. Blipperdoodles can be a pain to track down, are often not actionable (e.g., networking transient), and, in the tougher cases, are downright impossible to make sense of. This means that the work feels largely unsatisfying. As a software engineer, I’ve felt a powerful instinct to dismiss transient errors, often with a joke about cosmic rays.

But I’ve really come around on the value of chasing down blipperdoodles. Looking back, they gave me an opportunity to practice doing diagnostic work, in a low-stakes environment. There’s little pressure on you when you’re doing this work, and if something more urgent comes up, the norms of the team allow you to abandon your investigation. After all, it’s just a blipperdoodle.

Blipperdoodles also tend to be a good mix of simple and difficult. Some of them are common enough that experienced engineers can diagnose them by the shape of the graphs. Others are so hard that an engineer has to admit defeat once they reach their self-imposed timebox for the investigation. Most are in between.

Chasing blipperdoodles is a form of operational training. And while it may be frustrating to spend your time tracking down anomalies, you’ll appreciate the skills you’ve developed when the heat is on, which is what happens when everything is on fire.

1 CORE stands for Critical Operations & Reliability Engineering. They’re the centralized incident management team at Netflix.

2I believe Brian Trump coined the term.

The inevitable double bind

Here are three recent COVID-19 news stories:

The first two stories are about large organizations (the FDA, large banks) moving too slowly in order to comply with regulations. The third story is about the risks of the FDA moving too quickly.

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

In hindsight, it’s easy to identify who wasn’t quick enough and who wasn’t careful enough. But if you want to understand how agents make these decisions, you need to understand the multiple pressures that agents experience, because they are trading these off. You also need to understand what information they had available at the time, as well as their previous experiences. I thought this observation of the behavior of the banks was particularly insightful.

But it does tell a more general story about the big banks, that they have invested so much in at least the formalities of compliance that they have become worse than small banks at making loans to new customers.

Matt Levine

Reactions to previous incidents have unintended consequences in the future. The conclusion to draw here isn’t that “the banks are now overregulated”. Rather, it’s that double binds are unavoidable: we can’t eliminate them by adding or removing regulations. There’s no perfect knob setting where they don’t happen anymore.

Once we accept that double binds are inevitable, we can shift our focus away from just adjusting the knob and towards work that will prepare agents to make more effective decisions when they inevitably encounter the next double bind.

Embracing the beautiful mess

Nobody likes a mess. Especially in the world of software engineering, we always strive to build well-structured systems. No one ever sets out to build a big ball of mud.

Alas, we are constantly reminded that the systems we work in are messier than we’d like. This messiness often comes to light in the wake of an incident, when we dig in to understand what happened. Invariably, we find that the people are a particularly messy part of the overall system, and that the actions they take contribute to incidents. In the wake of the incident, we identify follow-up work that we hope will bring more order, and less mess, into our world. What we miss, though, is the role that the messy nature of our systems plays in keeping things working.

When I use the term system here, I mean it in the broader sense of a socio-technical system that includes both the technological elements (software, hardware) and the humans involved, the operators in particular.

Yes, there are neat, well-designed structures in place that help keep our system healthy: elements that include automated integration tests, canary deployments, and staffed on-call rotations. But complementing those structures are informal layers of defense provided by the people in our system. These are the teammates who are not on-call but jump in to help, or folks who just happen to lurk in Slack channels and provide key context at the right moment, to either help diagnose an incident or prevent one from happening in the first place.

This informal, messy system of defense is like a dynamic, overlapping patchwork. And sometimes this system fails: for example, a person who would normally chime in with relevant information happens to be out of the office that day. Or, someone takes an action which, under typical circumstances, would be beneficial, but under the specific circumstances of the incident, actually made things worse.

We would never set out to design a socio-technical system the way our systems actually are. Yet, these organic, messy systems actually work better than the neat, orderly systems that engineers dream of, because of how the messy system leverages human expertise.

It’s tempting to bemoan messiness, and to always try to reduce it. And, yes, messiness can be an indicator of problems in the system: for example, people working around the system instead of using it as intended is a kind of messiness that points to a shortcoming in our system.

But the human messiness we see under the ugly light of failure is the messiness that actually helps keep the system up and running when that light isn’t shining. If we want to get better at keeping our systems up and running, we need to understand what the mess looks like when things are actually working. We need to learn to embrace the mess. Because there’s beauty in that mess, the beauty of a system that keeps on running day after day.

How did software get so reliable without proof?

In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper with the title How Did Software Get So Reliable Without Proof? In this paper, Hoare grapples with the observation that software seems to be more reliable than computer science researchers expected was possible without the use of mathematical proofs for verification (emphasis added):

Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …

So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever been subjected to the rigours of formal proof?

Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.

Looking back on this paper, what strikes me is the absence of acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added):

The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.

The problem with the question “How did software get so reliable without proof?” is that it’s asking the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you are doomed to not even be able to ask the right questions.