Taming complexity: from contract to compact

The software contract

We software engineers love the metaphor of the contract when describing software behavior: If I give you X, you promise to give me Y in return. One example of a contract is the signature of a function in a statically typed language. Here’s a function signature in the Kotlin programming language:

fun exportArtifact(exportable: Exportable): DeliveryArtifact

This signature promises that if you call the exportArtifact function with an argument of type Exportable, the return value will be an object of type DeliveryArtifact.

Function signatures are a special case of software contracts, in that they can be enforced mechanically: the compiler guarantees that the contract will hold for any program that compiles successfully. In general, though, the software contracts that we care about can’t be mechanically checked. For example, we might talk about a contract that a particular service provides, but we don’t have tools that can guarantee that our service conforms to the contract. That’s why we have to test it.

Contracts are a type of specification: they tell us that if certain preconditions are met, the system described by the contract guarantees that certain postconditions will be met in return. The idea of reasoning about the behavior of a program using preconditions and postconditions was popularized by C.A.R. Hoare in his legendary paper An Axiomatic Basis for Computer Programming, and is known today as Hoare logic. The language of contract in the software engineering sense was popularized by Bertrand Meyer (specifically, design by contract) in his language Eiffel and his book Object-Oriented Software Construction.
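Kotlin doesn’t have first-class design-by-contract support the way Eiffel does, but we can sketch the idea with the standard library’s require (precondition) and check (postcondition) functions. This is only a hedged illustration: the fields on Exportable and DeliveryArtifact and the specific conditions are invented here, not taken from any real codebase.

    // Hypothetical types, invented for illustration only.
    data class Exportable(val name: String)
    data class DeliveryArtifact(val reference: String)

    fun exportArtifact(exportable: Exportable): DeliveryArtifact {
        // Precondition: the caller must supply an exportable with a non-blank name.
        require(exportable.name.isNotBlank()) { "Exportable must have a name" }

        val artifact = DeliveryArtifact(reference = "artifact:${exportable.name}")

        // Postcondition: we promise to return an artifact with a non-blank reference.
        check(artifact.reference.isNotBlank()) { "DeliveryArtifact must have a reference" }
        return artifact
    }

If a caller violates the precondition, require throws an IllegalArgumentException; if the implementation breaks its own promise, check throws an IllegalStateException.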

We software engineers like contracts because they help us reason about the behavior of a system. Instead of requiring us to understand the complete details of a system that we interact with, all we need to do is understand the contract.

For a given system, it’s easier to reason about its behavior from a contract than from its implementation details.

Contracts, therefore, are a form of abstraction. In addition, contracts are composable: we can feed the outputs of system X into system Y if the postconditions of X are consistent with the preconditions of Y. Because we can compose contracts, we can use them to help us build systems out of parts that are described by contracts. Contracts are a tool that enables us humans to work together to build software systems that are too complex for any individual human to understand.
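To make the composition point concrete, here’s a continuation of the hypothetical sketch above. The publishArtifact function and its URL are invented purely for illustration; the point is only that its precondition (give me a DeliveryArtifact) is satisfied by exportArtifact’s postcondition (I return a DeliveryArtifact), so the two compose:

    // Hypothetical downstream step whose precondition is simply "give me a
    // DeliveryArtifact", which exportArtifact's postcondition guarantees.
    fun publishArtifact(artifact: DeliveryArtifact): String =
        "https://artifacts.example.com/${artifact.reference}"

    // Because exportArtifact's postcondition matches publishArtifact's precondition,
    // the two contracts compose; neither side needs to see the other's implementation.
    fun exportAndPublish(exportable: Exportable): String =
        publishArtifact(exportArtifact(exportable))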

When contracts aren’t useful

Alas, contracts aren’t much use for reasoning about system behavior when either of the following two conditions holds:

  1. A system’s implementation doesn’t fully conform to its contract.
  2. The precondition of a system’s contract is violated by a client.

An example of a contract where a precondition (number of allowed dependencies) was violated

Whether a problem falls into the first or second condition is a judgment call. Either way, your system is now in a bad state.

A contract is of no use for a system that has gotten into a bad state.

A system that has gotten into a bad state is violating its contract, pretty much by definition. This means we must now deal with the implementation details of the system in order to get it back into a good state. Since no one person understands the entire system, we often need the help of multiple people to get the system back into a good state.

Operational surprises often require that multiple engineers work together to get the system back into a good state

Since contracts can’t help us here, we deal with the complexity by leveraging the fact that different engineers have expertise in different parts of the system. By working together, we are pooling the expertise of the engineers. To pull this off, the engineers need to coordinate effectively. Enter the Basic Compact.

The Basic Compact and requirements of coordination

Gary Klein, Paul Feltovich and David Woods defined the Basic Compact in their paper Common Ground and Coordination in Joint Activity:

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment for all parties to support the process of coordination. The Basic Compact is an agreement (usually tacit) to participate in the joint activity and to carry out the required coordination responsibilities.

One example of a joint activity is… when engineers assemble to resolve an incident! In doing so, they enter a Basic Compact: to work together to get the system back into a stable state. Working together on a task requires coordination, and the paper’s authors list three primary requirements for coordinating effectively on a joint activity: interpredictability, common ground, and directability.

The Basic Compact is also a commitment to ensure a reasonable level of interpredictability. Moreover, the Basic Compact requires that if one party intends to drop out of the joint activity, he or she must inform the other parties.

Interpredictability is about being able to reason about the behavior of other people, and behaving in such a way that your behavior is reasonable to others. As with the world of software contracts, being able to reason about behavior is critical. Unlike software contracts, here we are reasoning about agents rather than artifacts, and those agents are also reasoning about us.

The Basic Compact includes an expectation that the parties will repair faulty knowledge, beliefs and assumptions when these are detected.

Each engineer involved in resolving an incident has beliefs about both the system state and the beliefs of other engineers involved. Keeping mutual beliefs up to date requires coordination work.

During an incident, the responders need to maintain a shared understanding of information such as the known state of the system and what mitigations people are about to attempt. The authors use the term common ground to describe this shared understanding. Anyone who has been in an on-call rotation will find the following description familiar:

All parties have to be reasonably confident that they and the others will carry out their responsibilities in the Basic Compact. In addition to repairing common ground, these responsibilities include such elements as acknowledging the receipt of signals, transmitting some construal of the meaning of the signal back to the sender, and indicating preparation for consequent acts.

Maintaining common ground during an incident takes active effort on the part of the participants, especially when we’re physically distributed and the situation is dynamic: the system is not only in a bad state, it’s in a bad state that’s changing over time. Misunderstandings can creep in, which the authors describe as a common ground breakdown that requires repair to make progress.

A common ground breakdown can mean the difference between a resolution time of minutes and hours. I recall an incident I was involved with, where an engineer made a relevant comment in Slack early on during the incident, and I missed its significance in the moment. In retrospect, I don’t know if the engineer who sent the message realized that I hadn’t properly processed its implications at the time.

Directability refers to deliberate attempts to modify the actions of the other partners as conditions and priorities change.

Imagine a software system has gone unhealthy in one geographical region, and engineer X begins to execute a failover to remediate. Engineer Y notices customer impact in the new region, and types into Slack, “We’re now seeing a problem in the region we’re failing into! Abort the failover!” This is an example of directability, which describes the ability of one agent to affect the behavior of another agent through signaling.

Making contracts and compacts first class

Both contracts and compacts are tools to help deal with complexity. People use contracts to help reason about the behavior of software artifacts. People use the Basic Compact to help reason about each other’s behavior when working together to resolve an incident.

I’d like to see both contracts and compacts get better treatment as first-class concerns. For contracts, there still isn’t a mainstream language with first-class support for preconditions and postconditions, although some non-Eiffel languages do support them (Clojure and D, for example). There’s also Pact, which bills itself as a contract testing tool; it sounds interesting, but I haven’t had a chance to play with it.

For coordination (compacts), I’d like to see explicit recognition of the difficulty of coordination and the significant role it plays during incidents. One of the positive outcomes of the growing popularity of resilience engineering and the Learning from Incidents in Software movement is the recognition that coordination is a critical activity that we should spend more time learning about.

Further reading and watching

Common Ground and Coordination in Joint Activity is worth reading in its entirety. I only scratched the surface of the paper in this post. John Allspaw gave a great Papers We Love talk on this paper.

Laura Maguire has done some recent PhD work on managing the hidden costs of coordination. She also gave a talk at QCon on the subject.

Ten challenges for making automation a “team player” in joint human-agent activity is a paper that explores the implications of building software agents that are capable of coordinating effectively with humans.

An Axiomatic Basis for Computer Programming is worth reading to get a sense of the history of preconditions and postconditions. Check out Jean Yang’s Papers We Love talk on it.

Even the U.S. military

In 2019, ProPublica published a deeply researched series of stories called Disaster in the Pacific: Death and Neglect in the 7th Fleet about fatal military accidents at sea. As in all accidents, there are many contributing factors, as detailed in these stories. In this post I’m going to focus on one particular factor, as illustrated in the following story excerpts (emphasis mine):

The December 2018 flight was part of a week of hastily planned exercises that would test how prepared Fighter Attack Squadron 242 was for war with North Korea. But the entire squadron, not just Resilard, had been struggling for months to maintain their basic skills. Flying a fighter jet is a highly perishable skill, but training hours had been elusive. Repairs to jets were delayed. Pleadings up the chain of command for help and relief went ignored.

“Everyone believes us to be under-resourced, under-manned,” the squadron’s commander wrote to his superiors months earlier.

Faulty Equipment, Lapsed Training, Repeated Warnings: How a Preventable Disaster Killed Six Marines by Robert Faturechi, Megan Rose and T. Christian Miller, December 30, 2019

The review offered a critique of the Navy’s drive to save money by installing new technology rather than investing in training for its sailors.

“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence),” the report said, “without considering the effect to operators who are trained and proficient in operating legacy equipment.”

Collision Course by T. Christian Miller, Megan Rose, Robert Faturechi and Agnes Chang, December 20, 2019

The fleet was short of sailors, and those it had were often poorly trained and worked to exhaustion. Its warships were falling apart, and a bruising, ceaseless pace of operations meant there was little chance to get necessary repairs done. The very top of the Navy was consumed with buying new, more sophisticated ships, even as its sailors struggled to master and hold together those they had. The Pentagon, half a world away, was signing off on requests for ships to carry out more and more missions.

The risks were obvious, and Aucoin repeatedly warned his superiors about them. During video conferences, he detailed his fleet’s pressing needs and the hazards of not addressing them. He compiled data showing that the unrelenting demands on his ships and sailors were unsustainable. He pleaded with his bosses to acknowledge the vulnerability of the 7th Fleet.

Years of Warnings, Then Death and Disaster by Robert Faturechi, Megan Rose and T. Christian Miller, February 7, 2019

Then there was the crew. In those eight months, nearly 40 percent of the Fitzgerald’s crew had turned over. The Navy replaced them with younger, less-seasoned sailors and officers, leaving the Fitzgerald with the highest percentage of new crew members of any destroyer in the fleet. But naval commanders had skimped even further, cutting into the number of sailors Benson needed to keep the ship running smoothly. The Fitzgerald had around 270 people total — short of the 303 sailors called for by the Navy.

Key positions were vacant, despite repeated requests from the Fitzgerald to Navy higher-ups. The senior enlisted quartermaster position — charged with training inexperienced sailors to steer the ship — had gone unfilled for more than two years. The technician in charge of the ship’s radar was on medical leave, with no replacement. The personnel shortages made it difficult to post watches on both the starboard and port sides of the ship, a once-common Navy practice.

When the ship set sail in February 2017, it was supposed to be for a short training mission for its green crew. Instead, the Navy never allowed the Fitzgerald to return to Yokosuka. North Korea was launching missiles on a regular basis. China was aggressively sending warships to pursue its territorial claims to disputed islands off its coast. Seventh Fleet commanders deployed the Fitzgerald like a pinch hitter, repeatedly assigning it new missions to complete.

Death and Valor on an American Warship Doomed by its Own Navy, by T. Christian Miller, Megan Rose and Robert Faturechi, February 6, 2019

The U.S. Department of Defense may be the best-resourced organization in all of human history, with a 2020 budget of $738 billion. And yet we still see a lack of resources as a contributing factor in the fatal U.S. military accidents described above.

The brutal reality is that being well resourced does not exempt an organization from production pressures! Instead, a heavily resourced organization will have a larger scope: it will be asked to do more. As described in one of the excerpts above, the Navy was focused on procuring new ships at the expense of the state of the existing ones.

Lawrence Hirschhorn observed that every system is stretched to operate at its capacity, an observation known as the law of stretched systems. Being given more resources means that you will eventually be asked to do more.

Not even the mighty U.S. Department of Defense can escape the adaptive universe.

Battleshorts, exaptations, and the limits of STAMP

A couple of threads got me thinking about the limits of STAMP.

The first thread was sparked by a link to a Hacker News comment, sent to me by a colleague of mine, Danny Thomas. It introduced me to a concept I hadn’t heard of before: a battleshort. There’s even an official definition in a NATO document:

The capability to bypass certain safety features in a system to ensure completion of the mission without interruption due to the safety feature

AOP-38, Allied Ordnance Publication 38, Edition 3, Glossary of terms and definitions concerning the safety and suitability for service of munitions, explosives and related products, April 2002.

The second thread was sparked by a Twitter exchange between a UK Southern Railway train driver and the official UK Southern Railway Twitter account, in which the driver used the public account to communicate with his own organization.

This is a great example of exapting, a concept introduced by the paleontologists Stephen Jay Gould and Elisabeth Vrba. Exaptation is a solution to the following problem in evolutionary biology: what good is a partially functional wing? Either an animal can fly or it can’t, and a fully functional wing can’t evolve in a single generation, so how do the initial evolutionary stages of a wing confer advantage on the organism?

The answer is that while a partially functional wing might be useless for flight, it might still be useful as a fin. And so, if wings evolved from fins, then the appendage could confer an advantage at each evolutionary stage. The fin is exapted into a wing; it is repurposed to serve a new function. In the Twitter example above, the train driver repurposed a social media service for communicating with his own organization.

Which brings us back to STAMP. One of the central assumptions of STAMP is that it is possible to construct an accurate enough control model of the system at the design stage to identify all of the hazards and unsafe control actions. You can see this assumption in action in the CAST handbook (CAST is STAMP’s accident analysis process), in the example questions from page 40 of the handbook (emphasis mine), which use counterfactual reasoning to try to identify flaws in the original hazard analysis.

Did the design account for the possibility of this increased pressure? If not, why not? Was this risk assessed at the design stage?

This seems like a predictable design flaw. Was the unsafe interaction between the two requirements (preventing liquid from entering the flare and the need to discharge gases to the flare) identified in the design or hazard analysis efforts? If so, why was it not handled in the design or in operational procedures? If it was not identified, why not?

Why wasn’t the increasing pressure detected and handled? If there were alerts, why did they not result in effective action to handle the increasing pressure? If there were automatic overpressurization control devices (e.g., relief valves), why were they not effective? If there were not automatic devices, then why not? Was it not feasible to provide them?

Was this type of pressure increase anticipated? If it was anticipated, then why was it not handled in the design or operational procedures? If it was not anticipated, why not?

Was there any way to contain the contents within some controlled area (barrier), at least the catalyst pellets?

Why was the area around the reactor not isolated during a potentially hazardous operation? Why was there no protection against catalyst pellets flying around?

This line of reasoning assumes that all hazards are, in principle, identifiable at the design stage. I think that phenomena like battleshorts and exaptations make this goal unattainable.

Now, in principle, nothing prevents an engineer using STPA (STAMP’s hazard analysis technique) from identifying scenarios that involve battleshorts and exaptations. After all, STPA is an exploratory technique. But I suspect that many of these kinds of adaptations are literally unimaginable to the designers.

Programming means never getting to say “it depends”

Consider the following scenario:

You’re on a team that owns and operates a service. It’s the weekend, and you’re not on call, and you’re out and about in the world, without your laptop. To alleviate boredom, you pick up your phone and look at your service dashboard, because you’re the kind of person that checks the dashboard every so often, even when you’re not on-call. You notice something unusual: the rate of blocked database transactions has increased significantly. This is the kind of signal you would look into further if you were sitting in front of a computer. Since you don’t have a computer handy, you open up Slack on your phone with the intention of pinging X, your colleague who is on-call for the day. When you enter your team channel, you notice that X is already in the process of investigating an issue with the service.

The question is: do you send X a Slack message about the anomalous signal you saw?

If you asked me this question (say, in an interview), I’d answer “it depends”. If I believed that X was already aware of the elevated rate of blocked transactions, then I wouldn’t send them the message. On the other hand, if I thought they were not aware that blocked transactions were one of the symptoms, and I thought this was useful information to help them figure out what the underlying problem was, then I would send the message.

Now imagine you’re implementing automated alerts for your system. You need to make the same sort of decision: when do you redirect the attention of the on-call (i.e., page them)?

One answer is: “only page when you can provide information that needs to be acted on imminently, that the on-call doesn’t already have”. But this isn’t a very practical requirement, because the person implementing the alert is never going to have enough information to make an accurate judgment about what the on-call knows. You can use heuristics (e.g., suppress a page if a related page recently fired), but you can never know for certain whether that other page was “related” or was actually a different problem that happens to be coincident.
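As a concrete illustration of the kind of heuristic we end up encoding, here’s a minimal sketch (the names, types, and 30-minute window are all invented) of a “suppress if a related page fired recently” rule. Note that the code can only encode a proxy for “related”, such as sharing a service name; it has no access to what the on-call actually knows or believes:

    import java.time.Duration
    import java.time.Instant

    // Hypothetical page record; sharing a service name is our stand-in for "related".
    data class Page(val service: String, val firedAt: Instant)

    // Heuristic: suppress a new page if another page for the same service fired
    // within the suppression window. This encodes a guess about what the on-call
    // already knows; it cannot consult what they actually believe.
    fun shouldSuppress(
        candidate: Page,
        recentPages: List<Page>,
        window: Duration = Duration.ofMinutes(30)
    ): Boolean =
        recentPages.any { previous ->
            previous.service == candidate.service &&
                Duration.between(previous.firedAt, candidate.firedAt) <= window
        }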

And therein lies the weakness of automation. The designer implementing the automation will never have enough information at design time to make optimal decisions, because we can’t build automation today that has access to information like “what the on-call currently believes about the state of the world” or “what the user genuinely intended when they asked to disable the cluster”. Even humans can’t build perfectly accurate mental models of the mental models of their colleagues, but we can do a lot better than software. For example, we can interpret Slack messages written by our colleagues to get a sense of what they believe.

Since the automation systems we build won’t ever have access to all of the inputs they need to make optimal decisions, when we design for ambiguous cases we have to make a judgment call about what to do. We fall back on heuristics such as “what is the common case?” or “how can we avoid the worst case?” But the if statements in our code never allow us to express “it depends on information that’s inaccessible to the program”.

Abstractions and implicit preconditions

One of my favorite essays by Joel Spolsky is The Law of Leaky Abstractions. He phrases the law as:

All non-trivial abstractions, to some degree, are leaky.

One of the challenges with abstractions is that they depend upon preconditions: the world has to be in a certain state for the abstraction to hold. Sometimes the consumer of the abstraction is explicitly aware of the precondition, but sometimes they aren’t. After all, the advantage of an abstraction is that it hides information. NFS allows a user to access files stored remotely without having to know networking details. Except when there’s some sort of networking problem, and the user is completely flummoxed. The advantage of not having to know how NFS works has become a liability.
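Here’s a minimal sketch of what an implicit precondition can look like in code; the path is hypothetical and stands in for an NFS-mounted directory:

    import java.io.File

    // The abstraction: "read the user's settings", with no mention of storage or
    // networking. The path below is invented for illustration.
    fun readSettings(): String = File("/mnt/shared/settings.conf").readText()

    // The implicit precondition: /mnt/shared is a healthy NFS mount. When the
    // network or file server misbehaves, this call hangs or throws an I/O error
    // expressed in terms the caller was never supposed to need to understand.
    // That is the abstraction leaking.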

The problem of implicit preconditions is everywhere in complex systems. We are forever consuming abstractions that have a set of preconditions that must be true for the abstraction to work correctly. Poke at an incident, and you’ll almost always find an implicit precondition. Something we didn’t even know about, that always has to be true, that was always true, until now.

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.

A nudge in the right direction

I was involved in an operational surprise a few weeks ago where some of my colleagues, while not directly involved in handling the incident, nudged us in directions that helped with quick remediation.

In one case, a colleague suggested moving the discussion into a different Slack channel, and in another case, a colleague focused the attention on a potential trigger: some newly inserted database records.

I also remember another operational surprise where an experienced engineer asked someone in Slack, “Hey, there’s a new person on our team, can you explain what X means?”, and the response kicked off a series of events that brought in someone else with more context, which led to the surprise being remediated much more quickly.

These sorts of nudges fly under our radar, and so they’re easy to miss. But they can make the difference between an operational surprise with no customer impact and a multi-hour outage, and they can be contingent on the right person happening to be in the right Slack channel at the right time, seeing the right message.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.

Thoughts on STAMP

STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs an annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering a Safer World, but I’ve never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.

This post captures my impressions of STAMP after attending the workshop.

But before I can talk about STAMP, I have to explain the safety-engineering term hazard.

Hazards

Safety engineers work to prevent unacceptable losses. The word safety typically evokes losses such as human death or injury, but a loss can be anything (e.g., property damage, mission loss).

Engineers have control over the systems they are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on states that the system could get into that could lead to a loss.

As an example, consider the scenario where an autonomous vehicle is driving very close to another vehicle immediately in front of it. An accident may or may not happen, depending on whether the vehicle in front slams on the brakes. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part of the environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that the minimum separation between the two cars is violated. Safety engineers work to identify, prevent, and mitigate hazards.
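As a rough sketch (the type, field, and threshold below are invented for illustration), we can express that hazard as a predicate over the system’s state, which is what makes it something the designer can identify and control for:

    // The designer controls the following distance; the lead driver's braking is
    // part of the environment and is not modeled here.
    data class DrivingState(val separationMeters: Double)

    // The hazard is a state of the system in its environment: minimum separation
    // violated. It is not the loss itself; whether a collision actually occurs
    // also depends on what the vehicle ahead does.
    fun minimumSeparationViolated(
        state: DrivingState,
        minSeparationMeters: Double = 10.0
    ): Boolean = state.separationMeters < minSeparationMeters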

As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property in the formal methods sense describes a system state that we want to guarantee is unreachable. (An example of a safety property would be: no two threads can be in the same critical section at the same time.)

The difference between hazards in the safety sense and safety properties in the software formal methods sense is:

  • a software formal methods safety property is always a software state
  • a safety hazard is never a software state

Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions, and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation of how the system got into a hazardous state; it’s not the hazard itself.)

The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.


If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.

Control

In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers that interact with each other through control actions and feedback. Because it’s a functional model, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, even an entire organization would each be modeled as a controller in a description of the system used for the hazard analysis.
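Here’s one way I picture that functional model as code. This is purely my own hedged sketch, not part of any STAMP tooling: a controller holds a process model (its beliefs about the controlled process), issues control actions, and updates its model from feedback. The controller could equally be software, a person, a team, or an organization:

    // Illustrative interfaces only; the names are my own, not STAMP's.
    interface ControlAction
    interface Feedback

    interface Controller<A : ControlAction, F : Feedback> {
        // The controller's (possibly inaccurate) beliefs about the controlled process.
        val processModel: Map<String, Any?>

        // Choose a control action to issue, based on the current process model.
        fun decide(): A?

        // Update the process model based on feedback from the controlled process.
        fun observe(feedback: F)
    }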

In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.

Design

STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP is meant to influence the entire lifecycle, the sense I got from Nancy Leveson’s talks was that accidents can usually be attributed to hazards that were not controlled for by the system designer.


Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.

Where I’m positive

I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.

I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.

Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.

I was surprised how influenced Leveson was by some of Sidney Dekker’s work, given that she has a very different perspective than him. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of the blame connotations.

I enjoyed Leveson’s perspective on probabilistic risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them. That’s very much in tune with my own thoughts on using metrics to assess how you’re doing versus identifying problems. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.

I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.

Where I’m skeptical

STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.

I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates that there was a hazard missed by the designer during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around uncertainty and the judgment calls of human operators, because those likewise indicate flaws in the design to control the hazards.

STAMP generates recommendations and requirements, and those change the system; the changes can reverberate through the system, introducing new hazards. While the conference organizers were aware of this, it wasn’t as much of a first-class concern as I would have liked: the discussions in the workshop tended to end at the recommendation level. There was a little bit of discussion about how to deal with the fact that STAMP can introduce new hazards, but I would have liked to have seen more. It also wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change how people work?

Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.

Leveson is also very critical of Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I unrecognizable from her own experience doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my ears.

While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST leaned too heavily on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).

During the workshop, I asked the workshop organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident, and they did not know of an example where that had happened, which took me aback. They both insisted that STPA isn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.

Outro

I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.

However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.

Soak time

Over the past few weeks, I’ve had the experience multiple times where I’m communicating with someone in a semi-synchronous way (e.g., Slack, Twitter), and I respond to them without having properly understood what they were trying to communicate.

In one instance, I figured out my mistake during the conversation, and in another instance, I didn’t fully get it until after the conversation had completed, and I was out for a walk.

In these circumstances, I find that I’m primed to respond based on my expectations, which makes me likely to misinterpret. The other person is rarely trying to communicate what I’m expecting them to. Too often, I’m simply waiting for my turn to talk instead of really listening to what they are trying to say.

It’s tempting to blame this on Slack or Twitter, but I think this principle applies to all synchronous or semi-synchronous communication, including face-to-face conversations. I’ve certainly experienced this in technical interviews, where my brain is always primed to think, “What answer is the interviewer looking for?”

John Allspaw uses the term soak time to refer to the additional time it takes us to process the information we’ve received in a post-incident review meeting, so we can make better decisions about what the next steps are. I think it describes this phenomenon well.

Whether you call it soak time, l’esprit de l’escalier, or hammock-driven development, keep in mind that it takes time for your brain to process information. Give yourself permission to take that time. Insist on it.

In praise of the Wild West engineer

If you put software engineers on a continuum with slow and cautious at one end and Wild West at the other end, I’ve always been on the slow and cautious side. But I’ve come to appreciate the value of the Wild West engineer.

Here’s an example of Wild West behavior: during some sort of important operational task (let’s say a failover), the engineer carrying it out sees a Python stack trace appear in the UI. Whoops, there’s a bug in the operational code! They ssh to the box, fix the code, and then run it again. It works!

This sort of behavior used to horrify me. How could you just hotfix the code by ssh’ing to the box and updating the code by hand like that? Do you know how dangerous that is? You’re supposed to PR, run tests, and deploy a new image!

But here’s the thing. During an actual incident, the engineers involved will have to take risky actions in order to remediate. You’ve got to poke and prod at the system in a way that’s potentially unsafe in order to (hopefully!) make things better. As Richard Cook wrote, all practitioner actions are gambles. The Wild West engineers are the ones with the most experience making these sorts of production changes, so they’re more likely to be successful at them in these circumstances.

I also don’t think that a Wild West engineer is someone who simply takes unnecessary risks. Rather, they have a well-calibrated sense of what’s risky and what isn’t. In particular, if something breaks, they know how to fix it. Once, years ago, during an incident, I made a global change to a dynamic property in order to speed up a remediation, and a Wild West engineer I knew clucked with disapproval. That was a stupid risk I took. You always stage your changes across regions!

Now, simply because the Wild West engineers have a well-calibrated sense of risk, doesn’t mean their sense is always correct! David Woods notes that all systems are simultaneously well-adapted, under-adapted, and over-adapted. The Wild West engineer might miscalculate a risk But I think it’s a mistake to dismiss Wild West engineers as simply irresponsible. While I’m still firmly the slow-and-cautious type, when everything is on fire, I’m happy to have the Wild West engineers around to take those dangerous remediation actions. Because if it ends up making things worse, they’ll know how to handle it. They’ve been there before.