Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.

And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.

The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.

However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!

So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, doing this would require a huge code change because of an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script to relink intermediate object files, which avoids the need to change the code. And it works!

But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.

For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on wifi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!

There are further workarounds they can do to save space, like replacing structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit, and the risk is catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.

The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment. They remove the old version of this app. They call this the “Yolo” release strategy.

They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this decision will have a dollar cost of eight figures. With only a week to go before release, they drop iOS 8 and eat the cost to get under the cellular download limit.

The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the projected slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.

And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective C.

An Uber engineer in the Amsterdam office comes up with an innovative workaround: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.

And yet, the risk of decompensation is ever-present: the strike team worries that their space-saving wins will not be able to keep pace with the growth of the application.

Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.


The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread: I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.

I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:

Software as a limited medium

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Fred Brooks, The Mythical Man-Month

We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.

And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.

Software is discrete in a way that the world isn’t

We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.

Software as a medium, and data storage in particular, encourages over-simplification of the world, because we need to categorize our data, figure out which tables to store it in and what values those columns should have, and so many items in the world just aren’t easy to model well like that.

As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know”? But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.
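To make the “it depends” concrete, here is a minimal Kotlin sketch in which two callers need two different definitions of “up”. The Cluster type, its fields, and both definitions are hypothetical, chosen only for illustration.

```kotlin
// Hypothetical model of a cluster; a real system exposes far more signals.
data class Cluster(val desiredInstances: Int, val healthyInstances: Int)

// One possible definition: "up" means the cluster can serve at least some traffic.
fun isUpForTraffic(cluster: Cluster): Boolean =
    cluster.healthyInstances > 0

// Another possible definition: "up" means a deployment has fully finished.
fun isUpForDeployment(cluster: Cluster): Boolean =
    cluster.desiredInstances > 0 && cluster.healthyInstances >= cluster.desiredInstances

fun main() {
    val partiallyHealthy = Cluster(desiredInstances = 10, healthyInstances = 7)
    // The same cluster gets a different answer depending on who is asking.
    println(isUpForTraffic(partiallyHealthy))
    println(isUpForDeployment(partiallyHealthy))
}
```

Each function bakes in one answer to “why do you want to know?”, and a caller with a third reason has to add a third function.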

And so, just like all engineers, given our time and resource constraints, we have to make over-simplifications to get our work done. William Kent wrote a whole book on this topic called Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World (h/t Hillel Wayne).

Software systems are limited in how they integrate inputs

In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when the cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road and counting the cars go by.

As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.

A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”

Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.

Recognizing the limits of software

The lesson here is that there are limits to how well a software system can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate large input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need to have humans-in-the-loop to make up for these sorts of shortcomings.

The goal is not to build better software systems, but to build better joint cognitive systems that are made up of humans and software together.

Top-down code reviews

I’ve long been frustrated by the task of code reviews. Often, the pull request (PR) I’m reviewing involves a part of the codebase I’m not intimately familiar with. I read it, not quite understanding it, looking to see if I can offer some sort of useful feedback, and typically that feedback would be on the micro level (e.g., “you can simplify this function by calling this other library function instead”).

I recently started experimenting with a new review approach that I’m going to call top down code review. Here’s how it works: I start by understanding the code well enough so that I can write my own version of the pull request message, describing the PR in my own words. After I’ve done this, then I provide feedback.

I call this approach “top down” because the review that I end up generating starts with a “top down” description of the PR: the problem it’s trying to solve, and the solution approach, before diving into describing notable implementation details. Here are the reviews I’ve done in this style so far:

I’ve been finding this approach useful because it forces me to come to terms with how well I really understand the PR. If I can’t explain the PR in my own words, then I don’t really understand it. It also helps me figure out what questions to ask the original author to help clarify things for me.

I also get more of a sense of closure after doing the review. Even if I had no feedback to give, I understand the changes in a way that I didn’t before.

Why you should write up your own incident

You shouldn’t write up your own incident if you can avoid it. To write up an incident well, you need to be able to capture the perspectives of the different people who were involved. If the write-up author was also one of the responders, then the writeup will be biased towards their perspective, at the expense of capturing the perspectives of the other engineers who were engaged.

Unfortunately, most organizations haven’t committed the resources to support doing independent incident investigations. I happen to be privileged enough to work at a company that has hired specialists who are skilled at doing independent incident investigations (J. Paul Reed and Jessica DeVita). Once upon a time (last year, to be precise), I was one of those independent incident investigators, before I transitioned back to being a software engineer.

However, even at my employer, we don’t have the resources to do an independent investigation for every single operational surprise that happens, and so the common case is still that a team has to investigate its own operational surprises.

Recently, I was one of the responders to one of these operational surprises. And, since I’m an advocate of teams putting in the effort to write up their operational surprises and share them with the org, I committed to doing that for my team.

During the operational surprise, we identified that certain database rows weren’t being updated, but we struggled to identify why they weren’t being updated. In the moment, we suspected the problem was somehow related to this function, which is responsible for updating those database rows. We believed (correctly, in hindsight) that the function was being called, because a log statement immediately preceding that function invocation appeared in the logs. But, somehow, the database updates weren’t taking effect.

In the moment, I was looking into whether there was something about the database itself that was preventing writes: perhaps some sort of database lock that was blocking updates? To investigate that, I manually translated the code from jOOQ library calls to raw SQL so I could run the queries directly against the database and see what happened.

In the end, it turned out that the problem was not related to the database itself, but to Kotlin code inside that function that was throwing an exception. It was erroring because the code made certain assumptions about the format of version strings, and those assumptions had become invalid over time. When this code hit a version string it couldn’t process, it threw an exception and triggered a transaction rollback.
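This failure mode can be sketched in Kotlin. Everything in the sketch is hypothetical: the version format, the function names, and the in-memory inTransaction helper, which stands in for a real database transaction (the actual code used jOOQ and is not shown here).

```kotlin
// Hypothetical parser that assumes versions always look like "1.2.3".
// toInt() throws NumberFormatException on a segment such as "3-rc1".
fun parseVersion(version: String): List<Int> =
    version.split(".").map { it.toInt() }

// Toy stand-in for a database transaction: changes are made on a copy
// and only committed if the block finishes without throwing.
fun inTransaction(rows: MutableMap<String, String>, block: (MutableMap<String, String>) -> Unit) {
    val working = rows.toMutableMap()
    block(working)        // an exception here skips the commit below
    rows.clear()
    rows.putAll(working)  // commit
}

fun updateRows(rows: MutableMap<String, String>, versions: List<String>) {
    println("updating rows") // this log line appears even when the update fails
    inTransaction(rows) { tx ->
        for (v in versions) {
            parseVersion(v)  // throws on an unexpected version format
            tx[v] = "updated"
        }
    }
}

fun main() {
    val rows = mutableMapOf<String, String>()
    val outcome = runCatching { updateRows(rows, listOf("1.2.3", "1.2.3-rc1")) }
    println(outcome.isFailure)
    println(rows)
}
```

In this sketch a single unparseable version string discards every change in the transaction, including the row whose version parsed fine, while the log line still shows that the function was called.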

After we remediated, when I looked back on the events of the day, I thought “Boy, I sure did waste a big chunk of time manually translating that code to SQL, when the problem wasn’t related to the database at all.”

Later on, when I put my incident investigator hat on and pored over the Slack messages, I discovered something. While I was working to understand the code to translate it, I discovered that one of the queries in that function was too broad. Under normal circumstances, the broadness of the query wasn’t impacting the correctness of the function (the query after it was narrower) or the performance, but during the operational surprise it was increasing the blast radius of the issue. Narrowing the scope of that query was an important part of remediating the incident.

The thing is, until I was investigating the incident, I didn’t realize that I had learned about the broad query issue because I was working to translate the code into SQL. That work I did had real value: it helped us resolve the issue.

Ever since I was bitten by the learning from incidents bug, I’ve been a believer in the value of using an independent investigator. But this is the first time I had the first-hand experience of learning something new about my own work in resolving the incident, even though I was there, because of the post-incident investigation work. It was quite a visceral realization.

And so, while you really should take advantage of independent investigators if resources permit, if you’ve worked as an independent investigator and then transition to a role which includes incident response, I recommend trying to write up one of your own incidents, at least once. It really reinforces how much more can be learned from an incident by doing a good investigation.

Taming complexity: from contract to compact

The software contract

We software engineers love the metaphor of the contract when describing software behavior: If I give you X, you promise to give me Y in return. One example of a contract is the signature of a function in a statically typed language. Here’s a function signature in the Kotlin programming language:

fun exportArtifact(exportable: Exportable): DeliveryArtifact

This signature promises that if you call the exportArtifact function with an argument of type Exportable, the return value will be an object of type DeliveryArtifact.

Function signatures are a special case for software contracts, in that they can be enforced mechanically: the compiler guarantees that the contract will hold for any program that compiles successfully. In general, though, the software contracts that we care about can’t be mechanically checked. For example, we might talk about a contract that a particular service provides, but we don’t have tools that can guarantee that our service conforms to the contract. That’s why we have to test it.

Contracts are a type of specification: they tell us that if certain preconditions are met, the system described by the contract guarantees that certain postconditions will be met in return. The idea of reasoning about the behavior of a program using preconditions and postconditions was popularized by C.A.R. Hoare in his legendary paper An Axiomatic Basis for Computer Programming, and is known today as Hoare logic. The language of contract in the software engineering sense was popularized by Bertrand Meyer (specifically, design by contract) in his language Eiffel and his book Object-Oriented Software Construction.
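As a minimal sketch of what preconditions and postconditions can look like in Kotlin, the standard library’s require and check functions state them explicitly at runtime. The Exportable and DeliveryArtifact definitions below, their fields, and the specific conditions are hypothetical stand-ins, since the real types aren’t shown here.

```kotlin
// Hypothetical stand-ins for the types in the signature above.
data class Exportable(val name: String, val version: String)
data class DeliveryArtifact(val reference: String)

fun exportArtifact(exportable: Exportable): DeliveryArtifact {
    // Preconditions: the caller's side of the contract.
    // require() throws IllegalArgumentException when the condition is false.
    require(exportable.name.isNotBlank()) { "name must not be blank" }
    require(exportable.version.isNotBlank()) { "version must not be blank" }

    val artifact = DeliveryArtifact("${exportable.name}:${exportable.version}")

    // Postcondition: the implementation's side of the contract.
    // check() throws IllegalStateException when the condition is false.
    check(artifact.reference.startsWith(exportable.name)) {
        "reference must start with the exportable's name"
    }
    return artifact
}

fun main() {
    println(exportArtifact(Exportable("api", "1.2.3")))
    // A client that violates a precondition gets an exception, not an artifact.
    val result = runCatching { exportArtifact(Exportable("", "1.2.3")) }
    println(result.exceptionOrNull())
}
```

Unlike the type signature, these conditions are only checked when the code runs, so a violation surfaces as an exception rather than a compile error.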

We software engineers like contracts because they help us reason about the behavior of a system. Instead of requiring us to understand the complete details of a system that we interact with, all we need to do is understand the contract.

For a given system, it’s easier to reason about its behavior given a contract than from implementation details.

Contracts, therefore, are a form of abstraction. In addition, contracts are composable: we can feed the outputs of system X into system Y if the postconditions of X are consistent with the preconditions of Y. Because we can compose contracts, we can use them to help us build systems out of parts that are described by contracts. Contracts are a tool that enables us humans to work together to build software systems that are too complex for any individual human to understand.
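Here is a small Kotlin sketch of that kind of composition, where the output of one function is safe to feed into another because the first one’s postcondition satisfies the second one’s precondition. Both functions are hypothetical examples invented for illustration.

```kotlin
// Hypothetical system X. Postcondition: the result is non-negative.
fun absoluteValue(n: Int): Int {
    require(n != Int.MIN_VALUE) { "n must have a representable absolute value" }
    val result = if (n < 0) -n else n
    check(result >= 0) { "result must be non-negative" }
    return result
}

// Hypothetical system Y. Precondition: the input is non-negative.
fun integerSquareRoot(n: Int): Int {
    require(n >= 0) { "n must be non-negative" }
    var root = 0
    // Use Long arithmetic so that squaring the candidate cannot overflow.
    while ((root + 1).toLong() * (root + 1) <= n) root++
    return root
}

// X's postcondition satisfies Y's precondition, so the two can be composed
// by reasoning about the contracts alone, without reading either body.
fun rootOfMagnitude(n: Int): Int = integerSquareRoot(absoluteValue(n))

fun main() {
    println(rootOfMagnitude(-17))
}
```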

When contracts aren’t useful

Alas, contracts aren’t much use for reasoning about system behavior when either of the following two conditions holds:

  1. A system’s implementation doesn’t fully conform to its contract.
  2. The precondition of a system’s contract is violated by a client.
An example of a contract where a precondition (number of allowed dependencies) was violated

Whether a problem falls into the first or second condition is a judgment call. Either way, your system is now in a bad state.

A contract is of no use for a system that has gotten into a bad state.

A system that has gotten into a bad state is violating its contract, pretty much by definition. This means we must now deal with the implementation details of the system in order to get it back into a good state. Since no one person understands the entire system, we often need the help of multiple people to get the system back into a good state.

Operational surprises often require that multiple engineers work together to get the system back into a good state

Since contracts can’t help us here, we deal with the complexity by leveraging the fact that different engineers have expertise in different parts of the system. By working together, we are pooling the expertise of the engineers. To pull this off, the engineers need to coordinate effectively. Enter the Basic Compact.

The Basic Compact and requirements of coordination

Gary Klein, Paul Feltovich and David Woods defined the Basic Compact in their paper Common Ground and Coordination in Joint Activity:

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment for all parties to support the process of coordination. The Basic Compact is an agreement (usually tacit) to participate in the joint activity and to carry out the required coordination responsibilities.

One example of a joint activity is… when engineers assemble to resolve an incident! In doing so, they enter a Basic Compact: to work together to get the system back into a stable state. Working together on a task requires coordination, and the paper authors list three primary requirements to coordinate effectively on a joint activity: interpredictability, common ground, and directability.

The Basic Compact is also a commitment to ensure a reasonable level of interpredictability. Moreover, the Basic Compact requires that if one party intends to drop out of the joint activity, he or she must inform the other parties.

Interpredictability is about being able to reason about the behavior of other people, and behaving in such a way that your behavior is reasonable to others. As with the world of software contracts, being able to reason about behavior is critical. Unlike software contracts, here we are reasoning about agents rather than artifacts, and those agents are also reasoning about us.

The Basic Compact includes an expectation that the parties will repair faulty knowledge, beliefs and assumptions when these are detected.

Each engineer involved in resolving an incident has beliefs about both the system state and the beliefs of other engineers involved. Keeping mutual beliefs up to date requires coordination work.

During an incident, the responders need to maintain a shared understanding about information such as the known state of the system and what mitigations people are about to attempt. The authors use the term common ground to describe this shared understanding. Anyone who has been in an on-call rotation will find the following description familiar:

All parties have to be reasonably confident that they and the others will carry out their responsibilities in the Basic Compact. In addition to repairing common ground, these responsibilities include such elements as acknowledging the receipt of signals, transmitting some construal of the meaning of the signal back to the sender, and indicating preparation for consequent acts.

Maintaining common ground during an incident takes active effort on behalf of the participants, especially when we’re physically distributed and the situation is dynamic: where the system is not only in a bad state, but it’s in a bad state that’s changing over time. Misunderstandings can creep in, which the authors describe as a common ground breakdown that requires repair to make progress.

A common ground breakdown can mean the difference between a resolution time of minutes and hours. I recall an incident I was involved with, where an engineer made a relevant comment in Slack early on during the incident, and I missed its significance in the moment. In retrospect, I don’t know if the engineer who sent the message realized that I hadn’t properly processed its implications at the time.

Directability refers to deliberate attempts to modify the actions of the other partners as conditions and priorities change.

Imagine a software system has gone unhealthy in one geographical region, and engineer X begins to execute a failover to remediate. Engineer Y notices customer impact in the new region, and types into Slack, “We’re now seeing a problem in the region we’re failing into! Abort the failover!” This is an example of directability, which describes the ability of one agent to affect the behavior of another agent through signaling.

Making contracts and compacts first class

Both contracts and compacts are tools to help deal with complexity. People use contracts to help reason about the behavior of software artifacts. People use the Basic Compact to help reason about each other’s behavior when working together to resolve an incident.

I’d like to see both contracts and compacts get better treatment as first-class concerns. For contracts, there still isn’t a mainstream language with first-class support for preconditions and postconditions, although some non-Eiffel languages do support them (Clojure and D, for example). There’s also Pact, which bills itself as a contract testing tool. It sounds interesting, but I haven’t had a chance to play with it.

For coordination (compacts), I’d like to see explicit recognition of the difficulty of coordination and the significant role it plays during incidents. One of the positive outcomes of the growing popularity of resilience engineering and the learning from incidents in Software movement is the recognition that coordination is a critical activity that we should spend more time learning about.

Further reading and watching

Common Ground and Coordination in Joint Activity is worth reading in its entirety. I only scratched the surface of the paper in this post. John Allspaw gave a great Papers We Love talk on this paper.

Laura Maguire has done some recent PhD work on managing the hidden costs of coordination. She also gave a talk at QCon on the subject.

Ten challenges for making automation a “team player” in joint human-agent activity is a paper that explores the implications of building software agents that are capable of coordinating effectively with humans.

An Axiomatic Basis for Computer Programming is worth reading to get a sense of the history of preconditions and postconditions. Check out Jean Yang’s Papers We Love talk on it.


Even the U.S. military

In 2019, ProPublica published a deeply researched series of stories called Disaster in the Pacific: Death and Neglect in the 7th Fleet about fatal military accidents at sea. As in all accidents, there are many contributing factors, as detailed in these stories. In this post I’m going to focus on one particular factor, as illustrated in the following story excerpts (emphasis mine):

The December 2018 flight was part of a week of hastily planned exercises that would test how prepared Fighter Attack Squadron 242 was for war with North Korea. But the entire squadron, not just Resilard, had been struggling for months to maintain their basic skills. Flying a fighter jet is a highly perishable skill, but training hours had been elusive. Repairs to jets were delayed. Pleadings up the chain of command for help and relief went ignored.

“Everyone believes us to be under-resourced, under-manned,” the squadron’s commander wrote to his superiors months earlier.

Faulty Equipment, Lapsed Training, Repeated Warnings: How a Preventable Disaster Killed Six Marines by Robert Faturechi, Megan Rose and T. Christian Miller, December 30, 2019

The review offered a critique of the Navy’s drive to save money by installing new technology rather than investing in training for its sailors.

“There is a tendency of designers to add automation based on economic benefits (e.g., reducing manning, consolidating discrete controls, using networked systems to manage obsolescence),” the report said, “without considering the effect to operators who are trained and proficient in operating legacy equipment.”

Collision Course by T. Christian Miller, Megan Rose, Robert Faturechi and Agnes Chang, December 20, 2019

The fleet was short of sailors, and those it had were often poorly trained and worked to exhaustion. Its warships were falling apart, and a bruising, ceaseless pace of operations meant there was little chance to get necessary repairs done. The very top of the Navy was consumed with buying new, more sophisticated ships, even as its sailors struggled to master and hold together those they had. The Pentagon, half a world away, was signing off on requests for ships to carry out more and more missions.

The risks were obvious, and Aucoin repeatedly warned his superiors about them. During video conferences, he detailed his fleet’s pressing needs and the hazards of not addressing them. He compiled data showing that the unrelenting demands on his ships and sailors were unsustainable. He pleaded with his bosses to acknowledge the vulnerability of the 7th Fleet.

Years of Warnings, Then Death and Disaster by Robert Faturechi, Megan Rose and T. Christian Miller, February 7, 2019

Then there was the crew. In those eight months, nearly 40 percent of the Fitzgerald’s crew had turned over. The Navy replaced them with younger, less-seasoned sailors and officers, leaving the Fitzgerald with the highest percentage of new crew members of any destroyer in the fleet. But naval commanders had skimped even further, cutting into the number of sailors Benson needed to keep the ship running smoothly. The Fitzgerald had around 270 people total — short of the 303 sailors called for by the Navy.

Key positions were vacant, despite repeated requests from the Fitzgerald to Navy higher-ups. The senior enlisted quartermaster position — charged with training inexperienced sailors to steer the ship — had gone unfilled for more than two years. The technician in charge of the ship’s radar was on medical leave, with no replacement. The personnel shortages made it difficult to post watches on both the starboard and port sides of the ship, a once-common Navy practice.

When the ship set sail in February 2017, it was supposed to be for a short training mission for its green crew. Instead, the Navy never allowed the Fitzgerald to return to Yokosuka. North Korea was launching missiles on a regular basis. China was aggressively sending warships to pursue its territorial claims to disputed islands off its coast. Seventh Fleet commanders deployed the Fitzgerald like a pinch hitter, repeatedly assigning it new missions to complete.

Death and Valor on an American Warship Doomed by its Own Navy, by T. Christian Miller, Megan Rose and Robert Faturechi, February 6, 2019

The U.S. Department of Defense may be the best-resourced organization in all of human history, with a 2020 budget of $738 billion. And yet, despite this fact, we still see a lack of resources as a contributing factor in the fatal U.S. military accidents described above.

The brutal reality is that being well resourced does not exempt an organization from production pressures! Instead, a heavily resourced organization will have a larger scope: it will be asked to do more. As described in one of these excerpts, the Navy was focused on procuring new ships, at the expense of the state of the existing ones.

Lawrence Hirschhorn made the observation that every system is stretched to operate at its capacity, which is known as the law of stretched systems. Being given more resources means that you will eventually be asked to do more.

Not even the mighty U.S. Department of Defense can escape the adaptive universe.

Battleshorts, exaptations, and the limits of STAMP

A couple of threads got me thinking about the limits of STAMP.

The first thread was sparked by a link to a Hacker News comment, sent to me by a colleague of mine, Danny Thomas. This introduced me to a concept I hadn’t heard of before, a battleshort. There’s even an official definition in a NATO document:

The capability to bypass certain safety features in a system to ensure completion of the mission without interruption due to the safety feature

AOP-38, Allied Ordnance Publication 38, Edition 3, Glossary of terms and definitions concerning the safety and suitability for service of munitions, explosives and related products, April 2002.

The second thread was sparked by a Twitter exchange between a UK Southern Railway train driver and the official UK Southern Railway Twitter account:

This is a great example of exapting, a concept introduced by the paleontologists Stephen Jay Gould and Elisabeth Vrba. Exaptation is a solution to the following problem in evolutionary biology: what good is a partially functional wing? Either an animal can fly or it can’t, and a fully functional wing can’t evolve in a single generation, so how do the initial evolutionary stages of a wing confer advantage on the organism?

The answer is that, while a partially functional wing might be useless for flight, it might still be useful as a fin. And so, if wings evolved from fins, then the appendage may always confer an advantage at each evolutionary stage. The fin is exapted into a wing; it is repurposed to serve a new function. In the Twitter example above, the railway driver repurposed a social media service for communicating with his own organization.

Which brings us back to STAMP. One of the central assumptions of STAMP is that it is possible to construct an accurate enough control model of the system at the design stage to identify all of the hazards and unsafe control actions. You can see this assumption in action in the CAST handbook (CAST is STAMP’s accident analysis process) in the example questions from page 40 of the handbook (emphasis mine), which use counterfactual reasoning to try to identify flaws in the original hazard analysis.

Did the design account for the possibility of this increased pressure? If not, why not? Was this risk assessed at the design stage?

This seems like a predictable design flaw. Was the unsafe interaction between the two requirements (preventing liquid from entering the flare and the need to discharge gases to the flare) identified in the design or hazard analysis efforts? If so, why was it not handled in the design or in operational procedures? If it was not identified, why not?

Why wasn’t the increasing pressure detected and handled? If there were alerts, why did they not result in effective action to handle the increasing pressure? If there were automatic overpressurization control devices (e.g., relief valves), why were they not effective? If there were not automatic devices, then why not? Was it not feasible to provide them?

Was this type of pressure increase anticipated? If it was anticipated, then why was it not handled in the design or operational procedures? If it was not anticipated, why not?

Was there any way to contain the contents within some controlled area (barrier), at least the catalyst pellets?

Why was the area around the reactor not isolated during a potentially hazardous operation? Why was there no protection against catalyst pellets flying around?

This line of reasoning assumes that all hazards are, in principle, identifiable at the design stage. I think that phenomena like battleshorts and exaptations make this goal unattainable.

Now, in principle, nothing prevents an engineer using STPA (STAMP’s hazard analysis technique) from identifying scenarios that involve battleshorts and exaptations. After all, STPA is an exploratory technique. But I suspect that many of these kinds of adaptations are literally unimaginable to the designers.

Programming means never getting to say “it depends”

Consider the following scenario:

You’re on a team that owns and operates a service. It’s the weekend, and you’re not on call, and you’re out and about in the world, without your laptop. To alleviate boredom, you pick up your phone and look at your service dashboard, because you’re the kind of person that checks the dashboard every so often, even when you’re not on-call. You notice something unusual: the rate of blocked database transactions has increased significantly. This is the kind of signal you would look into further if you were sitting in front of a computer. Since you don’t have a computer handy, you open up Slack on your phone with the intention of pinging X, your colleague who is on-call for the day. When you enter your team channel, you notice that X is already in the process of investigating an issue with the service.

The question is: do you send X a Slack message about the anomalous signal you saw?

If you asked me this question (say, in an interview), I’d answer “it depends”. If I believed that X was already aware there were elevated blocked transactions, then I wouldn’t send them the message. On the other hand, if I thought they were not aware that blocked transactions were one of the symptoms, and I thought this was useful information to help them figure out what the underlying problem was, then I would send the message.

Now imagine you’re implementing automated alerts for your system. You need to make the same sort of decision: when do you redirect the attention of (i.e., page) the on-call?

One answer is: “only page when you can provide information that needs to be acted on imminently, that the on-call doesn’t already have”. But this isn’t a very practical requirement, because the person implementing the alert is never going to have enough information to make an accurate judgment about what the on-call knows. You can use heuristics (e.g., suppress a page if a related page recently fired), but you can never know for certain whether that other page was “related” or was actually a different problem that happens to be coincident.

And therein lies the weakness of automation. The designer implementing the automation will never have enough information at design time to make it decide optimally, because we can’t build automation today that has access to information like “what the on-call currently believes about the state of the world” or that can follow a rule like “disable the cluster if that’s what the user genuinely intended”. Even humans can’t build perfectly accurate mental models of the mental models of their colleagues, but we can do a lot better than software. For example, we can interpret Slack messages written by our colleagues to get a sense of what they believe.

Since the automation systems we build won’t ever have access to all of the inputs they need to make optimal decisions, when we are designing for cases that are ambiguous, we have to make a judgment call about what to do. We need to use heuristics such as “what is the common case” or “how can we avoid the worst case?” But if statements in our code never allow us to express “it depends on information inaccessible to the program”.
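To make the page-suppression heuristic mentioned above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Pager class, the should_page method, the service names, and the 15-minute window do not come from any real paging system. The point is what the if statement has to settle for: "same service, recently paged" stands in for "the on-call already knows about this".

```python
from dataclasses import dataclass, field


@dataclass
class Pager:
    """Hypothetical pager that suppresses pages that look 'related'.

    The class, its fields, and the default window are illustrative
    assumptions, not the API of any real alerting tool.
    """

    window_seconds: float = 15 * 60
    # Maps a service name to the time (in seconds) of its last delivered page.
    last_paged: dict = field(default_factory=dict)

    def should_page(self, service: str, now: float) -> bool:
        last = self.last_paged.get(service)
        # Heuristic: a page for the same service inside the window is
        # assumed to be related, so the on-call presumably already knows.
        # The code cannot check whether that assumption is actually true.
        if last is not None and now - last < self.window_seconds:
            return False
        self.last_paged[service] = now
        return True


pager = Pager()
print(pager.should_page("orders-db", now=0.0))     # True: first page
print(pager.should_page("orders-db", now=300.0))   # False: suppressed
print(pager.should_page("orders-db", now=1200.0))  # True: window elapsed
```

The second call is suppressed whether or not it is the same underlying problem. If it happens to be a different, coincident failure, the heuristic silently makes the wrong call, which is exactly the "it depends" that the program has no way to express.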

Abstractions and implicit preconditions

One of my favorite essays by Joel Spolsky is The Law of Leaky Abstractions. He phrases the law as:

All non-trivial abstractions, to some degree, are leaky.

One of the challenges with abstractions is that they depend upon preconditions: the world has to be in a certain state for the abstraction to hold. Sometimes the consumer of the abstraction is explicitly aware of the precondition, but sometimes they aren’t. After all, the advantage of an abstraction is that it hides information. NFS allows a user to access files stored remotely without having to know networking details. Except when there’s some sort of networking problem, and the user is completely flummoxed. The advantage of not having to know how NFS works has become a liability.

The problem of implicit preconditions is everywhere in complex systems. We are forever consuming abstractions that have a set of preconditions that must be true for the abstraction to work correctly. Poke at an incident, and you’ll almost always find an implicit precondition. Something we didn’t even know about, that always has to be true, that was always true, until now.

Abstractions make us more productive, and, indeed, we humans can’t build complex systems without them. But we need to be able to peel away the abstraction layers when things go wrong, so we can discover the implicit precondition that’s been violated.
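As a toy illustration of an implicit precondition, consider the following Python sketch. The function names and the latency scenario are hypothetical, invented for this example. The first function quietly assumes that at least one sample was collected; the second states that assumption out loud, so that a violation points at the precondition rather than at an arithmetic error several layers down.

```python
def mean_latency_ms(samples):
    # Implicit precondition: `samples` is non-empty. Nothing here says so,
    # and it holds for as long as the collector upstream keeps working.
    return sum(samples) / len(samples)


def mean_latency_ms_checked(samples):
    # Same computation, with the precondition made explicit.
    if not samples:
        raise ValueError("precondition violated: no latency samples collected")
    return sum(samples) / len(samples)


print(mean_latency_ms_checked([10.0, 20.0]))  # 15.0
```

When the precondition breaks, the first version fails with a ZeroDivisionError that says nothing about missing samples. The checked version does not prevent the failure, but it makes peeling back the layer a little easier.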

A nudge in the right direction

I was involved in an operational surprise a few weeks ago where some of my colleagues, while not directly involved in handling the incident, nudged us in directions that helped with quick remediation.

In one case, a colleague suggested moving the discussion into a different Slack channel, and in another case, a colleague focused the attention on a potential trigger: some newly inserted database records.

I also remember another operational surprise where an experienced engineer asked someone in Slack, “Hey, there’s a new person on our team, can you explain what X means?”, and the response kicked off a series of events that brought in someone else who had more context, which led to the surprise being remediated much more quickly.

These sorts of nudges fly under our radar, and so they’re easy to miss. But they can make the difference between an operational surprise with no customer impact and a multi-hour outage, and they can be contingent on the right person happening to be in the right Slack channel at the right time, seeing the right message.

Unless we treat this sort of activity as first class when looking at incidents, we won’t really understand how it can be that some incidents get resolved so quickly and some take much longer.