Incident analysis as guerrilla case study research

Today I tweeted this:

To which Sasha Rosenbaum asked:

This post is my response.

We seldom have time for introspection at work. If we’re lucky, we have the opportunity to do some kind of retrospective at the end of a project or sprint. But, generally speaking, we’re too busy working to spend time examining that work.

One exception to this is incidents: organizations are willing to spend effort on introspection after an incident happens. That’s because incidents are unsettling: people feel uneasy that the system failed in a way they didn’t expect.

And so, an organization is willing to spend precious engineering cycles in order to rid itself of the uneasy feeling that comes with a system failing unexpectedly. Let’s make sure this never happens again.

Incident analysis, in the learning from incidents in software (LFI) sense, is about using an incident as an opportunity to get a better understanding of how the overall system works. It’s a kind of case study, where the case is the incident. The incident acts as a jumping-off point for an analyst to study an aspect of the system. Just like any other case study, it involves collecting and synthesizing data from multiple sources (e.g., interviews, chat transcripts, metrics, code commits).

I call it a guerrilla case study because, from the organization’s perspective, the goal is really to get closure, to have a sense that all is right with the world. People want to get to a place where the failure mode is now well-understood and measures will be put in place to prevent it from happening again. As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Ideally, organizations would recognize the value of this sort of work, and would make it explicit that the goal of incident analysis is to learn as much as possible. They’d also invest in other types of studies that look into how the overall system works. Alas, that isn’t the world we live in, so we have to sneak this sort of work in where we can.

The British General’s problem

Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.
“Forward, the Light Brigade!
Charge for the guns!” he said:
Into the valley of Death
Rode the six hundred.

“Forward, the Light Brigade!”
Was there a man dismay’d?
Not tho’ the soldier knew
Some one had blunder’d:
Theirs not to make reply,
Theirs not to reason why,
Theirs but to do and die:
Into the valley of Death
Rode the six hundred.

Cannon to right of them,
Cannon to left of them,
Cannon in front of them
Volley’d and thunder’d;
Storm’d at with shot and shell,
Boldly they rode and well,
Into the jaws of Death,
Into the mouth of Hell
Rode the six hundred.

Flash’d all their sabres bare,
Flash’d as they turn’d in air
Sabring the gunners there,
Charging an army, while
All the world wonder’d:
Plunged in the battery-smoke
Right thro’ the line they broke;
Cossack and Russian
Reel’d from the sabre-stroke
Shatter’d and sunder’d.
Then they rode back, but not
Not the six hundred.

Cannon to right of them,
Cannon to left of them,
Cannon behind them
Volley’d and thunder’d;
Storm’d at with shot and shell,
While horse and hero fell,
They that had fought so well
Came thro’ the jaws of Death,
Back from the mouth of Hell,
All that was left of them,
Left of six hundred.

When can their glory fade?
O the wild charge they made!
All the world wonder’d.
Honor the charge they made!
Honor the Light Brigade,
Noble six hundred!

The Charge of the Light Brigade, Alfred Lord Tennyson

One of the challenges of building distributed systems is that communications channels are unreliable. This challenge is captured in what is known as the Two Generals’ Problem. It goes something like this:

Two generals from the same army are camped at opposite sides of a valley. The enemy troops are stationed in the valley. The generals need to coordinate their attack in order to succeed. Neither general will risk their troops unless they know for certain that the other general will attack at the same time.

The only mechanism the generals have to communicate is by sending messages via carrier pigeons that fly over the valley. The problem is that enemy archers in the valley are sometimes able to shoot down these carrier pigeons, so a general who sends a message has no guarantee that it will be received by the other general.

Two generals at opposing sides of a valley
One general sends a message to another across the valley via carrier pigeon

The first general sends the message: Let’s attack Monday at dawn, and waits for an acknowledgment. If he doesn’t get an acknowledgment, he can’t be certain the message was received, and so he can’t attack. Suppose he does receive an acknowledgment. Now he thinks, “What if the other general doesn’t know I’ve received the acknowledgment? He knows I won’t carry out the attack unless I know for certain that he agrees to the plan, and he doesn’t know whether I’ve received an acknowledgment or not. I’d better send a message acknowledging the acknowledgment.”

This becomes an infinite regress problem: it turns out that no number of acknowledgments and acknowledgments-of-acknowledgments can ensure that the two generals have common knowledge (“I know that he knows that I know that he knows…”) when communicating over an unreliable channel.
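The regress can be made concrete with a small Kotlin simulation (my own sketch, not part of the classic problem statement): messages and acknowledgments cross a lossy channel, and whoever sent the most recent message is always left unable to tell a lost message from a lost reply.

```kotlin
import kotlin.random.Random

// Toy simulation: each round, the current sender releases a pigeon that is lost
// with some probability. No matter when the run stops, whoever sent the most
// recent message cannot distinguish "my message was lost" from "the reply was lost".
fun main() {
    val lossProbability = 0.3
    val maxRounds = 10               // plan, ack, ack-of-ack, ...

    var lastSender = "general A"     // A sends the initial attack plan
    for (round in 1..maxRounds) {
        lastSender = if (round % 2 == 1) "general A" else "general B"
        if (Random.nextDouble() < lossProbability) {
            println("round $round: $lastSender's pigeon was shot down")
            break
        }
        println("round $round: $lastSender's message got through")
    }
    println("$lastSender is still waiting on a reply, and cannot be certain their last message arrived")
}
```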

One implicit assumption of the Two Generals’ Problem is that when a message is received, it is understood perfectly by the recipient. In the world of real systems, understanding the intent of a message is often a challenge. I’m going to call this the British General’s Problem, in honor of General Raglan.

Raglan was a British general during the Crimean War. The messages he sent to Lord Lucan at the front were misinterpreted, and these misinterpretations contributed to a disastrous attack by British light cavalry against fortified Russian troops, famously memorialized in Alfred Lord Tennyson’s poem at the top of this post.

Stick figure holding a doc and confused by it
Lord Lucan is confused by General Raglan’s message

Tim Harford tells the story of the misunderstanding, and the different personalities involved in the affair, in the wonderful episode of the Cautionary Tales podcast The Curse of Knowledge meets the Valley of Death.

This story is a classic example of a common ground breakdown, a phenomenon described in the famous paper Common Ground and Coordination in Joint Activity by Gary Klein, Paul Feltovich, Jeffrey Bradshaw and David Woods. In the paper, the authors describe how common ground is necessary for people to coordinate effectively to achieve a shared goal. A common ground breakdown happens when a misunderstanding occurs among the people coordinating.

If you do operations work, and you’ve been involved in remediating an incident while communicating over Slack, you’ve experienced the challenge of maintaining common ground. Because common ground breakdowns happen all the time, coordination requires ongoing effort to repair common ground. This is the topic of Laura Maguire’s PhD thesis: Controlling the Costs of Coordination in Large-scale Distributed Software Systems (which, I must admit, I haven’t yet read).

The British General’s problem is a reminder of the challenges we inevitably face in socio-technical systems. The real problem is not unreliable channels: it’s building and maintaining the shared understanding needed to get the coordination work done.

Putting on a different hat

(This post was inspired by a conversation I had with a colleague).

On the evening before the launch of the space shuttle Challenger, representatives from NASA and the engineering contractor Thiokol held a telecon to discuss concerns about the low overnight temperatures forecast at the launch site, and whether to proceed with the launch or cancel it. On the call, there’s an infamous exchange between two Thiokol executives:

It’s time to take off your engineering hat and put on your management hat.

Senior Vice President Jerry Mason to Vice President of Engineering Robert Lund

The quote implies a conflict between the prudence of engineering and management’s reckless indifference to risk. The story is more complex than this quote suggests, as the sociologist Diane Vaughan discovered in her study of NASA’s culture. Here’s a teaser of her research results:

Contradicting conventional understandings, we find that (1) in every [Flight Readiness Review], Thiokol engineers brought forward recommendations to accept risk and fly and (2) rather than amoral calculation and misconduct, it was a preoccupation with rules, norms, and conformity that governed all facets of controversial managerial decisions at Marshall during this period.

Diane Vaughan, The Challenger Launch Decision

But this blog post isn’t about the Challenger, or the contrasts between engineering and management. It’s about the times when we need to change hats.

I’m a fan of the you-build-it-you-run-it approach to software services, where the software engineers are responsible for operating the software they write. But solving ops problems isn’t like solving dev problems: the tempo and the skills involved are different, and they require different mindsets.

This difference is particularly acute for a software engineer when a change that they made contributed to an ongoing incident. Incidents are high pressure situations, and even someone in the best frame of mind can struggle with the challenges of making risky decisions under uncertainty. But if you’re thinking, “Argh, the service is down, and it’s all my fault!“, then your effectiveness is going to suffer. Your head’s not going to be in the right place.

And yet, these moments are exactly when it’s most important to be able to make the context switch between dev work and ops work. If someone took an action that triggered an outage, chances are good that they’re the person on the team who is best equipped to remediate, because they have the most context about the change.

Being the one who pushed the change that takes down the service sucks. But when we are most inclined to spend mental effort blaming ourselves is exactly when we can least afford to. Instead, we have to take off the dev hat, put on the ops hat, and do what we can to get our heads in the game. Because blaming ourselves in the moment isn’t going to make it any easier to get that service back up.

Slack’s Jan 2021 outage: a tale of saturation

Laura Nolan of Slack recently published an excellent write-up of their Jan. 4, 2021 outage on Slack’s engineering blog.

One of the things that struck me about this writeup is the contributing factors that aren’t part of this outage. There’s nothing about a bug that somehow made its way into production, or an accidentally incorrect configuration change, or how some corrupt data ended up in the database. On the other hand, it’s an outage story with multiple examples of saturation.

Saturation is a term often used by the safety science researcher David Woods: it refers to a system that is reaching the limit of what it can handle. If you’ve done software operations work, I bet you’ve encountered resource exhaustion, which is an example of saturation.

Saturation plays a big role in Woods’s model of the adaptive universe. In particular, in socio-technical systems, people will adapt in order to reduce the risk of saturation. In this post, I’m going to walk through Laura’s write-up, highlighting all of the examples of saturation and how the system adapted to it. I’m going purely from the text of the original write-up, which means I’ll likely get some things wrong here.

Slack runs their infrastructure on AWS. In the beginning, they (like, I presume, all small companies) started with a single AWS account. And, initially, this worked out well.

In the beginning, Slack’s AWS footprint fit nicely into one account and VPC: that’s one happy cloud!

However, as Slack grew, it encountered problems. Here’s a quote from Slack’s blog post Building the Next Evolution of Cloud Networks at Slack by Archie Gunasekara:

As our customer base grew and the tool evolved, we developed more services and built more infrastructure as needed. However, everything we built still lived in one big AWS account. This is when our troubles started. Having all our infrastructure in a single AWS account led to AWS rate-limiting issues, cost-separation issues, and general confusion for our internal engineering service teams.

That cloud isn’t looking so happy anymore

The above quote makes reference to three different categories of saturation. The first is a traditional sort of limit we software folks think of: they were running into AWS rate limits associated with an individual AWS account.

The other two limits are cognitive: the system made it harder for humans to separate out costs, and it led to confusion for internal teams. I still see these as a form of saturation: as a system gets more difficult for humans to deal with, it effectively increases the cost of using the system, and it makes errors more likely.

And so, the Slack Cloud Engineering team adapted to meet this saturation risk by adopting AWS child accounts. From the linked blog post again:

Now the service teams could request their own AWS accounts and could even peer their VPCs with each other when services needed to talk to other services that lived in a different AWS account.

Multiple happy child accounts

With continued growth, they eventually reached saturation again. Once again, this was the “it’s getting too hard” sort of saturation:

Having hundreds of AWS accounts became a nightmare to manage when it came to CIDR ranges and IP spaces, because the mis-management of CIDR ranges meant that we couldn’t peer VPCs with overlapping CIDR ranges. This led to a lot of administrative overhead.

VPC peering with multiple VPCs increases the cognitive load on the networking engineers

To deal with this risk of saturation, the cloud engineering team adapted again. They reached for new capabilities: AWS shared VPCs and AWS Transit Gateway Inter-Region Peering. By leveraging these technologies, they were able to design a network architecture that addressed their problems:

This solved our earlier issue of constantly hitting AWS rate limits due to having all our resources in one AWS account. This approach seemed really attractive to our Cloud Engineering team, as we could manage the IP space, build VPCs, and share them with our child account owners. Then, without having to worry about managing any of the overhead of setting up VPCs, route tables, or network access lists, teams were able to utilize these VPCs and build their resources on top of them.

Example of a transit gateway connecting multiple VPCs. Obviously, this does not accurately depict Slack’s actual network architecture.

Fast forward several months later. From Laura Nolan’s post:

On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight. Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough.

Traffic surge overwhelms the transit gateway

This is as clear an example of saturation as you can get: the incoming load increased faster than the transit gateways were able to cope. What’s really fascinating from this point on is the role that saturation plays in interactions with the rest of the system.

As too many of us know, clients experience a saturated network as an increase in latency. When network latency goes up, the threads in a service spend more of their time sitting there waiting for the bits to come over the network, which means that CPU utilization goes down.

Slack’s web tier autoscales on CPU utilization, so when the network started dropping packets, the instances in the web tier spent more of their time blocked, and CPU went down, which triggered the AWS autoscaler to downscale the web tier.

A server’s CPU utilization falls as it waits for packets that are being dropped by the overloaded transit gateway

However, the web tier has a scaling policy that rapidly upscales if thread utilization gets too high. (At Netflix, we use the term hammer rule to describe this type of emergency scale-up rule.)

Too many threads are in use, waiting for those packets. Time for the scale-up hammer!
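Here’s a toy Kotlin model of that dynamic, with illustrative numbers of my own rather than Slack’s actual instance sizes or scaling policies. Under a healthy network the CPU signal dominates; under a saturated network, throughput drops, CPU falls, and thread utilization pegs, so a CPU-based policy wants to scale down at the same moment a thread-utilization hammer rule wants to scale up.

```kotlin
// Illustrative numbers only; not Slack's actual configuration or scaling policies.
fun main() {
    val threads = 100            // worker threads per instance (assumed)
    val cores = 8                // vCPUs per instance (assumed)
    val cpuMsPerRequest = 5.0    // CPU actually burned per request (assumed)
    val offeredRps = 400.0       // requests per second arriving at one instance

    fun report(label: String, latencyMs: Double) {
        // A thread is tied up for the full request latency (compute plus waiting on
        // the network), so throughput is capped at threads * (1000 / latency).
        val servedRps = minOf(offeredRps, threads * 1000.0 / latencyMs)
        val threadUtil = servedRps * latencyMs / 1000.0 / threads
        val cpuUtil = servedRps * cpuMsPerRequest / 1000.0 / cores
        println("$label: cpu=${(cpuUtil * 100).toInt()}% threads=${(threadUtil * 100).toInt()}%")
    }

    report("healthy network, 20ms latency", 20.0)     // cpu=25%, threads=8%
    report("saturated network, 500ms latency", 500.0) // cpu=12%, threads=100%: scale-down meets scale-up
}
```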

Once the new instances come online, an internal Slack service named provision-service is responsible for setting up these new instances so that they can serve traffic. And here, we see more saturation issues (emphasis mine).

Provision-service needs to talk to other internal Slack systems and to some AWS APIs. It was communicating with those dependencies over the same degraded network, and like most of Slack’s systems at the time, it was seeing longer connection and response times, and therefore was using more system resources than usual. The spike of load from the simultaneous provisioning of so many instances under suboptimal network conditions meant that provision-service hit two separate resource bottlenecks (the most significant one was the Linux open files limit, but we also exceeded an AWS quota limit).

While we were repairing provision-service, we were still under-capacity for our web tier because the scale-up was not working as expected. We had created a large number of instances, but most of them were not fully provisioned and were not serving. The large number of broken instances caused us to hit our pre-configured autoscaling-group size limits, which determine the maximum number of instances in our web tier.

The provision service is saturated, as is the autoscaling group, although many of the instances aren’t serving traffic because they aren’t fully provisioned.

Through a combination of robustness mechanisms (load balancer panic mode, retries, circuit breakers) and the actions of human operators, the system is restored to health.

As operators, we strive to keep our systems far from the point of saturation. As a consequence, we generally don’t have much experience with how the system behaves as it approaches saturation. And that makes these incidents much harder to deal with.

Making things worse, we can’t ever escape the risk of saturation. Often we won’t know that a limit exists until the system breaches it.

Error budgets and the legacy of Herbert Heinrich

Here’s a question that all of us software developers face: How can we best use our knowledge about the past behavior of our system to figure out where we should be investing our time?

One approach is to use a technique from the SRE world called error budgets. Here are a few quotes from the How to Use Error Budgets chapter of Alex Hidalgo’s book: Implementing Service Level Objectives:

Measuring error budgets over time can give you great insight into the risk factors that impact your service, both in terms of frequency and severity. By knowing what kinds of events and failures are bad enough to burn your error budget, even if just momentarily, you can better discover what factors cause you the most problems over time. p71 [emphasis mine]

The basic idea is straightforward. If you have error budget remaining, ship new features and push to production as often as you’d like; once you run out of it, stop pushing feature changes and focus on reliability instead. p87

Error budgets give you ways to make decisions about your service, be it a single microservice or your company’s entire customer-facing product. They also give you indicators that tell you when you can ship features, what your focus should be, when you can experiment, and what your biggest risk factors are. p92

The goal is not to only react when your users are extremely unhappy with you—it’s to have better data to discuss where work regarding your service should be moving next. p354

That sounds reasonable, doesn’t it? Look at what’s causing your system to break, and if it’s breaking too often, use that as a signal to address those issues that are breaking it. If you’ve been doing really well reliability-wise, an error budget gives you margin to do some riskier experimentation in production like chaos engineering or production load testing.
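To make the mechanics concrete, here’s a small Kotlin sketch of the underlying arithmetic, with assumed numbers (a 99.9% SLO over a 30-day window); it’s my own illustration, not something taken from Hidalgo’s book.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.days
import kotlin.time.Duration.Companion.minutes

fun main() {
    val slo = 0.999                             // assumed availability target
    val window = 30.days                        // assumed SLO window
    val budget: Duration = window * (1 - slo)   // unreliability we're allowed to "spend" (~43 minutes)
    val downtimeSoFar = 25.minutes              // hypothetical measured unreliability this window

    val remaining = budget - downtimeSoFar
    println("budget: $budget, remaining: $remaining")
    if (remaining.isNegative())
        println("budget exhausted: stop pushing features, focus on reliability")
    else
        println("budget remaining: keep shipping (or spend it on riskier experiments)")
}
```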

I have two issues with this approach, a smaller one and a larger one. I’ll start with the smaller one.

First, I think that if you work on a team where the developers operate their own code (you-build-it, you-run-it), and where the developers have enough autonomy to say, “We need to focus more development effort on increasing robustness”, then you don’t need the error budget approach to help you decide when and where to spend your engineering effort. The engineers will know where the recurring problems are because they feel the operational pain, and they will be able to advocate for addressing those pain points. This is the kind of environment that I am fortunate enough to work in.

I understand that there are environments where the developers and the operators are separate populations, or the developers aren’t granted enough autonomy to be able to influence where engineering time is spent, and that in those environments, an error budget approach would help. But I don’t swim in those waters, so I won’t say any more about those contexts.

To explain my second concern, I need to digress a little bit to talk about Herbert Heinrich.


Herbert Heinrich worked for the Travelers Insurance Company in the first half of the twentieth century. In the 1920s, he did a study of workplace accidents, examining thousands of claims made by companies that held insurance policies with Travelers. In 1931, he published his findings in a book: Industrial Accident Prevention: A Scientific Approach.

Heinrich’s work showed a relationship between the rates of near misses (no injury), minor injuries, and major injuries. Specifically: for every major injury, there are 29 minor injuries, and 300 no-injury accidents. This finding of 1:29:300 became known as the accident triangle.

My reproduction of Heinrich’s accident pyramid. To see the original, check out The Heinrich/Bird safety pyramid: Pioneering research has become a safety myth at risk-engineering.org.

One implication of the accident triangle is that the rate of minor issues gives us insight into the rate of major issues. In particular, if we reduce the rate of minor issues, we reduce the risk of major ones. Or, as Heinrich put it: Moral—prevent the accidents and the injuries will take care of themselves.

Heinrich’s work has since been criticized, and subsequent research has contradicted Heinrich’s findings. I won’t repeat the criticisms here (see Foundations of Safety Science by Sidney Dekker for details), but I will cite counterexamples mentioned in Dekker’s book:

The Deepwater Horizon offshore drilling rig saw six years of injury-free and incident-free performance before the explosion in 2010. (It even won a SAFE award from the U.S. Minerals Management Service in 2008 for its perfect safety record!)

Arnold Barnett and Alexander Wang found a negative correlation between nonfatal accident/incident rates and passenger-mortality risk among air carriers. That is, carriers that had more non-fatal incidents had a lower risk of fatalities. (Passenger-mortality Risk Estimates Provide Perspectives About Airline Safety, Flight Safety Digest, April 2000).

Antti Saloniemi and Hanna Oksanen found a negative correlation between incident rate and fatalities in the construction industry in Finland (Accidents and fatal accidents—some paradoxes, Safety Science, Volume 29, Issue 1, June 1998).

Fred Sherratt and Andrew Dainty found that construction companies in the UK that had an explicit policy of zero accidents saw more major injuries and fatal accidents than companies that did not have a zero accident policy (UK construction safety: a zero paradox?, Policy and Practice in Health and Safety, Volume 15, Issue 2, 2017).


So, what does any of this have to do with error budgets? At a glance, error budgets don’t seem related to Heinrich’s work at all. Heinrich was focused on safety, where the goal is to reduce injuries as much as possible, in some cases explicitly having a zero goal. Error budgets are explicitly not about achieving zero downtime (100% reliability), they’re about achieving a target that’s below 100%.

Here are the claims I’m going to make:

  1. Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
  2. Error budgets don’t help reduce risk of large incidents.

Here’s Heinrich’s triangle redrawn:

An error-budget-based approach only provides information on the nature of minor incidents, because those are the ones that happen most often. Near misses don’t impact the reliability metrics, and major incidents blow them out of the water.

Heinrich’s work assumed a fixed ratio between minor accidents and major ones: reduce the rate of minor accidents and you’d reduce the rate of major ones. By focusing on reliability metrics as a primary signal for providing insight into system risk, you only get information about these minor incidents. But, if there’s no relationship between minor incidents and major ones, then maintaining a specific reliability level doesn’t address the issues around major incidents at all.

An error-budget-based approach to reliability implicitly assumes there is a connection between reliability metrics and the risk of a large incident. This is the thread that connects to Heinrich: the unstated idea that doing the robustness work to address the problems exposed by the smaller incidents will decrease the risk of the larger incidents.

In general, I’m skeptical about relying on predefined metrics, such as reliability, for getting insight into the risks of the system that could lead to big incidents. Instead, I prefer to focus on signals, which are not predefined metrics but rather some kind of information that has caught your attention that suggests that there’s some aspect of your system that you should dig into a little more. Maybe it’s a near-miss situation where there was no customer impact at all, or maybe it was an offhand remark made by someone in Slack. Signals by themselves don’t provide enough information to tell you where unseen risks are. Instead, they act as clues that can help you figure out where to dig to get more details. This is what the Learning from Incidents in Software movement is about.

I’m generally skeptical of metrics-based approaches, like error budgets, because they reify. The things that get measured are the things that get attention. I prefer to rely on qualitative approaches that leverage the expert judgment of engineers. The challenge with qualitative approaches is that you need to expose the experts to the information they need (e.g., putting the software engineers on-call), and they need the space to dig into signals (e.g., allow time for incident analysis).

Making sense of what happened is hard

Scott Nasello recently introduced me to Dr. Hannah Harvey’s The Art of Storytelling. I’m about halfway through her course, and I absolutely love it, and I keep thinking about it in the context of learning from incidents. While I have long been an advocate of using narrative structure when writing up incidents, Harvey’s course focuses on oral storytelling, which is a very different sort of format.

With storytelling in mind, I started thinking about an operational surprise that happened on my team a few months ago, with the idea of using it as raw material for an oral story. But, as I reflected on it and read my own (lengthy) writeup, I realized that there was one thing I didn’t fully understand about what happened.

During the operational surprise, when we attempted to remediate the problem by deploying a potential fix into production, we hit a latent bug that had been merged into the main branch ten days earlier. As I was re-reading the writeup, there was something I didn’t understand: how did it come to be that we went ten days without promoting that code from the main branch of our repo to the production environment?

To help me make sense of what happened, I drew a diagram of the development events that led up to the surprise. Fortunately, I had documented those events thoroughly in the original writeup. Here’s the diagram I created. I used this diagram to get some insight into how bug T2, which was merged into our repo on day 0, did not manifest in production until day 10.

This diagram will take some explanation, so bear with me.

There are four bugs in this story, denoted T1, T2, A1, A2. The letters indicate the functionality associated with the PR that introduced them:

  • T1 and T2 were both introduced in a pull request (PR) that refactored some functionality related to how our service interacts with Titus.
  • A1, A2 were both introduced in a PR related to adding functionality around artifact metadata.

Note that bug T1 masked T2, and bug A1 masked A2.

There are three vertical lines, which show how the bugs propagated to different environments.

  • main (repo) represents code in the main branch of our repository.
  • staging represents code that has been deployed to our staging environment.
  • prod represents code that has been deployed to our production environment.

Here’s how the colors work:

  • gray indicates that the bug is present in an environment, but hasn’t been detected
  • red indicates that the effect of a bug has been observed in an environment. Note that if we detect a bug in the prod environment, that also tells us that the bug is in staging and the repo.
  • green indicates the bug has been fixed

If a horizontal line is red, that means there’s a known bug in that environment. For example, when we detect bug T1 in prod on day 1, all three lines go red, since we know we have a bug.

A horizontal line that is purple means that we’ve pinned to a specific version. We unpinned prod on day 10 before we deployed.

The thing I want to call out in this diagram is the color of the staging line. Once the staging line turns red on day 2, it only turns black on day 5, which is the Saturday of a long weekend, and then turns red again on the Monday of the long weekend. (Yes, some people were doing development on the Saturday and testing in staging on Monday, even though it was a long weekend. We don’t commonly work on weekends; that’s a different part of the story.)

During this ten-day period, there was only a brief window when staging was in a state we thought was good, and that was over a weekend. Since we don’t deploy on weekends unless prod is in a bad state, it makes sense that we never deployed from staging to prod until day 10.

The larger point I want to make here is that getting this type of insight from an operational surprise is hard, in the sense that it takes a lot of effort. Even though I put in the initial effort to capture the development activity leading up to the surprise when I first did the writeup, I didn’t gain the above insight until months later, when I tried to understand this particular aspect of it. I had to ask a certain question (how did that bug stay latent for so long?), then take the raw materials of the writeup and do some diagramming to visualize the pattern of activity so I could understand it. In retrospect, it was worth it. I got a lot more insight here than “root cause: latent bug”.

Now I just need to figure out how to tell this as a story without the benefit of a diagram.

Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.

And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.

The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.

However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!

So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, doing this directly would require a huge code change because of an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script to relink intermediate object files that avoids the need to change the code. And it works!

But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.

For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on wifi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!

There are further workarounds they can do to save space, like replacing structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit: the risk is catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.

The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment: they remove the old version of the app. They call this the “Yolo” release strategy.

They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this decision will cost the business eight figures. With only a week to go before release, they drop iOS 8 and eat the cost to get under the cellular download limit.

The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the projected slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.

And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective C.

An Uber engineer in the Amsterdam office comes up with an innovative workaround: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.

And yet, the risk of decompensation is ever-present: the strike team worries that their space-saving wins will not be able to keep pace with the growth of the application.

Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.


The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread; I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.

I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:

Why you should write up your own incident

You shouldn’t write up your own incident if you can avoid it. To write up an incident well, you need to be able to capture the perspectives of the different people who were involved. If the write-up author was also one of the responders, then the writeup will be biased towards their perspective, at the expense of capturing the perspectives of the other engineers who were engaged.

Unfortunately, most organizations haven’t committed the resources to support doing independent incident investigations. I happen to be privileged enough to work at a company that has hired specialists who are skilled at doing independent incident investigations (J. Paul Reed and Jessica DeVita). Once upon a time (last year, to be precise), I was one of those independent incident investigators, before I transitioned back to being a software engineer.

However, even at my employer, we don’t have the resources to do an independent investigation for every single operational surprise that happens, and so the common case is still that a team has to investigate its own operational surprises.

Recently, I was one of the responders to one of these operational surprises. And, since I’m an advocate of teams putting in the effort to write up their operational surprises and share them with the org, I committed to doing that for my team.

During the operational surprise, we identified that certain database rows weren’t being updated, but we struggled to identify why they weren’t being updated. In the moment, we suspected the problem was somehow related to this function, which is responsible for updating those database rows. We believed (correctly, in hindsight) that the function was being called, because a log statement immediately preceding that function invocation appeared in the logs. But, somehow, the database updates weren’t taking effect.

In the moment, I was looking into whether there was something about the database itself that was preventing writes: perhaps some sort of database lock that was blocking updates? To investigate that, I manually translated the code from jOOQ library calls to raw SQL so I could run the queries directly against the database and see what happened.

In the end, it turned out that the problem was not related to the database itself, but to Kotlin code inside that function that was throwing an exception. It was erroring because the code made certain assumptions about the format of version strings, and those assumptions had become invalid over time. When this code hit a version string it couldn’t process, it threw an exception and triggered a transaction rollback.
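As a purely hypothetical reconstruction (this is not the actual code from our service), the failure mode looks something like this: a parser that assumes dot-separated numeric version components throws the moment it encounters a format it never anticipated, and inside a database transaction that exception turns into a rollback.

```kotlin
// Hypothetical sketch, not the real code: the parser bakes in an assumption
// about version string formats that stopped being true over time.
fun parseVersion(version: String): List<Int> =
    version.split(".").map { it.toInt() }   // throws NumberFormatException on "1.4.0-rc.1"

fun main() {
    println(parseVersion("1.4.0"))      // [1, 4, 0]
    println(parseVersion("1.4.0-rc.1")) // NumberFormatException: "0-rc" is not an Int,
                                        // and inside a transaction this becomes a rollback
}
```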

After we remediated, when I looked back on the events of the day, I thought “Boy, I sure did waste a big chunk of time manually translating that code to SQL, when the problem wasn’t related to the database at all.”

Later on, when I put my incident investigator hat on and pored over the Slack messages, I discovered something. While I was working to understand the code in order to translate it, I noticed that one of the queries in that function was too broad. Under normal circumstances, the broadness of the query wasn’t impacting the correctness of the function (the query after it was narrower) or its performance, but during the operational surprise it was increasing the blast radius of the issue. Narrowing the scope of that query was an important part of remediating the incident.

The thing is, until I was investigating the incident, I didn’t realize that I had learned about the broad query issue because I was working to translate the code into SQL. That work I did had real value: it helped us resolve the issue.

Ever since I’ve been bitten by the learning from incidents bug, I’ve been a believer in the value of using an independent investigator. But this is the first time I’ve had the first-hand experience of learning something new about my own work in resolving an incident, even though I was there, because of the post-incident investigation work. It was quite a visceral realization.

And so, while you really should take advantage of independent investigators if resources permit, if you’ve worked as an independent investigator and then transitioned to a role that includes incident response, I recommend trying to write up one of your own incidents, at least once. It really reinforces how much more can be learned from an incident by doing a good investigation.

Taming complexity: from contract to compact

The software contract

We software engineers love the metaphor of the contract when describing software behavior: If I give you X, you promise to give me Y in return. One example of a contract is the signature of a function in a statically typed language. Here’s a function signature in the Kotlin programming language:

fun exportArtifact(exportable: Exportable): DeliveryArtifact

This signature promises that if you call the exportArtifact function with an argument of type Exportable, the return value will be an object of type DeliveryArtifact.

Function signatures are a special case of software contracts, in that they can be enforced mechanically: the compiler guarantees that the contract will hold for any program that compiles successfully. In general, though, the software contracts that we care about can’t be mechanically checked. For example, we might talk about a contract that a particular service provides, but we don’t have tools that can guarantee that our service conforms to the contract. That’s why we have to test it.

Contracts are a type of specification: they tell us that if certain preconditions are met, the system described by the contract guarantees that certain postconditions will be met in return. The idea of reasoning about the behavior of a program using preconditions and postconditions was popularized by C.A.R. Hoare in his legendary paper An Axiomatic Basis for Computer Programming, and is known today as Hoare logic. The language of contract in the software engineering sense was popularized by Bertrand Meyer (specifically, design by contract) in his language Eiffel and his book Object-Oriented Software Construction.
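As a concrete sketch (the Exportable and DeliveryArtifact types below are stand-ins of my own, not the real ones behind the signature above), Kotlin lets you state parts of a contract executably, with require() for preconditions and check() for postconditions:

```kotlin
// Stand-in types for illustration only.
data class Exportable(val name: String)
data class DeliveryArtifact(val reference: String)

fun exportArtifact(exportable: Exportable): DeliveryArtifact {
    // Precondition: the caller must supply a named exportable.
    require(exportable.name.isNotBlank()) { "exportable must have a non-blank name" }

    val artifact = DeliveryArtifact(reference = "artifact://${exportable.name}")

    // Postcondition: we promise to return a well-formed artifact reference.
    check(artifact.reference.startsWith("artifact://")) { "malformed artifact reference" }
    return artifact
}
```

Unlike the type signature, these checks aren’t enforced by the compiler; they only fail at runtime, which is one reason we still have to test.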

We software engineers like contracts because they help us reason about the behavior of a system. Instead of requiring us to understand the complete details of a system that we interact with, all we need to do is understand the contract.

For a given system, it’s easier to reason about its behavior given a contract than from implementation details.

Contracts, therefore, are a form of abstraction. In addition, contracts are composable: we can feed the outputs of system X into system Y if the postconditions of X are consistent with the preconditions of Y. Because we can compose contracts, we can use them to help us build systems out of parts that are described by contracts. Contracts are a tool that enable us humans to work together to build software systems that are too complex for any individual human to understand.

When contracts aren’t useful

Alas, contracts aren’t much use for reasoning about system behavior when either of the following two conditions holds:

  1. A system’s implementation doesn’t fully conform to its contract.
  2. The precondition of a system’s contract is violated by a client.

An example of a contract where a precondition (number of allowed dependencies) was violated

Whether a problem falls into the first or second condition is a judgment call. Either way, your system is now in a bad state.

A contract is of no use for a system that has gotten into a bad state.

A system that has gotten into a bad state is violating its contract, pretty much by definition. This means we must now deal with the implementation details of the system in order to get it back into a good state. Since no one person understands the entire system, we often need the help of multiple people to get the system back into a good state.

Operational surprises often require that multiple engineers work together to get the system back into a good state

Since contracts can’t help us here, we deal with the complexity by leveraging the fact that different engineers have expertise in different parts of the system. By working together, we are pooling the expertise of the engineers. To pull this off, the engineers need to coordinate effectively. Enter the Basic Compact.

The Basic Compact and requirements of coordination

Gary Klein, Paul Feltovich and David Woods defined the Basic Compact in their paper Common Ground and Coordination in Joint Activity:

We propose that joint activity requires a “Basic Compact” that constitutes a level of commitment for all parties to support the process of coordination. The Basic Compact is an agreement (usually tacit) to participate in the joint activity and to carry out the required coordination responsibilities.

One example of a joint activity is… when engineers assemble to resolve an incident! In doing so, they enter a Basic Compact: to work together to get the system back into a stable state. Working together on a task requires coordination, and the paper’s authors list three primary requirements for coordinating effectively on a joint activity: interpredictability, common ground, and directability.

The Basic Compact is also a commitment to ensure a reasonable level of interpredictability. Moreover, the Basic Compact requires that if one party intends to drop out of the joint activity, he or she must inform the other parties.

Interpredictability is about being able to reason about the behavior of other people, and behaving in such a way that your behavior is reasonable to others. As with the world of software contracts, being able to reason about behavior is critical. Unlike software contracts, here we are reasoning about agents rather than artifacts, and those agents are also reasoning about us.

The Basic Compact includes an expectation that the parties will repair faulty knowledge, beliefs and assumptions when these are detected.

Each engineer involved in resolving an incident has beliefs about both the system state and the beliefs of other engineers involved. Keeping mutual beliefs up to date requires coordination work.

During an incident, the responders need to maintain a shared understanding about information such as the known state of the system and what mitigations people are about to attempt. The authors use the term common ground to describe this shared understanding. Anyone who has been in an on-call rotation will find the following description familiar:

All parties have to be reasonably confident that they and the others will carry out their responsibilities in the Basic Compact. In addition to repairing common ground, these responsibilities include such elements as acknowledging the receipt of signals, transmitting some construal of the meaning of the signal back to the sender, and indicating preparation for consequent acts.

Maintaining common ground during an incident takes active effort on behalf of the participants, especially when we’re physically distributed and the situation is dynamic: where the system is not only in a bad state, but it’s in a bad state that’s changing over time. Misunderstandings can creep in, which the authors describe as a common ground breakdown that requires repair to make progress.

A common ground breakdown can mean the difference between a resolution time of minutes and hours. I recall an incident I was involved with, where an engineer made a relevant comment in Slack early on during the incident, and I missed its significance in the moment. In retrospect, I don’t know if the engineer who sent the message realized that I hadn’t properly processed its implications at the time.

Directability refers to deliberate attempts to modify the actions of the other partners as conditions and priorities change.

Imagine a software system has gone unhealthy in one geographical region, and engineer X begins to execute a failover to remediate. Engineer Y notices customer impact in the new region, and types into Slack, “We’re now seeing a problem in the region we’re failing into! Abort the failover!” This is an example of directability, which describes the ability of one agent to affect the behavior of another agent through signaling.

Making contracts and compacts first class

Both contracts and compacts are tools to help deal with complexity. People use contracts to help reason about the behavior of software artifacts. People use the Basic Compact to help reason about each other’s behavior when working together to resolve an incident.

I’d like to see both contracts and compacts get better treatment as first-class concerns. For contracts, there still isn’t a mainstream language with first-class support for preconditions and postconditions, although some non-Eiffel languages do support them (Clojure and D, for example). There’s also Pact, which bills itself as a contract testing tool; it sounds interesting, but I haven’t had a chance to play with it.

For coordination (compacts), I’d like to see explicit recognition of the difficulty of coordination and the significant role it plays during incidents. One of the positive outcomes of the growing popularity of resilience engineering and the learning from incidents in Software movement is the recognition that coordination is a critical activity that we should spend more time learning about.

Further reading and watching

Common Ground and Coordination in Joint Activity is worth reading in its entirety. I only scratched the surface of the paper in this post. John Allspaw gave a great Papers We Love talk on this paper.

Laura Maguire has done some recent PhD work on managing the hidden costs of coordination. She also gave a talk at QCon on the subject.

Ten challenges for making automation a “team player” in joint human-agent activity is a paper that explores the implications of building software agents that are capable of coordinating effectively with humans.

An Axiomatic Basis for Computer Programming is worth reading to get a sense of the history of preconditions and postconditions. Check out Jean Yang’s Papers We Love talk on it.


Thoughts on STAMP

STAMP is an accident model developed by the safety researcher Nancy Leveson. MIT runs an annual STAMP workshop, and this year’s workshop was online due to COVID-19. This was the first time I’d attended the workshop. I’d previously read a couple of papers on STAMP, as well as Leveson’s book Engineering A Safer World, but I’ve never used the two associated processes: STPA for identifying safety requirements, and CAST for accident analysis.

This post captures my impressions of STAMP after attending the workshop.

But before I can talk about STAMP, I have to explain the safety term hazards.

Hazards

Safety engineers work to prevent unacceptable losses. The word safety typically evokes losses such as human death or injury, but a loss can be anything unacceptable (e.g., property damage, mission loss).

Engineers have control over the systems they are designing, but they don’t have control over the environment that the system is embedded within. As a consequence, safety engineers focus their work on states that the system could be in that could lead to a loss.

As an example, consider the scenario where an autonomous vehicle is driving very close to another vehicle immediately in front of it. An accident may or may not happen, depending on whether the vehicle in front slams on the brakes. An autonomous vehicle designer has no control over whether another driver slams on the brakes (that’s part of the environment), but they can design the system to control the separation distance from the vehicle ahead. Thus, we’d say that the hazard is that the minimum separation between the two cars is violated. Safety engineers work to identify, prevent, and mitigate hazards.

As a fan of software formal methods, I couldn’t help thinking about what the formal methods folks call safety properties. A safety property in the formal methods sense is a system state that we want to guarantee is unreachable. (An example of a safety property would be: no two threads can be in the same critical section at the same time).
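As a small illustration of that example (with hypothetical state and field names of my own), a safety property names a state that should be unreachable; in code, the best we can do is assert that no state we actually reach is that state:

```kotlin
// Hypothetical model of the mutual-exclusion safety property mentioned above.
data class SystemState(val thread1InCriticalSection: Boolean, val thread2InCriticalSection: Boolean)

fun assertMutualExclusion(state: SystemState) {
    check(!(state.thread1InCriticalSection && state.thread2InCriticalSection)) {
        "safety property violated: both threads are in the critical section"
    }
}
```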

The difference between hazards in the safety sense and safety properties in the software formal methods sense is:

  • a software formal methods safety property is always a software state
  • a safety hazard is never a software state

Software can contribute to a hazard occurring, but a software state by itself is never a hazard. Leveson noted in the workshop that software is just a list of instructions, and a list of instructions can’t hurt a human being. (Of course, software can cause something else to hurt a human being! But that’s an explanation of how the system got into a hazardous state; it’s not the hazard.)

The concept of hazards isn’t specific to STAMP. There are a number of other safety techniques for doing hazard analysis, such as HAZOP, FMEA, FMECA, and fault tree analysis. However, I’m not familiar with those techniques so I won’t be drawing any comparisons here.


If I had to sum up STAMP in two words, they would be: control and design. STAMP is about control, in two senses.

Control

In one sense, STAMP uses a control system model for doing hazard analysis. This is a functional model of controllers which interact with each other through control actions and feedback. Because it’s a functional model, a controller is just something that exercises control: a software-based subsystem, a human operator, a team of humans, even an entire organization, each would be modeled as a controller in a description of the system used for doing the hazard analysis.
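In software terms, here’s a loose sketch of that vocabulary (my own rendering, not notation from the workshop): a controller is anything that keeps a process model, updates it from feedback, and issues control actions on the process it controls.

```kotlin
// My own rough rendering of STAMP's functional vocabulary, not official STAMP notation.
// A "controller" can be a software subsystem, a human operator, a team, or an organization.
interface Controller<Feedback, ControlAction> {
    fun observe(feedback: Feedback)   // update the controller's process model of the controlled process
    fun decide(): ControlAction       // choose a control action based on that process model
}
```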

In another sense, STAMP is about achieving safety through controlling hazards: once they’re identified, you use STAMP to generate safety requirements for controls for preventing or mitigating the hazards.

Design

STAMP is very focused on achieving safety through proper system design. The idea is that you identify safety requirements for controlling hazards, and then you make sure that your design satisfies those safety requirements. While STAMP influences the entire lifecycle, the sense I got from Nancy Leveson’s talks was that accidents can usually be attributed to hazards that were not controlled for by the system designer.


Here are some general, unstructured impressions I had of both STAMP and Nancy Leveson. I felt I got a much better sense of STPA than CAST during the workshop, so most of my impressions are about STPA.

Where I’m positive

I like STAMP’s approach of taking a systems view, and making a sharp distinction between reliability and safety. Leveson is clearly opposed to the ideas of root cause and human error as explanations for accidents.

I also like the control structure metaphor, because it encourages me to construct a different kind of functional model of the system to help reason about what can go wrong. I also liked how software and humans were modeled the same way, as controllers, although I also think this is problematic (see below). It was interesting to contrast the control system metaphor with the joint cognitive system metaphor employed by cognitive systems engineering.

Although I’m not in a safety-critical industry, I think I could pick up and apply STPA in my domain. Akamai is an existence proof that you can use this approach in a non-safety-critical software context. Akamai has even made their STAMP materials available online.

I was surprised how influenced Leveson was by some of Sidney Dekker’s work, given that she has a very different perspective than him. She mentioned “just culture” several times, and even softened the language around “unsafe control actions” used in the CAST incident analysis process because of the blame connotations.

I enjoyed Leveson’s perspective on probabilistic risk assessment: that it isn’t useful to try to compute risk probabilities. Rather, you build an exploratory model, identify the hazards, and control for them. Very much in tune with my own thoughts on using metrics to assess how you’re doing versus identifying problems. STPA feels like a souped-up version of Gary Klein’s premortem method of risk assessment.

I liked the structure of the STPA approach, because it provides a scaffolding for novices to start with. Like any approach, STPA requires expertise to use effectively, but my impression is that the learning curve isn’t that steep to get started with it.

Where I’m skeptical

STAMP models humans in the system the same way it models all other controllers: using a control algorithm and a process model. I feel like this is too simplistic. How will people use workarounds to get their work done in ways that vary from the explicitly documented procedures? Similarly, there was nothing at all about anomaly response, and it wasn’t clear to me how you would model something like production pressure.

I think Leveson’s response would be that if people employ workarounds that lead to hazards, that indicates that there was a hazard missed by the designer during the hazard analysis. But I worry that these workarounds will introduce new control loops that the designer wouldn’t have known about. Similarly, Leveson isn’t interested in issues around uncertainty and the human judgment of operators, because those likewise indicate flaws in the design to control the hazards.

STAMP generates recommendations and requirements; acting on them changes the system, and those changes can reverberate through the system, introducing new hazards. While the conference organizers were aware of this, it wasn’t as much of a first-class concept as I would have liked: the discussions in the workshop tended to end at the recommendation level. There was a little bit of discussion about how to deal with the fact that STAMP can introduce new hazards, but I would have liked to have seen more. It wasn’t clear to me how you could use a control system model to solve the envisioned world problem: how will the system change how people work?

Leveson is very down on agile methods for safety-critical systems. I don’t have a particular opinion there, but in my world, agile makes a lot of sense, and our system is in constant flux. I don’t think any subset of engineers has enough knowledge of the system to sketch out a control model that would capture all of the control loops that are present at any one point in time. We could do some stuff, and find some hazards, and that would be useful! But I believe that in my context, this limits our ability to fully explore the space of potential hazards.

Leveson is also very critical of Hollnagel’s idea of Safety-II, and she finds his portrayal of Safety-I unrecognizable in her own experiences doing safety engineering. She believes that depending on the adaptability of the human operators is akin to blaming accidents on human error. That sounded very strange to my ears.

While there wasn’t as much discussion of CAST in the workshop itself, there was some, and I did skim the CAST handbook as well. CAST was too focused on “why” rather than “how” for my taste, and many of the questions generated by CAST are counterfactuals, which I generally have an allergic reaction to (see page 40 of the handbook).

During the workshop, I asked the workshop organizers (Nancy Leveson and John Thomas) if they could think of a case where a system designed using STPA had an accident. They did not know of an example where that had happened, and that took me aback. They both insisted that STPA wasn’t perfect, but it worries me that they hadn’t yet encountered a situation that revealed any weaknesses in the approach.

Outro

I’m always interested in approaches that can help us identify problems, and STAMP looks promising on that front. I’m looking forward to opportunities to get some practice with STPA and CAST.

However, I don’t believe we can control all of the hazards in our designs, at least, not in the context that I work in. I think there are just too many interactions in the system, and the system is undergoing too much change every day. And that means that, while STAMP can help me find some of the hazards, I have to worry about phenomena like anomaly response, and how well the human operators can adapt when the anomalies inevitably happen.