Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that the questions I have shouldn’t be seen as a critique of the write-up, as the answers to the questions generally aren’t publicly shareable. They’re more in the category of “I wish I could be a fly on the wall inside of Google” questions.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This is the land mine risk: the code you are rolling out contains a land mine that is not tripped during the rollout.
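To make the land-mine pattern concrete, here’s a hypothetical sketch in Python. All names here are invented (the actual Service Control code is of course not public); the point is that the buggy branch is only reachable for data that didn’t exist during the rollout, so every region looks healthy until the first new-style policy arrives.

```python
# Hypothetical sketch of the land-mine pattern; invented names throughout.

def check_quota(policy: dict) -> bool:
    if "quota_limits" in policy:
        # New code path from the (hypothetical) new release: it assumes the
        # field is well-formed. A policy with a blank/None field crashes here.
        limits = policy["quota_limits"]
        return limits["requests_per_day"] > 0
    # Old code path: every policy that existed during the region-by-region
    # rollout took this branch, so every region looked healthy.
    return True
```

Under this sketch, a policy with a blank field sails past every rollout check and then crashes the first task that reads it, in every region at once.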

How did the decisions make sense at the time?

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-07-19T19:17:47.843Z

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.
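The mechanism the write-up describes, enabling a feature per region and per project starting with internal projects, might look something like the sketch below. This is my guess at the shape of such a system, not Google’s actual flag API; every name is invented.

```python
# Invented sketch of per-region, per-project feature flag gating.
ENABLED = set()  # (feature, region, project) tuples, widened gradually

def flag_enabled(feature: str, region: str, project: str) -> bool:
    return (feature, region, project) in ENABLED

def check_quota(policy: dict, region: str, project: str) -> bool:
    if flag_enabled("new-quota-checks", region, project):
        # The risky new path runs only where it has been explicitly enabled,
        # so a crash here is confined to (say) an internal canary project.
        return policy["quota_limits"]["requests_per_day"] > 0
    return True  # legacy behavior everywhere else
```

With gating like this, the same malformed policy crashes only the scopes where the flag is on, which is presumably what the write-up means by catching the issue early.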

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.
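For reference, “randomized exponential backoff” usually means something like the full-jitter scheme sketched below. The write-up doesn’t say which variant they intend, so this is just one common choice.

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 60.0,
                   rng=random) -> list[float]:
    """Full-jitter exponential backoff: each retry waits a random amount
    between 0 and min(cap, base * 2**attempt), so a herd of crashed tasks
    spreads its retries out instead of hammering the store in lockstep."""
    return [rng.random() * min(cap, base * 2 ** i) for i in range(attempts)]
```

The randomization is the important part for the herd problem: without jitter, every crashed task computes the same delay schedule and retries at the same instant.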

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Code changes are scary, config changes are scarier, and data changes are the scariest of them all.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-14T19:32:32.669Z

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that requirement.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?
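One way to picture the incremental propagation the write-up proposes (structure entirely invented): replicate the change one wave of regions at a time, and require validation to pass before the next wave proceeds.

```python
def propagate_staged(change, waves, apply_fn, validate_fn):
    """Replicate `change` wave by wave, halting at the first wave that
    fails validation. Returns (regions_updated, failed_wave_or_None).
    A sketch only; region names and the wave structure are made up."""
    updated = []
    for wave in waves:
        for region in wave:
            apply_fn(change, region)
        if not all(validate_fn(region) for region in wave):
            return updated, wave  # blast radius limited to this wave
        updated.extend(wave)
    return updated, None
```

A bad change halts at its wave instead of reaching every region within seconds, which is the whole point; the cost is that regions can now disagree about the data while a rollout is in flight.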

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

From an operational perspective, I remain terrified of databases

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-13T17:21:16.810Z

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal mode, and it’s generally very difficult to test for how it will behave when it’s in recovery mode.
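A toy model of the herd effect, with invented numbers: if every restarted task reads from the backing store immediately, the peak load equals the fleet size, while spreading those same reads over a jitter window flattens the peak.

```python
import random

def peak_startup_load(num_tasks: int, jitter_seconds: int, rng) -> int:
    """Crude model: each restarting task reads from the store once at
    startup. Returns the worst-case number of reads landing in any
    single second."""
    per_second = {}
    for _ in range(num_tasks):
        t = rng.randrange(jitter_seconds) if jitter_seconds else 0
        per_second[t] = per_second.get(t, 0) + 1
    return max(per_second.values())
```

The model is deliberately simplistic (real recovery load includes retries, cache misses, and queued work), but it shows why recovery mode stresses a database in ways normal operation never does.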

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always have of corrective actions are:

  • What are the tradeoffs involved in implementing these corrective actions?
  • How might they enable new failure modes or make future incidents more difficult to deal with?

AI at Amazon: a case study of brittleness

A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, a term the researchers use for a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

  1. Decompensation: exhausting capacity to adapt as challenges cascade
  2. Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
  3. Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

It would take weeks to get access to any internal data for analysis or experiments
Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands placed upon it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.

As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.

Working at cross-purposes

Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.

This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.

The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.

My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.

Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s a risk of misalignment: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.

In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.

Getting stuck in outdated behaviors

Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.

That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.

So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.

This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.

I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.

In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.


I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability had significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for the system you have thought about where its reliability increased over time, think about what the complexity trend looks like over time for that system. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. For someone who grew up in the 80s and 90s, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one that most people have in mind when they think about increasing complexity in their system, is synonymous with the idea of tech debt. With tech debt the increase in complexity makes the system less reliable, because the risk of making a breaking change in the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely that you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Dijkstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability, but come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems, like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It?, and think about the work we do to make our software systems more reliable: retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, plus the auxiliary systems that support reliability work, like an observability stack. All of this adds complexity.

Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion would make the codebase simpler. There’s a reason that programming books tend to avoid error handling in their examples: it really does increase complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.
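To make the point concrete, here’s the same call with and without the reliability machinery. The “reliable” version is generic boilerplate of my own, not taken from any of the books above, but it’s representative of what retries and backoff cost in complexity.

```python
import time

def call_simple(op):
    # The version programming books show: short, and obviously "correct".
    return op()

def call_reliable(op, retries=3, base_delay=0.01, sleep=time.sleep):
    # The version a reliable system actually needs: retries, exponential
    # backoff, and a final re-raise. Strictly more complex, and strictly
    # more likely to survive a transient fault.
    last_exc = None
    for attempt in range(retries):
        try:
            return op()
        except Exception as exc:
            last_exc = exc
            sleep(base_delay * 2 ** attempt)
    raise last_exc
```

Multiply this pattern across every remote call, add circuit breakers and metrics, and the “simple” system is gone, but so are most of the user-visible failures.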

Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems which can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.

Not causal chains, but interactions and adaptations

I’ve been on a bit of an anti-root-cause-analysis (RCA) tear lately. On LinkedIn, health and safety expert Vincent Theobald-Vega left a thoughtful, detailed comment on my last post. In his comment, he noted that RCA done poorly leads to bad results, and he pointed me to what he described as a good guide to using the RCA approach: Investigating accidents and incidents. This is a free book published by the UK Health and Safety Executive.

However, after reading this guide, my perception of RCA has not improved. I still believe that RCA is based on a fundamentally incorrect model of complex systems failure. To clarify my thinking, I’ve tried to explain where I see its flaws in this post.

A quick note on terminology: while the guide uses the term accident, I’m going to use the term incident instead, to remain consistent with the usage in the software domain. The guide uses incident to refer to near misses.

Some content in the guide that I appreciated

While I disagree with RCA as described in the guide, I wanted to start by pointing out areas of agreement I had with the guide.

Not just a single cause

The guide does note that there are multiple causes involved in incidents. It notes that adverse events have many causes (p6), and it also mentions that Very often, a number of chance occurrences and coincidences combine to create the circumstances in which an adverse event can happen. All these factors should be recorded here in chronological order, if possible. (p10).

While I disagree with the causal language, I do at least appreciate that it points out there are multiple factors.

Examine how work is really done

The guide does talk about examining the work and the context under which it takes place. Under “information and insights gained from an investigation”, one of the bullet points is A true snapshot of what really happens and how work is really done (p7).

Under the “Gathering detailed information: How and what?” section, the guide asks “What activities were being carried out at the time?” and “Was there anything unusual or different about the working conditions?” (p15)

“Human error” is not a valid conclusion

The guide is opposed to the idea of human error being identified as a cause. It notes that Investigations that conclude that operator error was the sole cause are rarely acceptable. Underpinning the ‘human error’ there will be a number of underlying causes that created the environment in which human errors were inevitable. (p10)

Examine your near misses

Finally, the guide does point out the value in investigating near misses, noting that While the argument for investigating accidents is fairly clear, the need to investigate near misses and undesired circumstances may not be so obvious. However, investigating near misses and undesired circumstances is as useful, and very much easier than investigating accidents. (p8)

The RCA model of incidents

Here’s my attempt to sketch out a conceptual model of how incidents happened, according to the guide.

The guide distinguishes between three different types of causes:

  • Immediate cause – the most obvious reason why an adverse event happens (p4)
  • Underlying cause – the less obvious ‘system’ or ‘organisational’ reason for an adverse event happening (p5)
  • Root cause – an initiating event or failing from which all other causes or failings spring. Root causes are generally management, planning or organisational failings (p5)

How root causes lead to incidents

The idea is that there is a causal chain from root cause to underlying cause to immediate cause. A combination of these immediate causes, along with chance occurrences and coincidences, combine to enable the incident.

The guide uses the metaphor of a sequence of dominos to describe this causal chain, where the initial domino (labeled “A” in the diagram below) is a root cause, and the domino labeled “B” an immediate cause.

Source: Investigating accidents and incidents, UK Health and Safety Executive, figure 4, p6

If left unaddressed, these root causes will lead to multiple incidents in the future. Hence, the goal of an RCA is to identify and eliminate the root causes in order to prevent recurrence of the incident:

The same accidents happen again and again, causing suffering and distress to an ever-widening circle of workers and their families… The findings of the investigation will form the basis of an action plan to prevent the accident or incident from happening again (p4, emphasis mine)

To get rid of weeds you must dig up the root. If you only cut off the foliage, the weed will grow again. Similarly it is only by carrying out investigations which identify root causes that organisations can learn from their past failures and prevent future failures. (p9, emphasis mine)

The RE model of incidents

My claim is that the RCA model of incidents is dangerously incorrect about the nature of failure in complex systems. More importantly, these flaws in the RCA model lead to sub-optimal outcomes for incident investigations. In other words, we can do a lot better than RCA if we have a different model about how incidents happen.

The best way to illustrate this is to describe an alternative model that I believe more accurately models complex systems failures, and results in better investigation outcomes. I’m going to call it the resilience engineering (RE) model in this blog post, partly to encourage folks to explore the research field of resilience engineering, and partly as a way to encourage folks to check out the Resilience in Software Foundation. But you may have heard terms associated with this model, such as the New Look, the New View, Safety-II, and Learning from Incidents (LFI). My favorite summary of the RE model is Richard Cook’s very short paper How Complex Systems Fail.

Not causes but interactions

Where RCA treats causes as the first-class entities of an incident, RE instead treats interactions as the first-class entity. It is the unexpected interactions of the different components in a complex system that enable the incident to occur.

Oh, what a tangled web!

Note that there’s no causal chain in this model. Instead, it’s an ever-branching web of contributing factors, where each factor is itself potentially influenced by other factors, and so on. I like how John Allspaw uses the expression the infinite hows to draw a contrast to the causal chain view of five whys. I once proposed the metaphor of the Gamma knife as a way to imagine how these interactions come together to enable an incident.

Labeling the behavior of the individual components as causes is dangerous because it obscures the fact that the problem was not the individual components themselves but that separate subsystems interacted in ways that were unpredictable and harmful. Modern software systems are essentially control systems with multiple feedback loops, and it’s effectively impossible for humans to predict how these loops are going to interact with each other and with the range of possible inputs we might throw at them. You don’t have to look any further than Kubernetes to understand both the value and the surprising behavior of feedback systems.
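Here’s a toy illustration, with entirely invented rules and numbers: two control loops, each defensible on its own, that never converge because each one’s correction drops the system back into the other’s trigger band.

```python
def simulate(steps: int = 20) -> list[int]:
    """A made-up 'autoscaler' adds replicas above 60% utilization; a
    made-up 'cost controller' removes them below 70%. The overlapping
    thresholds mean the fleet oscillates forever instead of settling."""
    replicas, demand, per_replica = 4, 40, 10
    history = []
    for _ in range(steps):
        utilization = demand / (replicas * per_replica)
        if utilization > 0.6:
            replicas += 2      # autoscaler: "we're running hot"
        elif utilization < 0.7:
            replicas -= 2      # cost controller: "we're overprovisioned"
        history.append(replicas)
    return history
```

Neither loop is wrong in isolation; the failure lives entirely in their interaction, which is exactly why “which component caused it?” is the wrong question.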

Under the RE model, incidents are perfect storms of complex interactions across multiple components under a particular set of circumstances. Even though this incident revealed a dangerous interaction between components A and B, the next incident may be an interaction between components D and E, and the D-E interaction may be even more likely to occur than the A-B one is to recur.

In addition, changing the behavior of components A or B might enable new failure modes by creating the opportunity for new unexpected interactions with other components, even though it has prevented the A-B interaction.

Adaptations to compensate for existing faults

Here’s a different picture. Imagine your system as a collection of components, which I’ve denoted here as rounded boxes. To keep things simple, I’m not going to show the interactions.

A collection of components that are part of your system.

Now, imagine that you experience an incident, you do an RCA, and you identify as the underlying causes that two of the components behaved incorrectly in some way. There was a fault in those components that wasn’t noticed before the incident.

The RCA reveals that the underlying causes were the behavior of the two components, shaded here in red

The RCA model would look for the root cause of these faults, perhaps a problem in the way that these components were validated. For example, perhaps there was a certain type of testing that wasn’t done, and that’s how the problem went undetected. As a result, not only would these two components be fixed, but we would also have improved the process by which we verify components, meaning fewer component problems in the future.

Now, let’s look at the RE model. This model tells us that there are what Cook calls latent failures that are distributed throughout the system: they’re there, but we don’t know where they are. Sometimes these latent failures are referred to as faults.

In addition to the known failures in red, there are a large number of unseen latent failures

Despite the presence of all of these faults in our complex system, the system actually functions most of the time. Cook describes this by observing that complex systems are heavily and successfully defended against failure and complex systems run in degraded mode. Even though your system is riddled with faults, it still functions well enough to be useful, although it never functions perfectly.

This is actually one of the secrets of services that seem reliable to their end users. It’s not that they never encounter problems, it’s that they are able to compensate for those problems in order to keep working correctly. In the RE model, successful complex systems are always fault-tolerant, because they need to be in order to succeed.

Because there are so many latent failures, and they change over time, the RCA approach (find a root cause, and root it out) doesn’t generate continuous improvement under the RE model. Because an incident was due to a random combination of multiple latent failures, and because there are so many of these failures, simply preventing the recurrence of one specific combination doesn’t buy you much: future incidents are very likely to be different, because they’ll involve novel combinations of latent failures that you don’t see.

In contrast, the RE approach emphasizes the idea of identifying how your system adapts to succeed in the presence of all these faults. The desired outcomes of this approach are to increase your ability to keep adapting to faults in the future, and to find the areas in your system where you are less able to adapt effectively. It means understanding that your system is fault tolerant, and using incidents to understand how the people in your system are able to adapt to deal with faults.

This also includes understanding how those adaptations can fail to keep the system running: when an incident happens, those adaptations weren’t sufficient. But there’s a huge difference between “this process led to a fault and so it needs to be changed” (RCA) and “the way we normally work is typically effective at working around problem X, but it didn’t work in these particular circumstances because Y and Z and …”

The RCA approach is about finding the generators of faults in your organization and removing them. The RE approach is about finding the sources of fault tolerance in your organization so you can nurture and grow them. The RE folks call this adaptive capacity. Remember, your system contains a multitude of undiscovered faults, and those faults will ultimately result in surprising incidents, no matter how many root causes you identify and eliminate. Consider trying the RE approach. After all, you’re going to need all of the fault tolerance you can get.

Labeling a root cause is predicting the future, poorly

Why do we retrospect on our incidents? Why spend the time doing those write-ups and holding review meetings? We don’t do this work as some sort of intellectual exercise for amusement. Rather, we believe that if we spend the time to understand how the incident happened, we can use that insight to improve the system in general, and availability in particular. We improve availability by preventing incidents as well as reducing the impact of incidents that we are unable to prevent. This post-incident work should help us do both.

The typical approach to post-incident work is to do a root cause analysis (RCA). The idea of an RCA is to go beyond the surface-level symptoms to identify and address the underlying problems revealed by the incident. After all, it’s only by getting at the root of the problem that we will be able to permanently address it. When doing an RCA, when we attach the label root cause to something, we’re making a specific claim. That claim is: we should focus our attention on the issues that we’ve labeled “root cause”, because spending our time addressing these root causes will yield the largest improvements to future availability. Sure, it may be that there were a number of different factors involved in the incident, but we should focus on the root cause (or, sometimes, a small number of root causes), because those are the ones that really matter. Sure, the fact that Joe happened to be on PTO that day, and he’s normally the one that spots these sorts of problems early, that’s interesting, but it isn’t the real root cause.

Remember that an RCA, like all post-incident work, is supposed to be about improving future outcomes. As a consequence, a claim about root cause is really a prediction about future incidents. It says that of all of the contributing factors to an incident, we are able to predict which factor is most likely to lead to an incident in the future. That’s quite a claim to make!

Here’s the thing, though. As our history of incidents teaches us over and over again, we aren’t able to predict how future incidents will happen. Sure, we can always tell a compelling story of why an incident happened, through the benefit of hindsight. But that somehow never translates into predictive power: we’re never able to tell a story about the next incident the way we can about the last one. After all, if we were as good at prediction as we are at hindsight, we wouldn’t have had that incident in the first place!

A good incident retrospective can reveal a surprisingly large number of different factors that contributed to the incident, providing signals for many different kinds of risks. So here’s my claim: there’s no way to know which of those factors is going to bite you next. You simply don’t possess a priori knowledge about which factors you should pay more attention to at the time of the incident retrospective, no matter what the vibes tell you. Zeroing in on a small number of factors will blind you to the role that the other factors might play in future incidents. Today’s “X wasn’t the root cause of incident A” could easily be tomorrow’s “X was the root cause of incident B”. Since you can’t predict which factors will play the most significant roles in future incidents, it’s best to cast as wide a net as possible. The more you identify, the more context you’ll have about the possible risks. Heck, maybe something that only played a minor role in this incident will be the trigger in the next one! There’s no way to know.

Even if you’re convinced that you can identify the real root cause of the last incident, it doesn’t actually matter. The last incident already happened; there’s no way to prevent it now. What’s important is not the last incident, but the next one: we’re looking at the past only as a guide to help us improve in the future. And while I think incidents are inherently unpredictable, here’s a prediction I’m comfortable making: your next incident is going to be a surprise, just like your last one was, and the one before that. Don’t fool yourself into thinking otherwise.

On work processes and outcomes

Here’s a stylized model of work processes and outcomes. I’m going to call it “Model I”.

Model I: Work process and outcomes

If you do work the right way, that is, follow the proper processes, then good things will happen. And, when we don’t, bad things happen. I work in the software world, so by “bad outcome” I mean an incident, and by “doing the right thing”, the work processes typically refer to software validation activities, such as reviewing pull requests, writing unit tests, and manually testing in a staging environment. But it also includes work like adding checks in the code for unexpected inputs, ensuring you have an alert defined to catch problems, having someone else watching over your shoulder when you’re making a risky operational change, not deploying your production changes on a Friday, and so on. Do this stuff, and bad things won’t happen. Don’t do this stuff, and bad things will.

If you push someone who believes in this model, you can get them to concede that sometimes nothing bad happens even though someone didn’t do everything quite right. The amended model looks like this:

Inevitably, an incident happens. At that point, we focus the post-incident efforts on identifying what went wrong with the work. What was the thing that was done wrong? Sometimes, this is individuals who weren’t following the process (deployed on a Friday afternoon!). Other times, the outcome of the incident investigation is a change in our work processes, because the incident has revealed a gap between “doing the right thing” and “our standard work processes”, so we adjust our work processes to close the gap. For example, maybe we now add an additional level of review and approval for certain types of changes.


Here’s an alternative stylized model of work processes and outcomes. I’m going to call it “Model II”.

Model II: work processes and outcomes

Like our first model, this second model contains two categories of work processes. But the categories here are different. They are:

  1. What people are officially supposed to do
  2. What people actually do

The first categorization is an idealized view of how the organization thinks that people should do their work. But people don’t actually do their work that way. The second category captures what the real work actually is.

This second model of work and outcomes has been embraced by a number of safety researchers. I deliberately named my models Model I and Model II as a reference to Safety-I and Safety-II. Safety-II is a concept developed by the resilience engineering researcher Dr. Erik Hollnagel. The human factors experts Dr. Todd Conklin and Bob Edwards describe this alternate model using a black-line/blue-line diagram. Dr. Steven Shorrock refers to the first category as work-as-prescribed, and the second category as work-as-done. In our stylized model, all outcomes come from this second category of work, because it’s the only one that captures the actual work that leads to any of the outcomes. (In Shorrock’s more accurate model, the two categories of work overlap, but bear with me here.)

This model makes some very different assumptions about the nature of how incidents happen! In particular, it leads to very different sorts of questions.

The first model is more popular because it’s more intuitive: when bad things happen, it’s because we did things the wrong way, and that’s when we look back in hindsight to identify what those wrong ways were. The second model requires us to think more about the more common case when incidents don’t happen. After all, we measure our availability in 9s, which means the overwhelming majority of the time, bad outcomes aren’t happening. Hence, Hollnagel encourages us to spend more time examining the common case of things going right.

Because our second model assumes that what people actually do usually leads to good outcomes, it will lead to different sorts of questions after an incident, such as:

  1. What does normal work look like?
  2. How is it that this normal work typically leads to successful outcomes?
  3. What was different in this case (the incident) compared to typical cases?

Note that this second model doesn’t imply that we should always just keep doing things the same way we always do. But it does imply that we should be humble in enforcing changes to the way work is done, because the way that work is done today actually leads to good outcomes most of the time. If you don’t understand how things normally work well, you won’t see how your intervention might make things worse. Just because your last incident was triggered by a Friday deploy doesn’t mean that banning Friday deploys will lead to better outcomes. You might actually end up making things worse.

Good models protect us from bad models

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable: “OK, let’s say you’re right, that complex systems are never perfectly understood, they’re always changing, they generate unexpected interactions, and that these properties explain why incidents happen. That doesn’t tell me what I should do about it!”

And it’s true; I can talk generally about the value of improving expertise so that we’re better able to handle incidents. But I can’t take the model of incidents that I’ve built based on my knowledge of resilience engineering and turn that into a specific software project that you can build and deploy that will eliminate a class of incidents.

But even if these insights aren’t actionable, even if they don’t tell us about a single thing we can do or build to help improve reliability, my claim here is that these insights still have value. That’s because we as humans need models to make sense of the world, and if we don’t use good-but-not-actionable models, we can end up with actionable-but-not-good models. Or, as the statistics professor Andrew Gelman put it back in 2021 in his post The social sciences are useless. So why do we study them? Here’s a good reason:

The baseball analyst Bill James once said that the alternative to good statistics is not no statistics, it’s bad statistics. Similarly, the alternative to good social science is not no social science, it’s bad social science.

The reason we do social science is because bad social science is being promulgated 24/7, all year long, all over the world. And bad social science can do damage.

Because we humans need models to make sense of the world, incident models are inevitable. A good-but-not-actionable incident model will feel unsatisfying to people who are looking to leverage these models to take clear action. And it’s all too easy to build not-good-but-actionable models of how incidents happen. Just pick something that you can measure and that you theoretically have control over. The most common example of such a model is the one I’ll call “incidents happen because people don’t follow the processes that they are supposed to.” It’s easy to call out process violations in incident writeups, and it’s easy to define interventions to more strictly enforce processes, such as through automation.

In other words, good-but-not-actionable models protect us from the actionable-but-not-good models. They serve as a kind of vaccine, inoculating us from the neat, plausible, and wrong solutions that H.L. Mencken warned us about.

Model error

One of the topics I wrote about in my last post was using formal methods to build a model of how our software behaves. In this post, I want to explore how the software we write itself contains models: models of how the world behaves.

The most obvious area is in our database schemas. These schemas enable us to digitally encode information about some aspect of the world that our software cares about. Heck, we even used to refer to this encoding of information into schemas as data models. Relational modeling is extremely flexible: in principle, we can represent just about any aspect of the world with it, if we put enough effort in. The challenge is that the world is messy, and this messiness significantly increases the effort required to build more complete models. Because we often don’t even recognize the degree of messiness the real world contains, we build over-simplified models that are too neat. This is how we end up with issues like the ones captured in Patrick McKenzie’s essay Falsehoods Programmers Believe About Names. There’s a whole book-length meditation on the messiness of real data and how it poses challenges for database modeling: Data and Reality by William Kent, which is highly recommended by Hillel Wayne in his post Why You Should Read “Data and Reality”.

The problem of missing the messiness of the real world is not at all unique to software engineers. For example, see Christopher Alexander’s A City Is Not a Tree for a critique of urban planners’ overly simplified view of human interactions in urban environments. For a more expansive lament, check out James C. Scott’s excellent book Seeing Like a State. But, since I’m a software engineer and not an urban planner or a civil servant, I’m going to stick to the software side of things here.

Models in the back, models in the front

In particular, my own software background is in the back-end/platform/infrastructure space. In this space, the software we write frequently implements control systems. It’s no coincidence that both cybernetics and kubernetes derive their names from the same ancient Greek word: κυβερνήτης. Every control system must contain within it a model of the system that it controls. Or, as Roger C. Conant and W. Ross Ashby put it, every good regulator of a system must be a model of that system.

Things get even more complex on the front-end side of the software world. This world must bridge the software world with the human world. In the context of Richard Cook’s framing in Above the Line, Below the Line, the front-end is the line that bridges the two worlds. As a consequence, the front-end’s responsibility is to expose a model of the software’s internal state to the user. This means that the front-end also has an implicit model of the users themselves. In the paper Cognitive Systems Engineering: New wine in new bottles, Erik Hollnagel and David Woods referred to this model as the image of the operator.

The dangers of the wrongness of models

There’s an oft-repeated quote by the statistician George E.P. Box: “All models are wrong but some are useful”. It’s a true statement, but one that focuses only on the upside of wrong models, the fact that some of them are useful. There’s also a downside to the fact that all models are wrong: the wrongness of these models can have drastic consequences.

What the Box quote fails to hint at is how bad the consequences can be when a model is wrong. One of my favorite examples involves the 2008 financial crisis, as detailed in the journalist Felix Salmon’s 2009 Wired magazine article Recipe for Disaster: The Formula that Killed Wall Street. The article described how Wall Street quants used a mathematical model known as the Gaussian copula function to estimate risk. It was a useful model that ultimately led to disaster.

Here’s a ripped-from-the-headlines example of an image-of-the-operator model error: the U.S. national security advisor Mike Waltz accidentally saved the phone number of Jeffrey Goldberg, editor of the Atlantic magazine, to the contact information of White House spokesman Brian Hughes. The source is the recent Guardian story How the Atlantic’s Jeffrey Goldberg got added to the White House Signal group chat:

According to three people briefed on the internal investigation, Goldberg had emailed the campaign about a story that criticized Trump for his attitude towards wounded service members. To push back against the story, the campaign enlisted the help of Waltz, their national security surrogate.

Goldberg’s email was forwarded to then Trump spokesperson Brian Hughes, who then copied and pasted the content of the email – including the signature block with Goldberg’s phone number – into a text message that he sent to Waltz, so that he could be briefed on the forthcoming story.

Waltz did not ultimately call Goldberg, the people said, but in an extraordinary twist, inadvertently ended up saving Goldberg’s number in his iPhone – under the contact card for Hughes, now the spokesperson for the national security council.

According to the White House, the number was erroneously saved during a “contact suggestion update” by Waltz’s iPhone, which one person described as the function where an iPhone algorithm adds a previously unknown number to an existing contact that it detects may be related.

The software assumed that, when you receive a text from someone with a phone number and email address, that the phone number and email address belong to the sender. This is a model of the user that turned out to be very, very wrong.

Nobody expects model error

Software incidents involve model errors in one way or another, whether it’s an incorrect model of the system being controlled, an incorrect image of the operator, or a combination of the two.

And, yet, despite us all intoning “all models are wrong, some models are useful”, we don’t internalize that our systems are built on top of imperfect models. This is one of the ironies of AI: we are now all aware of the risks associated with model error in LLMs. We’ve even come up with a separate term for it: hallucinations. But traditional software is just as vulnerable to model error as LLMs are, because our software is always built on top of models that are guaranteed to be incomplete.

You’re probably familiar with the term black swan, popularized by the acerbic public intellectual Nassim Nicholas Taleb. While his first book, Fooled by Randomness, was a success, it was the publication of The Black Swan that made Taleb a household name, and introduced the public to the concept of black swans. While the term black swan was novel, the idea it referred to was not. Back in the 1980s, the researcher Zvi Lanir used a different term: fundamental surprise. Here’s an excerpt of a Richard Cook lecture on the 1999 Tokaimura nuclear accident where he talks about this sort of surprise (skip to the 45 minute mark).

And this Tokaimura accident was an impossible accident.

There’s an old joke about the creator of the first English American dictionary, Noah Webster … coming home to his house and finding his wife in bed with another man. And she says to him, as he walks in the door, she says, “You’ve surprised me”. And he says, “Madam, you have astonished me”.

The difference was that she of course knew what was going on, and so she could be surprised by him. But he was astonished. He had never considered this as a possibility.

And the Tokaimura was an astonishment or what some, what Zev Lanir and others have called a fundamental surprise which means a surprise that is fundamental in the sense that until you actually see it, you cannot believe that it is possible. It’s one of those “I can’t believe this has happened”. Not, “Oh, I always knew this was a possibility and I’ve never seen it before” like your first case of malignant hyperthermia, if you’re a an anesthesiologist or something like that. It’s where you see something that you just didn’t believe was possible. Some people would call it the Black Swan.

Black swans, astonishment, fundamental surprise, these are all synonyms for model error.

And these sorts of surprises are going to keep happening to us, because our models are always wrong. The question is: in the wake of the next incident, will we learn to recognize that fundamental surprises will keep happening to us in the future? Or will we simply patch up the exposed problems in our existing models and move on?

Models, models every where, so let’s have a think

If you’re a regular reader of this blog, you’ll have noticed that I tend to write about two topics in particular:

  1. Resilience engineering
  2. Formal methods

I haven’t found many people who share both of these interests.

At one level, this isn’t surprising. Formal methods people tend to have an analytic outlook, and resilience engineering people tend to have a synthetic outlook. You can see the clear distinction between these two perspectives in the transcript of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology. Lamport is clearly on the side of logic, so much so that he ridicules the very idea of taking a biological perspective on software systems. By contrast, resilience engineering types actively look to biology for inspiration on understanding resilience in complex adaptive systems. A great example of this is the late Richard Cook’s talk on The Resilience of Bone.

And yet, the two fields both have something in common: they both recognize the value of creating explicit models of aspects of systems that are not typically modeled.

You use formal methods to build a model of some aspect of your software system, in order to help you reason about its behavior. A formal model of a software system is a partial one, typically only a very small part of the system. That’s because it takes effort to build and validate these models: the larger the model, the more effort it takes. We typically focus our models on a part of the system that humans aren’t particularly good at reasoning about unaided, such as concurrent or distributed algorithms.

The act of creating an explicit model and observing its behavior with a model checker gives you a new perspective on the system being modeled, because the explicit modeling forces you to think about aspects that you likely wouldn’t have considered. You won’t say “I never imagined X could happen” when building this type of formal model, because it forces you to explicitly think about what would happen in situations that you can gloss over when writing a program in a traditional programming language. While the scope of a formal model is small, you have to exhaustively specify the thing within the scope you’ve defined: there’s no place to hide.
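To make the “no place to hide” point concrete, here’s a toy illustration in Python (rather than a real specification language like TLA+) of what exhaustive state exploration buys you: two threads each perform a non-atomic read-then-write increment, and enumerating every interleaving surfaces the lost-update outcome that’s easy to gloss over when just reading the code.

```python
from itertools import permutations

# Each "thread" is a sequence of atomic steps on shared state.
# A non-atomic increment is two steps: read the counter, then write back.
def run(interleaving):
    counter = 0
    local = {}  # per-thread temporary holding the value it read
    for thread, step in interleaving:
        if step == "read":
            local[thread] = counter
        else:  # "write"
            counter = local[thread] + 1
    return counter

# Exhaustively enumerate all interleavings of two threads,
# keeping only orderings where each thread reads before it writes.
steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
valid = set()
for order in permutations(steps):
    if order.index(("A", "read")) < order.index(("A", "write")) and \
       order.index(("B", "read")) < order.index(("B", "write")):
        valid.add(run(order))

print(sorted(valid))  # -> [1, 2]: the lost update (1) is reachable
```

The exhaustive enumeration forces you to confront the interleaving where both threads read 0 before either writes, which is exactly the kind of case a real model checker refuses to let you gloss over.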

Resilience engineering is also concerned with explicit models, in two different ways. In one way, resilience engineering stresses the inherent limits of models for reasoning about complex systems (cf. itsonlyamodel.com). Every model is incomplete in potentially dangerous ways, and every incident can be seen through the lens of model error: some model that we had about the behavior of the system turned out to be incorrect in a dangerous way.

But beyond the limits of models, what I find fascinating about resilience engineering is the emphasis on explicitly modeling aspects of the system that are frequently ignored by traditional analytic perspectives. Two kinds of models that come up frequently in resilience engineering are mental models and models of work.

A resilience engineering perspective on an incident will look to make explicit aspects of the practitioners’ mental models, both in the events that led up to that incident, and in the response to the incident. When we ask “How did the decision make sense at the time?“, we’re trying to build a deeper understanding of someone else’s state of mind. We’re explicitly trying to build a descriptive model of how people made decisions, based on what information they had access to, their beliefs about the world, and the constraints that they were under. This is a meta sort of model, a model of a mental model, because we’re trying to reason about how somebody else reasoned about events that occurred in the past.

A resilience engineering perspective on incidents will also try to build an explicit model of how work happens in an organization. You’ll often hear the shorthand phrase work-as-imagined vs work-as-done to get at this modeling, where it’s the work-as-done that is the model that we’re after. The resilience engineering perspective asserts that the documented processes of how work is supposed to happen is not an accurate model of how work actually happens, and that the deviation between the two is generally successful, which is why it persists. From resilience engineering types, you’ll hear questions in incident reviews that try to elicit some more details about how the work really happens.

Like in formal methods, resilience engineering models only get at a small part of the overall system. There’s no way we can build complete models of people’s mental models, or generate complete descriptions of how they do their work. But that’s ok. Because, like the models in formal methods, the goal is not completeness, but insight. Whether we’re building a formal model of a software system, or participating in a post-incident review meeting, we’re trying to get the maximum amount of insight for the modeling effort that we put in.

Resilience: some key ingredients

Brian Marick posted on Mastodon the other day about resilience in the context of governmental efficiency. Reading that inspired me to write about some more general observations about resilience.

Now, people use the term resilience in different ways. I’m using resilience here in the following sense: how well a system is able to cope when it is pushed beyond its limits. Or, to borrow a term from safety researcher David Woods, when the system is pushed outside of its competence envelope. The technical term for this sense of the word resilience is graceful extensibility, which also comes from Woods. This term is a marriage of two other terms: graceful degradation, and software extensibility.

The term graceful degradation refers to the behavior of a system which, when it experiences partial failures, can still provide some functionality, even though it’s at a reduced fidelity. For example, for a web app, this might mean that some particular features are unavailable, or that some percentage of users are not able to access the site. Contrast this with a system that just returns 500 errors for everyone whenever something goes wrong.
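As a sketch of graceful degradation (all function names here are hypothetical, for illustration): a fallback chain tries the full-fidelity path first, then serves something reduced rather than an error.

```python
def get_recommendations(user_id, fetch_personalized, fetch_popular):
    """Serve personalized results if we can, fall back to a generic list,
    and in the worst case return an empty list instead of failing the page."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        pass  # personalization service is down: degrade, don't 500
    try:
        return fetch_popular()
    except Exception:
        return []  # reduced fidelity: the page renders without this widget

# Simulate the personalization service being unavailable:
def broken(_user_id):
    raise ConnectionError("recommendation service unavailable")

print(get_recommendations(42, broken, lambda: ["top-1", "top-2"]))
# -> ['top-1', 'top-2']
```

The user still gets a working page with generic recommendations instead of a 500 error, which is the reduced-fidelity behavior described above.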

We talk about extensible software systems as ones that have been designed to make it easy to add new features in the future that were not originally anticipated. A simple example of software extensibility is the ability for old code to call new code, with dynamic binding being one way to accomplish this.
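A minimal sketch of that idea: the dispatch loop below is “old code” that never changes, while handlers registered later are “new code” that it can call through dynamic lookup (the names here are illustrative, not from any particular system).

```python
# "Old code": a dispatcher written once, with no knowledge of future commands.
HANDLERS = {}

def register(name):
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap

def dispatch(name, *args):
    # Dynamic binding: the handler is looked up at call time,
    # so commands added after this code shipped still work.
    return HANDLERS[name](*args)

# "New code": added later, without touching the dispatcher.
@register("greet")
def greet(who):
    return f"hello, {who}"

print(dispatch("greet", "world"))  # -> hello, world
```

The dispatcher never needed to be modified or redeployed to support the new command; that late binding is what makes the system extensible.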

Now, putting those two concepts together, if a system encounters some sort of shock that it can’t handle, and the system has the ability to extend itself so that it can now handle the shock, and it can make these changes to itself quickly enough that it minimizes the harms resulting from the shock, then we say the system exhibits graceful extensibility. And if it can keep extending itself each time it encounters a novel shock, then we say that the system exhibits sustained adaptability.

The rest of this post is about the preconditions for resilience. I’m going to talk about resilience in the context of dealing with incidents. Note that all of the topics described below come from the resilience engineering literature, although I may not always use the same terminology.

Resources

As Brian Marick observed in his toot:

As we discovered with Covid, efficiency is inversely correlated with resilience.

Here’s a question you can ask anyone who works in the compute infrastructure space: “How hot do you run your servers?” Or, even more meaningfully, “How much headroom do your servers have?”

Running your servers “hotter” means running at a higher CPU utilization. This means that you pack more load onto fewer servers, which is more efficient. The problem is that the load is variable, which means that the hotter you run the servers, the more likely a server will get overloaded if there is a spike in load. An overloaded server can lead to an incident, and incidents are expensive! Running your servers at maximum utilization is running with zero headroom. We deliberately run our servers with some headroom to be able to handle variation in load.
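A back-of-the-envelope version of that tradeoff (the numbers are made up for illustration): your target utilization determines how large a load spike you can absorb before a server saturates.

```python
def max_spike(target_utilization):
    """Largest multiplicative load spike a server can absorb before
    hitting 100% CPU, if it normally runs at target_utilization (0-1)."""
    return 1.0 / target_utilization

# Running "hot" at 90% utilization tolerates only a ~1.11x spike in load;
# running cooler at 60% tolerates a ~1.67x spike.
for u in (0.9, 0.6):
    print(f"{u:.0%} utilization -> {max_spike(u):.2f}x spike headroom")
```

The efficiency-versus-resilience tension is right there in the arithmetic: every point of utilization you squeeze out for efficiency shrinks the spike you can survive.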

We also see the idea of spare resources in what we call failover scenarios, where there’s a failure in one resource so we switch to using a different resource, such as failing over a database from primary to secondary, or even failing out of a geographical region.

The idea of spare resources is more general than hardware. It applies to people as well. The equivalent of headroom for humans is what Tom DeMarco refers to as slack. The more loaded humans are, the less well positioned they are to handle spikes in their workload. Stuff falls through the cracks when you’ve got too much load, and some of that stuff contributes to incidents. We can also even keep people in reserve for dealing with shocks, such as when an organization staffs a dedicated incident management team.

A common term that the safety people use for spare resources is capacity. I really like the way Todd Conklin put it on his Pre-Accident Investigation Podcast: “You don’t manage risk. You manage the capacity to absorb risk.” Another way he put it is “Accidents manage you, so what you really manage is the capacity for the organization to fail safely.”

Flexibility

Here’s a rough and ready definition of an incident: the system has gotten itself into a bad state, and it’s not going to return to a good state unless somebody does something about it.

Now, by this definition, for the system to become healthy again something about how the system works has to change. This means we need to change the way we do things. The easier it is to make changes to the system, the easier it will be to resolve the incident.

We can think of two different senses of changing the work of the system: the human side and the software side.

Humans in a system are constrained by a set of rules that exist to reduce risk. We don’t let people YOLO code from their laptops into production, because of the risks that doing so would expose us to. But incidents create scenarios where the risks associated with breaking these rules are lower than the risks associated with prolonging the incident. As a consequence, people in the system need the flexibility to be able to break the standard rules of work during an incident. One way to do this is to grant incident responders autonomy: let them make judgments about when they are able to break the rules that govern normal work, in scenarios where breaking the rule is less risky than following it.

Things look different on the software side, where all of the rules are mechanically enforced. For flexibility in software, we need to build into the software functionality in advance that will let us change the way the system behaves. My friend Aaron Blohowiak uses the term Jefferies tubes from Star Trek to describe features that support making operational changes to a system. These were service crawlways that made it easier for engineers to do work on the ship.

A simple example of this type of operational flexibility is putting in feature flags that can be toggled dynamically in order to change system behavior. At the other extreme is the ability to bring up a REPL on a production system in order to make changes. I’ve seen this multiple times in my career, including watching someone use the rails console command of a Ruby on Rails app to resolve an issue.
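As a concrete illustration of the feature-flag end of that spectrum, here is a minimal sketch of a flag store that can be flipped at runtime. All of the names here are hypothetical, and a real system would back the store with a config service or database so that flips propagate to every running instance without a redeploy:

```python
import threading

class FeatureFlags:
    """Thread-safe in-memory flag store that can be toggled at runtime.

    A real implementation would be backed by a config service or
    database so that toggles reach every instance without a deploy.
    """

    def __init__(self):
        self._flags = {}
        self._lock = threading.Lock()

    def set(self, name, enabled):
        # During an incident, an operator calls this to change behavior
        # dynamically, without shipping new code.
        with self._lock:
            self._flags[name] = bool(enabled)

    def is_enabled(self, name, default=False):
        with self._lock:
            return self._flags.get(name, default)

flags = FeatureFlags()

# Hypothetical code path guarded by a flag: flipping the flag off
# reverts the system to the old, known-good behavior.
def handle_request():
    if flags.is_enabled("new_cache_layer"):
        return "new path"
    return "old path"

flags.set("new_cache_layer", True)
print(handle_request())               # prints "new path"
flags.set("new_cache_layer", False)   # incident mitigation: revert
print(handle_request())               # prints "old path"
```

The operational value is in the second `set` call: behavior changes immediately, with no build or deploy in the critical path of incident response.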

The technical term in resilience engineering for systems that possess this type of flexibility is adaptive capacity: the system has built up the ability to be able to dynamically reconfigure itself, to adapt, in order to meet novel challenges. This is where the name Adaptive Capacity Labs comes from.

Expertise

In general, organizations push against flexibility because it brings risk. In the case where I saw someone bring up a Ruby on Rails console, I was simultaneously impressed and terrified: that’s so dangerous!

Because flexibility carries risk, we need to rely on judgment as to whether the risk of leveraging the flexibility outweighs the risk of not using the flexibility to mitigate the incident. Granting people the autonomy to make those judgment calls isn’t enough: the people making the calls need to be able to make good judgment calls. And for that, you need expertise.

The people making these calls are having to make decisions balancing competing risks while under uncertainty and time pressure. In addition, how fluent they are with the tools is a key factor. I would never trust a novice with access to a REPL in production. But an expert? By definition, they know what they’re doing.

Diversity

Incidents in complex systems involve interactions between multiple parts of the system, and there’s no one person in your organization who understands the whole thing. To know what to do during an incident, you need to bring in different people who understand different parts of the system to help figure out what is happening. You need diversity in your responders: people with different perspectives on the problem at hand.

You also want diversity in diagnostic and mitigation strategies. Some people might think about recent changes, others might think about traffic pattern changes, others might dive into the codebase looking for clues, and yet others might look to see if there’s another problem going on right now that seems to be related. In addition, it’s often not obvious what the best course of action is to mitigate an incident. Responders often pursue multiple courses of action in parallel, hoping that at least one of them will bring the system back to a healthy state. A diversity of perspectives can help generate more potential interventions, reducing the time to resolve.

Coordination

Having a group of experts with a diverse set of perspectives by itself isn’t enough to deal with an incident. For a system to be resilient, the people within the system need to be able to coordinate, to work together effectively.

If you’ve ever dealt with a complex incident, you know how challenging coordination can be. Things get even hairier in our distributed world. Whether you’re physically located with all of the responders, you’re on a Zoom call (a bridge, as we still say), you’re messaging over Slack, or some hybrid combination of all three, each type of communication channel has its benefits and drawbacks.

There are prescriptive approaches to improving coordination during incidents, such as the Incident Command System (ICS). However, Laura Maguire’s research has shown that, in practice, incident responders intentionally deviate from ICS to better manage coordination costs. This is yet another example of flexibility and expertise being employed to deal with an incident.


The next time you observe an incident, or you reflect on an incident where you were one of the responders, think back on the extent to which these ingredients were present or absent. Were you able to leverage spare resources, or did you suffer from not being able to? Were there operational changes that people wanted to be able to make during the incident, and were they actually able to make them? Were the responders experienced with the sub-systems they were dealing with, and how did that shape their responses? Did different people come up with different hypotheses and strategies? Was it clear to you what the different responders were doing during the incident? These issues are easy to miss if you’re not looking for them. But, once you internalize them, you’ll never be able to unsee them.