Fixation: the ever-present risk during incident handling

Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.

Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, where in fact shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.

The search for the keys

During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).

Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.

The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p 13, PDF p48):

By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.

If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 PM, all of the way until 12:45 PM.

The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:

At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.

Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):

Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)

Fixation

In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communicated-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):

The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.

More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.

The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but on searching for keys to the exclusion of other activities.

During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.

Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.

Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.

Nothing fails like a history of success

The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking

Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multiple hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):

Specifically, network engineers were performing a cutover to a new network switch at
Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.

This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):

Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.

It also reminded me about the 2022 Rogers Telecommunications outage in Canada (emphasis mine, [redacted] comments in the original):

Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.

Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.

So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.

The problems that accountability can’t fix

Accountability is a mechanism that achieves better outcomes by aligning incentives, in particular, negative ones. Specifically: if you do a bad thing, or fail to do a good thing, under your sphere of control, then bad things will happen to you. I recently saw several LinkedIn posts that referenced the U.S. Coast Guard report on the OceanGate experimental submarine implosion. These posts described how this incident highlights the importance of accountability in leadership. And, indeed, the report itself references accountability five times.

However, I think this incident is an example of a type of problem where accountability doesn’t actually help. Here I want to talk about two classes of problems where accountability is a poor solution to addressing the problem, where the OceanGate accident falls into the second class.

Coordination challenges

Managing a large organization is challenging. Accountability is a popular tool in such organizations to ensure that work actually gets done, by identifying someone who is designated as the stuckee for ensuring that a particular task or project gets completed. I’ll call this top-down accountability. This kind of accountability is sometimes referred to, unpleasantly, as the “one throat to choke” model.

For this model to work, the problem you’re trying to solve needs to be addressable by the individual that is being held accountable for it. Where I’ve seen this model fall down is in post-incident work. As I’ve written about previously, I’m a believer in the resilience engineering model of complex systems failures, where incidents arise due to unexpected interactions between components. These are coordination problems, where the problems don’t live in one specific component, but, rather, how the components interact with each other.

But this model of accountability demands that we identify an individual to own the relevant follow-up incident work. And so it creates an incentive to always identify a root cause service, which is owned by the root cause team, who are then held accountable for addressing the issue.

Now, just because you have a coordination problem, that doesn’t mean that you don’t need an individual to own driving the reliability improvements around it. In fact, that’s why technical project managers (known as TPMs) exist. They act as the accountable individuals for efforts that require coordination across multiple teams, and every large tech organization that I know of employs TPMs. The problem I’m highlighting here, such as in the case of incidents, is that accountability is applied as a solution without recognizing that the problem revealed by the incident is a coordination problem.

You can’t solve a coordination problem by identifying one of the agents involved in the coordination and making them accountable. You need someone who is well-positioned in the organization, recognizes the nature of the problem, and has the necessary skills to be the one who is accountable.

Miscalibrated risk models

The other way people talk about accountability is about holding leaders such as politicians and corporate executives responsible for their actions, where there are explicit consequences for them acting irresponsibly, including actions such as corruption, or taking dangerous risks with the people and resources that have been entrusted to them. I’ll call this bottom-up accountability.

The bottom-up accountability enforcement tool of choice in France, circa 1792

This brings us back to the OceanGate accident of June 18, 2023. In this accident, the TITAN submersible imploded, killing everyone aboard. One of the crewmembers who died was Stockton Rush, who was both pilot of the vessel and CEO of OceanGate.

The report is a scathing indictment of Rush. In particular, it criticizes how he sacrificed safety for his business goals, ran an organization that lacked that the expertise required to engineer experimental submersibles, promoted a toxic workplace culture that suppressed signs of trouble instead of addressing them, and centralized all authority in himself.

However, one thing we can say about Rush was that he was maximally accountable. After all, he was both CEO and pilot. He believed so much that TITAN was safe that he literally put his life on the line. As Nassim Taleb would put it, he had skin in the game. And yet, despite this accountability, he still took irresponsible risks, which led to disaster.

By being the pilot, Rush personally accepted the risks. But his actual understanding of the risk, his model of risk, was fundamentally incorrect. It was wrong, dangerously so.

Rush assessed the risk index of the fateful dive at 35. The average risk index of previous dives was 36.

Assigning accountability doesn’t help when there’s an expertise gap. Just as giving a software engineer a pager does not bestow up them the skills that they need to effectively do on-call operations work, having the CEO of OceanGate also be the pilot of the experimental vehicle did not lead to him being able to exercise better judgment about safety.

Rush’s sins weren’t merely lack of expertise, and the report goes into plenty of detail about his other management shortcomings that contributed to this incident. But, stepping back from the specifics of the OceanGate accident, there’s a greater point here that making executives accountable isn’t sufficient to avoid major incidents, if the risk models that executives use to make decisions are are out of whack with the actual risks. And by risk models here, I don’t just mean some sort of formal model like the risk assessment matrix above. Everyone carries with them an implicit risk model in their heads: this is a mental risk model.

Double binds

While the CEO also being a pilot sounds like it should be a good thing for safety (skin in the game!), it also creates a problem that the resilience engineering folks refer to as a double bind. Yes, Rush had strong incentives to ensure he wasn’t taking stupid risks, because otherwise he might die. But he also had strong incentives to keep the business going, and those incentives were in direct conflict with the safety incentives. But double-binds are not just an issue for CEO-pilots, because anyone in the organization will feel pressure from above to make decisions in support of the business, which may cut against safety. Accountability doesn’t solve the problem of double-binds, it exacerbates them, by putting someone on the hook for delivering.

Once again, from the resilience engineering literature, one way to deal with this problem is through cross-checks. For example, see the paper Collaborative Cross-Checking to Enhance Resilience by Patterson, Woods, Cook, and Render. Instead of depending on a single individual (accountability), you take advantage of the different perspectives of multiple people (diversity).

You also need someone who is not under a double-bind who has the authority to say “this is unsafe”. That wasn’t possible at OceanGate, where the CEO was all-powerful, and anybody who spoke up was silenced or pushed out.

On this note, I’ll leave you with a six-minute C-SPAN video clip from 2003. In this clip, the resilience engineering David Woods spoke at a U.S. Senate hearing in the wake of the Columbia accident. Here he was talking about the need for an independent safety organization at NASA as a mechanism for guarding against the risks that emerge from double binds.

(I could not get it to embed, here’s the link: https://www.c-span.org/clip/senate-committee/user-clip-david-woods-senate-hearing/4531343)

(As far as I know, the new independent safety organization that Woods proposed was not created)

Easy will always trump simple

One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increase fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”

One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.

Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.

We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, where easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, and easy is relative to what the software designer already knows.

I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilts the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.

Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at-hand. Richard Hipp implemented the SQLite v1 as a compiler that would translate SQL into byte code, because he had previous experience building compilers but not building database engines. The second example is about an exaptation, leveraging an implementation that was at-hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5).

Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.

But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.

Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that individuals don’t have to understand the system for them to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.

Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on their own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.

To circle back to biology: evolution is terrible at designing simple systems: I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable, they’re just baked in to the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked in to how we build and involved systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at-hand. We’re going to need them.

Component defects: RCA vs RE

Let’s play another round where contrast the root-cause-analysis (RCA) perspective to the resilience engineering (RE) perspective. Today’s edition is about the distribution of potentially incident-causing defects across the different components in the system. Here, I’m using RCA nomenclature, since the kinds of defects that an RCA advocate would refer to as a “cause” in the wake of an incident would be called a “contributor” by the RE folks.

Here’s a stylized view of the world from the RCA perspective:

RCA view of distribution of potential incident-causing defects in the system

Note that there are a few particularly problematic components: we should certainly focus our reliability efforts on figuring out which of the components we should be spending our reliability efforts on improving!

Now let’s look at the RE perspective:

RE view of distribution of potential incident-contributing defects in the system

It’s a sea of red! The whole system is absolutely shot through with defects that could contribute to an incident!

Under the RE view, the individual defects aren’t sufficient to cause an incident. Instead, it’s an interaction of these defects with other things, including other defects. Because incidents arise due to interactions, RE types will stress the importance of understanding interactions across components over the details of the specific component that happened to contain the defect that contributed to the outage. After all, according to RE folks, those defects are absolutely everywhere. Focusing on one particular component won’t yield significant improvements under this model.

If you want to appreciate the RE perspective, you need to develop an understanding how it can be that the system is up right now despite the fact that it is absolutely shot through with all of these potentially incident-causing defects, as the RCA folks would call them. As an RE type myself, I believe that your system is up right now, and that it already contains the defect that will be implicated in the next incident. After that incident happens, the tricky part isn’t identifying the defect, it’s appreciating how the defect alone wasn’t enough to bring it down.

“What went well” is more than just a pat on the back

When writing up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it, how the responders brought the overloaded system back to a healthy state.

Full report for yesterday’s Google outage

– not feature flagging a new code path
– null pointer crashing a binary when reading empty fields from Spanner
– remediation causing a thundering herd as there was no exponential backoff
– which required manual throttling to recover from pic.twitter.com/6AQ4BQk5ex
— Cindy Sridharan (@copyconstruct) June 14, 2025

Which brings me to the topic of this post: the “what went well” section of an incident write-up. Generally, public incident write-ups don’t have such sections. This is almost certainly for rational political reasons: it would be, well, gauche to recount to your angry customers about what a great job you did handling the incident. However, internal write-ups often have such sections, and that’s my focus here.

In my experience, “What went well” is typically the shortest section in the entire incident report, with a few brief bullet points that point out some positive aspects of the response (e.g., people responded quickly). It’s a sort of way-to-go!, a way to express some positive feedback to the responders on a job well done. This is understandable, as people believe that if we focus more on what went wrong than what went well, then we are more likely to improve the system, because we are focusing on repairing problems. This is why “what went wrong” and “what can we do to fix it” takes the lion’s share of the attention.

But the problem with this perspective is that it misunderstands the skills that are brought to bear during incident response, and how learning from a previously well-handled incident can actually help other responders do better in future incidents. Effective incident response happens because the responders are skilled. But every incident response team is an ad-hoc one, and just because you happened to have people with the right set of skills responding last time, doesn’t mean you’ll have the people with the right set the next time. This means that if you gloss over what went well, your next incident might be even worse than the last one, because you’ve described those future responders of the opportunity to learn from observing the skilled responders last time.

To make this more concrete, let’s look back at that the GCP incident report. In this scenario, the engineers had put in a red-button as a safety precaution and exercised it to remediate the audience.

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path… Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place.

However, that’s not the part that interests me so much. Instead, it’s the part about how the infrastructure became overloaded as a consequence of the remediation, and how the responders recovered from overload.

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure…. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.

This was not a failure scenario that they had explicitly designed for in advance of deploying the change: there was no red-button they could simply exercise to roll back the system to a non-overloaded state. Instead, they were forced to improvise a solution based on the controls that were available to them. In this case, they were able to reduce the load by turning down the rate of task creation, as well as by re-routing traffic away from the overloaded database.

And this sort of work is the really interesting bit an incident: how skilled responders are able to take advantage of generic functionality that is available in order to remediate an unexpected failure mode. This is one of the topics that the field of resilience engineering focuses on, how incident responders are able to leverage generic capabilities during a crunch. If I was an engineer at Google in this org, I would be very interested to learn what knobs are available and how to twist them. Describing this in detail in an incident write-up will increase my chances of being able to leverage this knowledge later. Heck, even just leaving bread crumbs in the doc will help, because I’ll remember the incident, look up the write-up, and follow the links.

Another enormously useful “what went well” aspect that often gets short shrift is a description of the diagnostic work: how the responders figured out what was going on. This never shows up in public incident write-ups, because the information is too proprietary, so I don’t blame Google for not writing about how the responders determined the source of the overload. But all too often these details are left out of the internal write-ups as well. This sort of diagnostic work is a crucial set of skills for incident response, and having the opportunity to read about how experts applied their skills to solve this problem help transfers these skills across the organization.

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong. While every incident is different, the generic skills are common, and so getting better at response will get you more mileage than preventing repeats of previous incidents. You’re going to keep having incidents over and over. The best way to get better at incident handling is to handle more incidents yourself. The second best way is to watch experts handle incidents. The better you do at telling the stories of how your incidents were handled, the more people will learn about how to handle incidents.

Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that the questions I have shouldn’t be explicitly seen as a critique as of the write-up, as the answers to the questions generally aren’t publicly shareable. They’re more in the “I wish I could be a fly on the wall inside of Google” questions.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This the land mine risk: when the code you are rolling out contains a land mine which is not tripped during the rollout.

How did the decisions make sense at the time?

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense on how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? By definition, Without knowing its history, we can’t really understand how its backoff implementation came to be this way.

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged it was big-red-buttoned. There’s a story here! But I don’t know what it is.

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions.. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is this an internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change that happened since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from from the text was that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal mode, and it’s generally very difficult to test for how it will behave when it’s in recovery mode.

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or
Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always have of corrective actions are:

What are the tradeoffs involved in implementing these corrective actions?
How might they enable new failure modes or make future incidents more difficult to deal with?

AI at Amazon: a case study of brittleness

A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, which is a term that the researchers use to refer to as a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

Decompensation: exhausting capacity to adapt as challenges cascade
Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

It would take weeks to get access to any internal data for analysis or experiments…
Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation system failure. This happens when a system isn’t able to keep up with the demands that are placed upon on it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.

As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.

Working at cross-purposes

Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.

This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.

The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.

My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.

Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s a risk of alignment: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.

In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.

Getting stuck in outdated behaviors

Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.

That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.

So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.

…

This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.

I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.

In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.

I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability had significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for the system you have thought about where its reliability increased over time, think about what the complexity trend looks like over time for that system. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. As someone who grew up in the 80s and 90s, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one that most people have in mind when they think about increasing complexity in their system, is synonymous with the idea of tech debt. With tech debt the increase in complexity makes the system less reliable, because the risk of making a breaking change in the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely that you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Djikstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improves reliability, but come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It? and think about the work that we do to make our software systems more reliable, functionality like retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and auxiliary systems we have to support our reliability work like an observability stack. All of this stuff adds complexity.

Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion of code would make the codebase simpler. There’s a reason that programming books tend to avoid error handling cases in their examples, they do increase complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.

Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems which can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-boggling-y complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.

Not causal chains, but interactions and adaptations

I’ve been a bit of an anti-root-cause-analysis (RCA) tear lately. On LinkedIn, health and safety expert Vincent Theobald-Vega left a thoughtful, detailed comment on my last post. In his comment, he noted that RCA done poorly leads to bad results, and he pointed me to what he described as a good guide to using the RCA approach: Investigating accidents and incidents. This is a free book published by the UK Health and Safety Executive.

However, after reading this guide, my perception of RCA has not improved. I still believe that RCA is based on a fundamentally incorrect model of complex systems failure. To clarify my thinking, I’ve tried to explain where I see its flaws in this post.

A quick note on terminology: while the guide uses the term accident, I’m going to use the term incident instead, to remain consistent with the usage in the software domain. The guide uses incident to refer to near misses.

Some content in the guide that I appreciated

While I disagree with RCA as described in the guide, I wanted to start by pointing out areas of agreement I had with the guide.

Not just a single cause

The guide does note that there are multiple causes involved in incidents. It notes that adverse events have many causes (p6), and it also mentions that Very often, a number of chance occurrences and coincidences combine to create the circumstances in which an adverse event can happen. All these factors should be recorded here in chronological order, if possible. (p10).

While I disagree with the causal language, I do at least appreciate that it points out there are multiple factors.

Examine how work is really done

The guide does talk about examining the work and the context under which it takes place. Under “information and insights gained from an investigation”, one of the bullet points is A true snapshot of what really happens and how work is really done (p7).

Under the “Gathering detailed information: How and what? section, the guide asks What activities were being carried out at the time? and Was there anything unusual or different about the working conditions?” (p15)

“Human error” is not a valid conclusion

The guide is opposed to the idea of human error being identified as a cause. It notes that Investigations that conclude that operator error was the sole cause are rarely acceptable. Underpinning the ‘human error’ there will be a number of underlying causes that created the environment in which human errors were inevitable. (p10)

Examine your near misses

Finally, the guide does point out the value in investigating near misses, noting that While the argument for investigating accidents is fairly clear, the need to investigate
near misses and undesired circumstances may not be so obvious. However, investigating near misses and undesired circumstances is as useful, and very much easier than investigating accidents. (p8)

The RCA model of incidents

Here’s my attempt to sketch out a conceptual model of how incidents happened, according to the guide.

The guide distinguishes between three different types of causes:

Immediate cause – the most obvious reason why an adverse event happens (p4)
Underlying cause – the less obvious ‘system’ or ‘organisational’ reason for an adverse event happening (p5)
Root cause – an initiating event or failing from which all other causes or failings spring. Root causes are generally management, planning or organisational failings (p5).

The idea is that there is a causal chain from root cause to underlying cause to immediate cause. A combination of these immediate causes, along with chance occurrences and coincidences, combine to enable the incident.

The guide uses the metaphor of a sequence of dominos to describe this causal chain, where the initial domino (labeled “A” in the diagram below) is a root cause, and the domino labeled “B” an immediate cause.

Source: Investigating accidents and incidents, UK Health and Safety Executive, figure 4, p6

If left unaddressed, these root causes will lead to multiple incidents in the future. Hence, the goal of an RCA is to identify and eliminate the root causes in order to prevent recurrence of the incident:

The same accidents happen again and again, causing suffering and distress to an ever-widening circle of workers and their families… The findings of the investigation will form the basis of an action plan to prevent the accident or incident from happening again… (p4, emphasis mine)

To get rid of weeds you must dig up the root. If you only cut off the foliage, the weed will grow again. Similarly it is only by carrying out investigations which identify root causes that organisations can learn from their past failures and prevent future failures.(p9, emphasis mine)

The RE model of incidents

My claim is that the RCA model of incidents is dangerously incorrect about the nature of failure in complex systems. More importantly, these flaws in the RCA model lead to sub-optimal outcomes for incident investigations. In other words, we can do a lot better than RCA if we have a different model about how incidents happen.

The best way to illustrate this is to describe an alternative model that I believe more accurately models complex systems failures, and results in better investigation outcomes. I’m going to call it the resilience engineering (RE) model in this blog post, partly to encourage folks to explore the research field of resilience engineering, and partly as a way to encourage folks to check out the Resilience in Software Foundation. But you may have heard terms associated with this model, such as the New Look, the New View, Safety-II, and Learning from Incidents (LFI). My favorite summary of the RE model is Richard Cook’s very short paper How Complex Systems Fail.

Not causes but interactions

Where RCA treats causes as the first class entities of an incident, RE instead treats interactions as the first-class entity. It is the unexpected interactions of the different components in a complex system that enables the incident to occur.

Note that there’s no causal chain in this model. Instead, it’s an ever-branching web of contributing factors, which each factor is itself is influenced potentially influenced by other factors, and so on. I like how John Allspaw uses the expression the infinite hows to draw a contrast to the causal chain view of five whys. I once proposed the metaphor of the Gamma knife as a way to imagine how these interactions come together to enable an incident.

Labeling the behavior of the individual components as causes is dangerous because it obscures the fact that the problem was not the individual components themselves but that separate subsystems interacted in ways that were unpredictable and harmful. Modern software systems are essentially control systems with multiple feedback loops, and it’s effectively impossible for humans to predict how these loops are going to interact with each other and with the range of possible inputs we might throw at them. You don’t have to look any further than Kubernetes to understand both the value and the surprising behavior of feedback systems.

Under the RE model, incidents are perfect storms of complex interactions across multiple components under a particular set of circumstances. Even though this incident revealed a dangerous interaction between components A and B, the next incident may be an interaction between components D and E, and the D-E interaction may be even more likely to occur than the A-B one to re-occur.

In addition, changing the behavior of components A or B might enable new failure modes by creating the opportunity for new unexpected interactions with other components, even though it has prevented the A-B interaction.

Adaptations to compensate for existing faults

Here’s a different picture. Imagine your system as a collection of components, which I’ve denoted here as rounded boxes. To keep things simple, I’m not going to show the interactions

A collection of components that are part of your system.

Now, imagine that you experience an incident, you do an RCA, and you identify as the underlying causes that two of the components behaved incorrectly in some way. There was a fault in those components that wasn’t noticed before the incident.

The RCA reveals that the underlying causes were the behavior of the two components, shaded here in red

The RCA model would look for the root cause of these faults, perhaps a problem in the way that these components were validated. For example, perhaps there was a certain type of testing that wasn’t done, and that’s how the problem went undetected. As a result, not only would these two components be fixed, but we would also have improved the process by which we verify components, meaning fewer component problems in the future.

Now, let’s look at the RE model. This model tells us that there are what Cook calls latent failures that are distributed throughout the system: they’re there, but we don’t know where they are. Sometimes these latent failures are referred to as faults.

In addition to the known failures in red, there are a large number of unseen latent failures

Despite the presence of all of these faults in our complex system, the system actually functions most of the time. Cook describes this by observing that complex systems are heavily and successfully defended against failure and complex systems run in degraded mode. Even though your system is riddled with faults, it still functions well enough to be useful, although it never functions perfectly.

This is actually one of the secrets of services that seem reliable to their end users. It’s not that they never encounter problems, it’s that they are able to compensate for those problems in order to keep working correctly. In the RE model, successful complex systems are always fault-tolerant, because they need to be in order to succeed.

Because there are so many latent failures, and they change over time, the RCA approach (find a root cause, and root it out) doesn’t work under the RE model to generate continuous improvement. Because an incident was due to a random combination of multiple latent failures, and because there are so many of these failures, simply eliminating the recurrence of a specific combination doesn’t buy you much: the future incidents are very likely to be different because they’ll involve novel combinations of latent failures that you don’t see.

In contrast, the RE approach emphasizes the idea of identifying how your system adapts to succeed in the presence of all these faults. The desired outcomes of this approach are to increase your ability to continue to adapting to faults in the future, as well as to find areas in your system where you are less able to adapt effectively. It means understanding that your system is fault tolerant, and using incidents to understand how the people in your system are able to adapt to deal with faults.

This also includes understanding how those adaptations can fail to keep the system running. Because, when an incident happened, those adaptations weren’t sufficient. But there’s a huge difference between “this process led to a fault and so it needs to be changed” (RCA) and “the way we normally work is typically effective at working around problem X, but it didn’t work in these particular circumstances because Y and Z and …”

The RCA approach is about finding the generators of faults in your organization and removing them. The RE approach is about finding the sources of fault tolerance in your organization so you can nurture and grow them. The RE folks call this adaptive capacity. Remember, your system contains a multitude of undiscovered faults, and those faults will ultimately result in surprising incidents, no matter how many root causes you identify and eliminate. Consider trying the RE approach. After all, you’re going to need all of the fault tolerance you can get.