Here’s a thought experiment that John Allspaw related to me, in paraphrased form (John tells me that he will eventually capture this in a blog post of his own, at which time I’ll put a proper link).
Consider a small-ish tech company that has four engineering teams (A, B, C, D), where an engineer from Team A was involved in an incident (in John’s telling, the incident involves the Norway problem). In the wake of this incident, a post-incident write-up is completed, and the write-up does a good job of describing what happened. Next, imagine that the write-up is made available to teams A, B, and C, but not to team D. Nobody on team D is allowed to read the write-up, and nobody from the other teams is permitted to speak to team D about the details of the incident. The question is: are the members of team D at a disadvantage compared to the other teams?
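For readers unfamiliar with the Norway problem: in YAML 1.1, unquoted values such as NO, Yes, and off are parsed as booleans, so the country code for Norway can silently become false. Here’s a minimal illustration (using PyYAML, which implements YAML 1.1):

```python
import yaml  # PyYAML follows YAML 1.1, where NO/no/off resolve to booleans

doc = """
countries:
  - GB
  - FR
  - NO
"""

print(yaml.safe_load(doc))
# {'countries': ['GB', 'FR', False]}
# Norway's country code has silently become the boolean False.
```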
The point of this scenario is to convey the intuition that, even though team D wasn’t involved in the incident, its members can still learn something from its details that makes them better engineers.
Switching gears for a moment, let’s talk about the new tools that are emerging under the label AI SRE. We’re now starting to see more tools that leverage LLMs to try to automate incident diagnosis and remediation, such as incident.io’s AI SRE product, Datadog’s Bits AI SRE, Resolve.ai (tagline: Your always-on AI SRE), and Cleric (tagline: AI SRE teammate). These tools work by reading in signals from your organization such as alerts, metrics, Slack messages, and source code repositories.
To effectively diagnose what’s happening in your system, you don’t just want to know what’s happening right now, but you also want to have access to historical data, since maybe there was a similar problem that happened, say, a year ago. While LLMs will have been trained with a lot of general knowledge about software systems, they won’t have been trained on the specific details of your system, and your system will fail in system-specific ways, which means that (I assume!) these AI SRE systems will work better if they have access to historical data about your system.
Here’s a second thought experiment, this one my own: Imagine that you’ve adopted one of these AI SRE tools, but the only historical data of the system that you can feed your tool is the collection of your company’s post-incident write-ups. What kinds of details would be useful to an AI SRE tool in helping to troubleshoot future incidents? Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company. I bet the humans who read them would learn more that way too.
Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.
Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, when in fact the shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.
The search for the keys
During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).
Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.
The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p13, PDF p48):
By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.
If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 p.m., all the way until 12:45 p.m.
The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:
At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.
Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):
Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)
Fixation
In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communication-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):
The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.
More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.
The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but searching for keys to the exclusion of other activities.
During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.
Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.
Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.
The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking
Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multi-hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):
Specifically, network engineers were performing a cutover to a new network switch at Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.
This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):
Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.
And here’s a third example, from the report on the Rogers outage:
Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.
Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.
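To make the heuristic concrete, here’s a hypothetical sketch of a risk scorer that discounts risk based on a history of prior successes. This is purely illustrative (it is not the actual Rogers algorithm, whose details I don’t know), but it captures how a history of wins can drag an assessment from “High” down to “Low”:

```python
def assessed_risk(base_risk: float, prior_successes: int,
                  discount_per_success: float = 0.1,
                  floor: float = 0.2) -> float:
    """Hypothetical heuristic: each prior successful execution of a similar
    change shaves a fraction off the assessed risk, down to some floor."""
    discount = min(discount_per_success * prior_successes, 1 - floor)
    return base_risk * (1 - discount)

# A change initially assessed at 0.9 ("High") looks much safer after eight
# successful cutovers, even if the ninth is just as novel in the ways that matter.
for n in (0, 4, 8):
    print(n, round(assessed_risk(0.9, n), 2))
# 0 0.9
# 4 0.54
# 8 0.18
```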
So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.
Accountability is a mechanism that achieves better outcomes by aligning incentives, in particular, negative ones. Specifically: if you do a bad thing, or fail to do a good thing, under your sphere of control, then bad things will happen to you. I recently saw several LinkedIn posts that referenced the U.S. Coast Guard report on the OceanGate experimental submarine implosion. These posts described how this incident highlights the importance of accountability in leadership. And, indeed, the report itself references accountability five times.
However, I think this incident is an example of a type of problem where accountability doesn’t actually help. Here I want to talk about two classes of problems where accountability is a poor solution to addressing the problem, where the OceanGate accident falls into the second class.
Coordination challenges
Managing a large organization is challenging. Accountability is a popular tool in such organizations to ensure that work actually gets done, by identifying someone who is designated as the stuckee for ensuring that a particular task or project gets completed. I’ll call this top-down accountability. This kind of accountability is sometimes referred to, unpleasantly, as the “one throat to choke” model.
Darth Vader enforcing accountability
For this model to work, the problem you’re trying to solve needs to be addressable by the individual that is being held accountable for it. Where I’ve seen this model fall down is in post-incident work. As I’ve written about previously, I’m a believer in the resilience engineering model of complex systems failures, where incidents arise due to unexpected interactions between components. These are coordination problems, where the problems don’t live in one specific component, but, rather, how the components interact with each other.
But this model of accountability demands that we identify an individual to own the relevant follow-up incident work. And so it creates an incentive to always identify a root cause service, which is owned by the root cause team, who are then held accountable for addressing the issue.
Now, just because you have a coordination problem, that doesn’t mean that you don’t need an individual to own driving the reliability improvements around it. In fact, that’s why technical project managers (known as TPMs) exist. They act as the accountable individuals for efforts that require coordination across multiple teams, and every large tech organization that I know of employs TPMs. The problem I’m highlighting here is when, as in the case of incidents, accountability is applied as a solution without recognizing that the problem revealed by the incident is a coordination problem.
You can’t solve a coordination problem by identifying one of the agents involved in the coordination and making them accountable. You need someone who is well-positioned in the organization, recognizes the nature of the problem, and has the necessary skills to be the one who is accountable.
Miscalibrated risk models
The other way people talk about accountability is about holding leaders such as politicians and corporate executives responsible for their actions, where there are explicit consequences for them acting irresponsibly, including actions such as corruption, or taking dangerous risks with the people and resources that have been entrusted to them. I’ll call this bottom-up accountability.
The bottom-up accountability enforcement tool of choice in France, circa 1792
This brings us back to the OceanGate accident of June 18, 2023. In this accident, the TITAN submersible imploded, killing everyone aboard. One of the crewmembers who died was Stockton Rush, who was both pilot of the vessel and CEO of OceanGate.
The report is a scathing indictment of Rush. In particular, it criticizes how he sacrificed safety for his business goals, ran an organization that lacked the expertise required to engineer experimental submersibles, promoted a toxic workplace culture that suppressed signs of trouble instead of addressing them, and centralized all authority in himself.
However, one thing we can say about Rush was that he was maximally accountable. After all, he was both CEO and pilot. He believed so much that TITAN was safe that he literally put his life on the line. As Nassim Taleb would put it, he had skin in the game. And yet, despite this accountability, he still took irresponsible risks, which led to disaster.
By being the pilot, Rush personally accepted the risks. But his actual understanding of the risk, his model of risk, was fundamentally incorrect. It was wrong, dangerously so.
Rush assessed the risk index of the fateful dive at 35. The average risk index of previous dives was 36.
Assigning accountability doesn’t help when there’s an expertise gap. Just as giving a software engineer a pager does not bestow upon them the skills that they need to effectively do on-call operations work, having the CEO of OceanGate also be the pilot of the experimental vehicle did not lead to him being able to exercise better judgment about safety.
Rush’s sins weren’t merely lack of expertise, and the report goes into plenty of detail about his other management shortcomings that contributed to this incident. But, stepping back from the specifics of the OceanGate accident, there’s a greater point here: making executives accountable isn’t sufficient to avoid major incidents if the risk models that executives use to make decisions are out of whack with the actual risks. And by risk models here, I don’t just mean some sort of formal model like the risk assessment matrix above. Everyone carries with them an implicit risk model in their heads: this is a mental risk model.
Double binds
While the CEO also being a pilot sounds like it should be a good thing for safety (skin in the game!), it also creates a problem that the resilience engineering folks refer to as a double bind. Yes, Rush had strong incentives to ensure he wasn’t taking stupid risks, because otherwise he might die. But he also had strong incentives to keep the business going, and those incentives were in direct conflict with the safety incentives. But double binds are not just an issue for CEO-pilots, because anyone in the organization will feel pressure from above to make decisions in support of the business, which may cut against safety. Accountability doesn’t solve the problem of double binds; it exacerbates them by putting someone on the hook for delivering.
Once again, from the resilience engineering literature, one way to deal with this problem is through cross-checks. For example, see the paper Collaborative Cross-Checking to Enhance Resilience by Patterson, Woods, Cook, and Render. Instead of depending on a single individual (accountability), you take advantage of the different perspectives of multiple people (diversity).
You also need someone who is not under a double-bind who has the authority to say “this is unsafe”. That wasn’t possible at OceanGate, where the CEO was all-powerful, and anybody who spoke up was silenced or pushed out.
On this note, I’ll leave you with a six-minute C-SPAN video clip from 2003. In this clip, the resilience engineering researcher David Woods spoke at a U.S. Senate hearing in the wake of the Columbia accident. Here he was talking about the need for an independent safety organization at NASA as a mechanism for guarding against the risks that emerge from double binds.
One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increased fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”
One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.
Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea of DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.
We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, whereas easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, and easy is relative to what the software designer already knows.
I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilt the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.
Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at-hand. Richard Hipp implemented SQLite v1 as a compiler that would translate SQL into byte code, because he had previous experience building compilers but not building database engines. The second example is about an exaptation, leveraging an implementation that was at-hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5).
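You can still see the compiler-shaped design in SQLite today: running EXPLAIN (as opposed to EXPLAIN QUERY PLAN) on a statement dumps the virtual-machine bytecode that the SQL compiles down to. Here’s a small sketch using Python’s built-in sqlite3 module (the exact opcodes you see will vary with your SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

# EXPLAIN returns the VDBE (virtual machine) bytecode that SQLite's
# SQL compiler generates for the statement: one row per instruction.
for addr, opcode, p1, p2, p3, p4, p5, comment in conn.execute(
        "EXPLAIN SELECT name FROM t WHERE id = 1"):
    print(addr, opcode, p1, p2, p3)
```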
Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.
But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.
Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that individuals don’t have to understand the system for them to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.
Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on our own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.
To circle back to biology: evolution is terrible at designing simple systems: I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable; they’re just baked into the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked into how we build and evolve systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at hand. We’re going to need them.
Let’s play another round where we contrast the root-cause-analysis (RCA) perspective with the resilience engineering (RE) perspective. Today’s edition is about the distribution of potentially incident-causing defects across the different components in the system. Here, I’m using RCA nomenclature, since the kinds of defects that an RCA advocate would refer to as a “cause” in the wake of an incident would be called a “contributor” by the RE folks.
Here’s a stylized view of the world from the RCA perspective:
RCA view of distribution of potential incident-causing defects in the system
Note that there are a few particularly problematic components: we should certainly focus on figuring out which of those components we should be spending our reliability efforts on improving!
Now let’s look at the RE perspective:
RE view of distribution of potential incident-contributing defects in the system
It’s a sea of red! The whole system is absolutely shot through with defects that could contribute to an incident!
Under the RE view, the individual defects aren’t sufficient to cause an incident. Instead, it’s an interaction of these defects with other things, including other defects. Because incidents arise due to interactions, RE types will stress the importance of understanding interactions across components over the details of the specific component that happened to contain the defect that contributed to the outage. After all, according to RE folks, those defects are absolutely everywhere. Focusing on one particular component won’t yield significant improvements under this model.
If you want to appreciate the RE perspective, you need to develop an understanding of how it can be that the system is up right now despite the fact that it is absolutely shot through with all of these potentially incident-causing defects, as the RCA folks would call them. As an RE type myself, I believe that your system is up right now, and that it already contains the defect that will be implicated in the next incident. After that incident happens, the tricky part isn’t identifying the defect, it’s appreciating how the defect alone wasn’t enough to bring the system down.
As I was writing up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it: how the responders brought the overloaded system back to a healthy state.
Full report for yesterday’s Google outage
– not feature flagging a new code path
– null pointer crashing a binary when reading empty fields from Spanner
– remediation causing a thundering herd as there was no exponential backoff
– which required manual throttling to recover from pic.twitter.com/6AQ4BQk5ex
Which brings me to the topic of this post: the “what went well” section of an incident write-up. Generally, public incident write-ups don’t have such sections. This is almost certainly for rational political reasons: it would be, well, gauche to recount to your angry customers what a great job you did handling the incident. However, internal write-ups often have such sections, and that’s my focus here.
In my experience, “What went well” is typically the shortest section in the entire incident report, with a few brief bullet points that point out some positive aspects of the response (e.g., people responded quickly). It’s a sort of way-to-go!, a way to express some positive feedback to the responders on a job well done. This is understandable, as people believe that if we focus more on what went wrong than what went well, then we are more likely to improve the system, because we are focusing on repairing problems. This is why “what went wrong” and “what can we do to fix it” takes the lion’s share of the attention.
But the problem with this perspective is that it misunderstands the skills that are brought to bear during incident response, and how learning from a previously well-handled incident can actually help other responders do better in future incidents. Effective incident response happens because the responders are skilled. But every incident response team is an ad-hoc one, and just because you happened to have people with the right set of skills responding last time, doesn’t mean you’ll have people with the right set of skills the next time. This means that if you gloss over what went well, your next incident might be even worse than the last one, because you’ve deprived those future responders of the opportunity to learn from observing the skilled responders last time.
To make this more concrete, let’s look back at the GCP incident report. In this scenario, the engineers had put in a red-button as a safety precaution and exercised it to remediate the incident.
As a safety precaution, this code change came with a red-button to turn off that particular policy serving path… Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place.
However, that’s not the part that interests me so much. Instead, it’s the part about how the infrastructure became overloaded as a consequence of the remediation, and how the responders recovered from overload.
Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure…. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.
This was not a failure scenario that they had explicitly designed for in advance of deploying the change: there was no red-button they could simply exercise to roll back the system to a non-overloaded state. Instead, they were forced to improvise a solution based on the controls that were available to them. In this case, they were able to reduce the load by turning down the rate of task creation, as well as by re-routing traffic away from the overloaded database.
And this sort of work is the really interesting bit of an incident: how skilled responders are able to take advantage of generic functionality that is available in order to remediate an unexpected failure mode. This is one of the topics that the field of resilience engineering focuses on: how incident responders are able to leverage generic capabilities during a crunch. If I were an engineer at Google in this org, I would be very interested to learn what knobs are available and how to twist them. Describing this in detail in an incident write-up will increase my chances of being able to leverage this knowledge later. Heck, even just leaving bread crumbs in the doc will help, because I’ll remember the incident, look up the write-up, and follow the links.
Another enormously useful “what went well” aspect that often gets short shrift is a description of the diagnostic work: how the responders figured out what was going on. This never shows up in public incident write-ups, because the information is too proprietary, so I don’t blame Google for not writing about how the responders determined the source of the overload. But all too often these details are left out of the internal write-ups as well. This sort of diagnostic work is a crucial set of skills for incident response, and having the opportunity to read about how experts applied their skills to solve this problem helps transfer these skills across the organization.
Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong. While every incident is different, the generic skills are common, and so getting better at response will get you more mileage than preventing repeats of previous incidents. You’re going to keep having incidents over and over. The best way to get better at incident handling is to handle more incidents yourself. The second best way is to watch experts handle incidents. The better you do at telling the stories of how your incidents were handled, the more people will learn about how to handle incidents.
On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.
Note that the questions I have shouldn’t be seen as a critique of the write-up, as the answers to the questions generally aren’t publicly shareable. They’re more “I wish I could be a fly on the wall inside of Google” questions.
Quick write-up
First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.
Staging land mines across regions
Note that impact (June 12) happened two weeks after deployment (May 29).
This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.
The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.
This is the land mine risk: when the code you are rolling out contains a land mine which is not tripped during the rollout.
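To make the pattern concrete, here’s a hypothetical sketch of how a land mine can ride along through a staged rollout. This is not Google’s actual code (we haven’t seen it); it just illustrates a new code path that only runs when a certain kind of policy row shows up, so every region looks healthy until a later data change arrives:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyRow:
    name: str
    quota_fields: Optional[dict] = None  # only new-style policy rows populate this

def check_policy(row: PolicyRow) -> bool:
    if row.quota_fields is None:
        # Old-style rows never reach the new code path below, so the
        # region-by-region rollout of this binary looks perfectly healthy.
        return True
    # Land mine: assumes a field that a future policy row may leave blank.
    return row.quota_fields["limit"] > 0

# During the staged rollout, only old-style rows exist in every region:
check_policy(PolicyRow("existing-policy"))                 # fine everywhere
# Weeks later, a policy change replicates globally within seconds:
check_policy(PolicyRow("new-policy", quota_fields={}))     # KeyError, everywhere at once
```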
How did the decisions make sense at the time?
I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.
The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.
This is the typical “we didn’t do X in this case and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.
If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.
Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.
How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.
Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].
As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.
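For reference, the behavior being described is the standard retry-with-jitter pattern: each retry waits exponentially longer, and the wait is randomized so that a herd of restarting tasks doesn’t hammer the backend in lockstep. A minimal sketch of what “randomized exponential backoff” usually means (illustrative only; I have no idea what Service Control’s internals look like):

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, base=0.5, cap=30.0):
    """Retry fn() with randomized ("full jitter") exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponentially growing ceiling, then a random sleep below it,
            # so thousands of restarting tasks don't retry in lockstep.
            ceiling = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, ceiling))
```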
Red buttons and feature flags
As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)
Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.
It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.
New feature: additional policy quota checks
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.
I have so many questions. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change that happened since the new feature was deployed, or had there been others? How frequently do policy changes happen?
Global data changes
Code changes are scary, config changes are scarier, and data changes are the scariest of them all.
Given the global nature of quota management, this metadata was replicated globally within seconds.
While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.
Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)
The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that.
What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?
Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?
Saturation!
From an operational perspective, I remain terrified of databases
Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.
Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal mode, and it’s generally very difficult to test for how it will behave when it’s in recovery mode.
Does this incident match my conjecture?
I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:
A manual intervention that was intended to mitigate a minor incident, or
Unexpected behavior of a subsystem whose primary purpose was to improve reliability
I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.
What are the tradeoffs of the corrective actions?
The write-up lists seven bullets of corrective actions. The questions I always have about corrective actions are:
What are the tradeoffs involved in implementing these corrective actions?
How might they enable new failure modes or make future incidents more difficult to deal with?
A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, a term they use to refer to a kind of opposite of resilience.
In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:
Decompensation: exhausting capacity to adapt as challenges cascade
Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)
Eric’s post demonstrates how all three of these patterns were evident within Amazon.
Decompensation
It would take weeks to get access to any internal data for analysis or experiments… Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.
If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands that are placed upon it.
In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.
As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.
Working at cross-purposes
Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.
This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.
The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.
My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.
Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s a risk of misalignment: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.
In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.
Getting stuck in outdated behaviors
Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.
That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.
So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.
…
This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.
I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.
In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.
I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.
Simplicity is prerequisite for reliability. — Edsger W. Dijkstra
Think about a system whose reliability had significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.
Think of a system where the reliability trend looks like this
Now, for the system you thought of whose reliability increased over time, think about what the complexity trend looks like over that same period. I’d wager you’d see a similar sort of trend.
My claim about what the complexity trend looks like over time
Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. As someone who grew up in the 80s and 90s, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.
Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one that most people have in mind when they think about increasing complexity in their system, is synonymous with the idea of tech debt. With tech debt the increase in complexity makes the system less reliable, because the risk of making a breaking change in the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely that you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.
Dijkstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.
reliability ⇒ simplicity
The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability, but those features come at the cost of increased complexity.
reliability ⇒ complexity
Look at classic works on improving the reliability of real-world systems like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It? and think about the work that we do to make our software systems more reliable, functionality like retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and auxiliary systems we have to support our reliability work like an observability stack. All of this stuff adds complexity.
Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion of code would make the codebase simpler. There’s a reason that programming books tend to avoid error handling cases in their examples: error handling does increase complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.
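To see how quickly that complexity shows up, compare a happy-path-only version of a function with the version a reliable system ends up with. This is a hedged sketch against a hypothetical HTTP service (the URL and library choice are illustrative, not drawn from any of the books above):

```python
import time
import requests  # hypothetical dependency, used here only for illustration

# The version a programming book would show: simple, easy to reason about.
def get_user(user_id):
    return requests.get(f"https://example.com/users/{user_id}").json()

# The version a reliable system accretes: timeouts, status checks, retries,
# backoff. Every one of these lines is complexity added in the name of
# reliability, and deleting them would make the code simpler but less reliable.
def get_user_reliable(user_id, attempts=3, timeout=2.0):
    last_error = None
    for attempt in range(attempts):
        try:
            resp = requests.get(f"https://example.com/users/{user_id}",
                                timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err
            time.sleep(2 ** attempt)  # crude exponential backoff between attempts
    raise last_error
```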
Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems which can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.
Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.