Brief thoughts on the recent Cloudflare outage

I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their public writeup and capture my thoughts until now. As always, I recommend you read through the writeup first before you read my take.

All quotes are from the writeup unless indicated otherwise.

Hello saturation my old friend

The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

One thing I hope readers take away from this blog post is the complex systems failure mode pattern that resilience engineering researchers call saturation. Every complex system out there has limits, no matter how robust that system is. And the systems we deal with have many, many different kinds of limits, some of which you might only learn about once you’ve breached that limit. How well a system is able to perform as it approaches one of its limits is what resilience engineering is all about.

Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.

In this particular case, the limit was set explicitly.

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

As sparse as the panic message is, it does explicitly tell you that the problematic call site was an unwrap call. And this is one of the reasons I’m a fan of explicit limits over implicit limits: you tend to get better error messages than when breaching an implicit limit (e.g., of your language runtime, the operating system, the hardware).
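To make that concrete, here is a minimal Rust sketch of an explicit limit check and the two ways its Err can surface: as a panic via unwrap (which is what happened here, and at least names the call site), or as a handled, descriptive error. This is entirely hypothetical code of my own — the names, the error type, and the loading logic are all invented — not Cloudflare's actual implementation.

```rust
// Hypothetical sketch (not Cloudflare's actual code): an explicit, named limit
// on the number of features, and what happens when it is breached.

const MAX_FEATURES: usize = 200; // explicit limit, well above expected use (~60)

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { found: usize, limit: usize },
}

fn load_features(lines: &[String]) -> Result<Vec<String>, ConfigError> {
    if lines.len() > MAX_FEATURES {
        // Breaching the explicit limit produces an error that names the limit
        // and the observed value, which is easier to act on than
        // "called `Result::unwrap()` on an `Err` value".
        return Err(ConfigError::TooManyFeatures {
            found: lines.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(lines.to_vec())
}

fn main() {
    // A feature file that has doubled in size past the limit.
    let too_many: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();

    // The failure mode described in the writeup: unwrap() turns the Err into a panic.
    // let features = load_features(&too_many).unwrap(); // would panic the worker thread

    // Handling the Err keeps the process alive and logs something diagnosable.
    match load_features(&too_many) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("refusing to load feature file: {e:?}"),
    }
}
```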

A subsystem designed to protect surprisingly inflicts harm

Identify and mitigate automated traffic to protect your domain from bad bots. – Cloudflare Docs

The problematic behavior was in the Cloudflare Bot Management system. Specifically, it was in the bot scoring functionality, which estimates the likelihood that a request came from a bot rather than a human.

This is a system that is designed to help protect their customers from malicious bots, and yet in this case it ended up hurting their customers rather than helping them.

As I’ve mentioned previously, once your system achieves a certain level of reliability, it’s the protective subsystems that end up biting you! These subsystems are a net positive: they help much more than they hurt. But they also add complexity, and complexity introduces new, confusing failure modes into the system.

The Cloudflare case is a more interesting one than the typical instances of this behavior I’ve seen, because Cloudflare’s whole business model is to offer different kinds of protection, as products for their customers. It’s protection-as-a-service, not an internal system for self-protection. But even though their customers are purchasing this from a vendor rather than building it in-house, it’s still an auxiliary system intended to improve reliability and security.

Confusion in the moment

What impressed me the most about this writeup is that they documented some aspects of what it was like responding to this incident: what they were seeing, and how they tried to make sense of it.

In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:

Man, if I had a nickel for every time I saw someone Slack “Is it DDoS?” in response to a surprising surge of errors returned by the system, I could probably retire at this point.

The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.

We humans are excellent at recognizing patterns based on our experience, and that generally serves us well during incidents. Someone who is really good at operations can frequently diagnose the problem very quickly just by, say, the shape of a particular graph on a dashboard, or by seeing a specific symptom and recalling similar failures that happened recently.

However, sometimes we encounter a failure mode that we haven’t seen before, which means that we don’t recognize the signals. Or we might have recently seen a cluster of problems that followed a certain pattern, and assume that the latest one fits that same pattern. And these are the hard ones.

This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack. 

This incident was one of those hard ones: the symptoms were confusing. The “problem went away, then came back, then went away again, then came back again” type of unstable incident behavior is generally much harder to diagnose than one where the symptoms are stable.

Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page.

Here they got bit by a co-incident, an unrelated failure of their status page that led them to believe (reasonably!) that the problem must have been external.

I’m still curious as to what happened with their status page. The error message they were getting mentions CloudFront, so I assume they were hosting their status page on AWS. But their writeup doesn’t go into any additional detail on what the status page failure mode was.

But the general takeaway here is that even the most experienced operators are going to take longer to deal with a complex, novel failure mode, precisely because it is complex and novel! As the resilience engineering folks say, prepare to be surprised! (Because I promise, it’s going to happen).

A plea: assume local rationality

The writeup included a screenshot of the code that had an unhandled error. Unfortunately, there’s nothing in the writeup that tells us what the programmer was thinking when they wrote that code.

In the absence of any additional information, a natural human reaction is to just assume that the programmer was sloppy. But if you want to understand how these sorts of incidents actually happen, you have to fight this reaction.

People always make decisions that make sense to them in the moment, based on what they know and what constraints they are operating under. After all, if that wasn’t true, then they wouldn’t have made that decision. The only way we can actually understand the conditions that enable incidents is to try as hard as we can to put ourselves into the shoes of the person who made that call, to understand what their frame of mind was at the moment.

If we don’t do that, we risk the problem of distancing through differencing. We say, “oh, those devs were bozos, I would never have made that kind of mistake”. This is a great way to limit how much you can learn from an incident.

Detailed public writeups as evidence of good engineering

The writeup produced by Cloudflare (signed by the CEO, no less!) was impressively detailed. It even includes a screenshot of a snippet of code that contributed to the incident! I can’t recall ever reading another public writeup with that level of detail.

Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare using the details they provided as the basis for their criticism.

Now, I think it would advance our industry if people held the opposite view: the more detail that is provided in an incident writeup, the higher the esteem in which we should hold that organization. I respect Cloudflare as an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there; I want to see us hold other companies to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.

You’ll never see attrition referenced in an RCA

In the wake of the recent AWS us-east-1 outage, I saw speculation online about how the departure of experienced engineers played a role in the outage. The most notable example was from the acerbic cloud economist Corey Quinn, in a column he wrote for The Register: Amazon brain drain finally sent AWS down the spout. Amazon’s recent announcement that it will be laying off about 14,000 employees, which includes cuts to AWS, has added fuel to that fire, as I saw in a LinkedIn post by Java luminary and former AWS-er James Gosling that referenced another speculative column on the subject, Amazon Just Proved AI Isn’t The Answer Yet Again. I’m not going to comment on the accuracy of these assessments, or more broadly on the role that attrition played in this particular incident, because I don’t have any special knowledge here. Instead, I want to use this as an opportunity to talk about the relationship between attrition and incidents, and how that relationship is captured in incident write-ups, both public and internal.

In a public incident write-up, or an RCA provided by a vendor to a customer, you’re never going to see any discussion of the role of attrition. This is because, as noted by John Allspaw in his post What makes public posts about incidents different from analysis write-ups, the purpose of a public write-up is to reassure the audience that the problem that caused the incident is being addressed. This means that the write-up will focus on describing a technical problem and alluding to the technical solution being put in place to fix it. Attrition isn’t a technical problem; it’s a completely different type of phenomenon. And, as we’ve seen with the recent Amazon layoff announcement, attrition is sometimes an explicit business decision. If a company like Amazon mentioned attrition in a public write-up, it would be much more difficult to answer a question like “how will your upcoming layoff increase the risk of incidents?” There’s no plausible deniability (“it won’t increase the risk of incidents”) if you’ve previously talked about attrition in a public write-up. Because talking about attrition doesn’t fulfill the confidence-building role of the write-up, it’s never going to find its way into a document intended for outsiders.

Internal incident write-ups serve a different purpose, and so they don’t have this problem. Indeed, in my own career, I have seen references to the departure of expertise in internal incident write-ups. The first example that comes to mind is the hot potato scenario: there’s a critical service whose original authors are no longer at the company, and the team that originally owned it no longer exists, so another team becomes responsible for operating that service, even though they don’t have deep knowledge of how it actually works, and the service is so reliable that the team that now owns it never accumulates operational experience with it. I would wager that every tech company of a certain size has seen this pattern. I’ve also frequently heard discussion of bus factor, which is an explicit reference to attrition risk.

Still, while referencing attrition isn’t a taboo in an internal incident write-up the way it is in a public incident write-up, you’re still not likely to see the topic discussed there. Internal incident write-ups take a narrow view of system failures, focusing on technical details. I wrote a blog post several years ago titled What’s allowed to count as a cause?, and attrition is an example of an issue that falls squarely in the “not allowed to count” category.

Now, you might say, “Lorin, this is exactly why five whys is good, so we can zoom out to identify systemic issues.” My response would be, “attrition is never going to be the sole reason for a failure in a complex system, and identifying only attrition as a factor is just as bad as identifying a different factor and neglecting attrition, because you’re missing so much.” I think of the role of attrition as a contributor to incidents the way that smoking is a contributor to lung cancer, or that climate change is a contributor to severe weather events. It isn’t possible to attribute a particular incidence of lung cancer to smoking, or a particular severe storm to climate change: smoking is neither necessary nor sufficient for lung cancer, and climate change is neither necessary nor sufficient for a particular storm to be severe. But as with attrition, smoking and climate change are factors that increase risk. If you use a root cause analysis approach to understanding incidents, you’ll miss the role of contributing factors like attrition.

I would go so far as to say that organizational factors play a role in every major incident; attrition is just one example of an organizational factor. The fact that these don’t appear in the write-up says more about the questions that people didn’t ask than it does about the nature of the incident.

Quick thoughts on the recent AWS outage

AWS recently posted a public write-up of the us-east-1 incident that hit them this past Monday. Here are a couple of quick thoughts on it.

Reliability → Automation → Complexity → New failure modes

Our industry addresses reliability problems by adding automation so that the system can handle faults automatically. But here’s the thing: adding this sort of automation increases the complexity in the system. This increase in complexity due to more sophisticated automation brings two costs along with it. One cost is that the behavior of the system becomes more difficult to reason about. This is the “what is it currently doing, and why is it doing that?” problem that we operators face. The second cost of the increased complexity is that, while this automation eliminates a known class of failure modes, it simultaneously introduces a new class of failure modes. These new failure modes occur much less frequently than the class of failure modes that were eliminated, but when they do occur, they are potentially much more severe.

According to Amazon’s write-up, the triggering event was the unintentional deletion of DNS records related to the DynamoDB service due to a race condition. Even though DNS records were fully restored by 2:25 AM PDT, it wasn’t until 3:01 PM, over twelve and a half hours later, that Amazon declared that all AWS services had been fully restored.

There were multiple issues that complicated the restoration of different AWS services, but the one I want to call out here involved the Network Load Balancer (NLB) service. Delays in the propagation of network state information led to false health check failures: there were EC2 instances that were healthy, but that the NLB categorized as unhealthy because of the network state issue. From the report:

During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated. This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.

This pathological health check behavior led to availability zone DNS failovers, which reduced capacity and led to connection errors.

The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.

Health checks are a classic example of an automation system that is designed to improve reliability. It’s not uncommon for an instance to go unhealthy for some reason, and being able to automatically detect when that happens and take the instance out of the load balancer means that your system can automatically handle failures in individual instances. But, as we see in this case, the presence of this reliability-improving automation made a particular problem (delays in network state propagation) even worse.

As a result of this incident, Amazon is going to change the behavior of the NLB logic in the case of health check failures.

For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover.

Note that this is yet another increase in automation complexity with the goal of improving reliability! That doesn’t mean that this is a bad corrective action, or that health checks are bad. Instead, my point here is that adding automation complexity to improve reliability always involves a trade-off. It’s very easy to forget about that trade-off if you focus only on the existing reliability problem you’re trying to tackle and never consider what new reliability problems you are introducing. Even if those new problems are rare, they can be extremely painful, as AWS can attest.
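To make the idea of a velocity control concrete, here is a rough sketch of what such a guard might look like. This is my own illustration with invented names and thresholds, written as a simple cap on how much capacity the automation may remove rather than a true rate limit, and it is not AWS’s actual NLB logic.

```rust
// Hypothetical sketch of a "velocity control" guard (not AWS's actual NLB logic):
// cap the fraction of capacity that automated health-check failures are allowed
// to remove, and hand anything beyond that back to a human.

struct FleetState {
    total_targets: usize,
    removed_targets: usize,
}

// Assumed threshold: automation may remove at most 20% of capacity on its own.
const MAX_AUTO_REMOVAL_FRACTION: f64 = 0.20;

fn may_auto_remove(fleet: &FleetState) -> bool {
    let removed_fraction = (fleet.removed_targets + 1) as f64 / fleet.total_targets as f64;
    removed_fraction <= MAX_AUTO_REMOVAL_FRACTION
}

fn on_health_check_failure(fleet: &mut FleetState, target: &str) {
    if may_auto_remove(fleet) {
        fleet.removed_targets += 1;
        println!("removing {target} from service (failed health check)");
    } else {
        // Beyond the limit, stop trusting the automation: the failures may
        // reflect a broken health-check path rather than broken targets.
        println!("NOT removing {target}: auto-removal limit reached, paging operator");
    }
}

fn main() {
    let mut fleet = FleetState { total_targets: 10, removed_targets: 0 };
    for i in 0..5 {
        on_health_check_failure(&mut fleet, &format!("target-{i}"));
    }
}
```

The design choice worth noticing is that the automation is allowed to act quickly within a bounded envelope, and beyond that envelope the decision goes back to a human.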

I’ve written previously about failures due to reliability-improving automation. The other examples from my linked post are also from AWS incidents, but this phenomenon is in no way specific to AWS.

Surprise should not be surprising

Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with [the DropletWorkflow Manager] without causing further issues.

The Amazon engineers didn’t have a runbook to handle this failure scenario, which meant that they had to improvise a recovery strategy during incident response. This is a recurring theme in large-scale incidents: they involve failures that nobody had previously anticipated. The only thing we can really predict about future high-severity incidents is that they are going to surprise us. We are going to keep encountering failure modes we never anticipated, over and over again.

It’s tempting to focus your reliability engineering resources on reducing the risk of known failure modes. But if you only prepare for the failure scenarios that you can think of, then you aren’t putting yourself in a better position to deal with the inevitable situation that you never imagined would ever happen. And the fact that you’re investing in reliability-improving-but-complexity-increasing automation means that you are planting the seeds of those future surprising failure modes.

This means that if you want to improve reliability, you need to invest in both the complexity-increasing reliability automation (robustness), and also in the capacity to be able to better deal with future surprises (resilience). The resilience engineering researcher David Woods uses the term net adaptive value to describe the ability of a system to deal with both predicted failure modes, and to adapt to effectively unpredicted failure modes.

Part of investing in resilience means building human-controllable leverage points so that engineers have a broad range of mitigation actions available to them during future incidents. That could mean having additional capacity on hand that you can throw at the problem, as well as having built in various knobs and switches. As an example from this AWS incident, part of the engineers’ response was to manually disable the health check behavior.

At 9:36 AM, engineers disabled automatic health check failovers for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors to affected load balancers.

But having these sorts of knobs available isn’t enough. You need your responders to have the operational expertise necessary to know when to use them. More generally, if you want to get better at dealing with unforeseen failure modes, you need to invest in improving operational expertise, so that your incident responders are best positioned to make sense of the system behavior when faced with a completely novel situation.
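As a toy illustration of such a knob (hypothetical, and not AWS’s actual implementation), here is roughly what a kill switch for automatic health-check failover might look like:

```rust
// Hypothetical sketch of a human-controllable "knob": a kill switch that lets
// operators disable automatic health-check failover entirely, which is
// effectively what the AWS engineers did during this incident.

use std::sync::atomic::{AtomicBool, Ordering};

static AUTO_FAILOVER_ENABLED: AtomicBool = AtomicBool::new(true);

fn handle_failed_health_check(target: &str) {
    if AUTO_FAILOVER_ENABLED.load(Ordering::Relaxed) {
        println!("auto-failover: removing {target} from service");
    } else {
        // With the knob flipped, the automation observes but does not act.
        println!("auto-failover disabled by operator; leaving {target} in service");
    }
}

fn main() {
    handle_failed_health_check("target-1");
    // An operator flips the switch mid-incident (e.g., via a config push or admin API).
    AUTO_FAILOVER_ENABLED.store(false, Ordering::Relaxed);
    handle_failed_health_check("target-2");
}
```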

The AWS write-up focuses on the robustness improvements, the work they are going to do to be better prepared to prevent a similar failure mode from happening in the future. But I can confidently predict that the next large-scale AWS outage is going to look very different from this one (although it will probably involve us-east-1). It’s not clear to me from the write-up that Amazon has learned the lesson of how important it is to prepare to be surprised.

Caveat promptor

In the wake of a major incident, you’ll occasionally hear a leader admonish the engineering organization that we need to be more careful in order to prevent such incidents from happening in the future. Ultimately, these sorts of admonishments don’t help improve reliability, because they miss an essential truth about the nature of work in organizations.

One of the big ideas from resilience engineering is the efficiency-thoroughness trade-off, also known as the ETTO Principle. The ETTO principle was first articulated by Erik Hollnagel, one of the founders of the field. The idea is that there’s a fundamental trade-off between how quickly we can complete tasks, and how thorough we can be when working on each individual task. Let’s consider the work of doing software development using AI agents through the lens of the ETTO principle.

Coding agents like Claude Code and OpenAI’s Codex are capable of automatically generating significant amounts of code. Honestly, it’s astonishing what these tools are capable of today. But like all LLMs, while they will always generate plausible-looking output, they do not always generate correct output. This means that a human needs to check an AI agent’s work to ensure that it’s generating code that’s up to snuff: a human has to review the code generated by the agent.

Screenshot of asking Claude about coding mistakes. Note the permanent warning at the bottom.

As any human software engineer will tell you, reviewing code is hard. It takes effort to understand code that you didn’t write. And larger changes are harder to review, which means that the more work that the agent does, the more work the human in the loop has to do to verify it.

If the code compiles and runs and all tests pass, how much time should the human spend on reviewing it? The ETTO principle tells us there’s a trade-off here: the incentives push us software engineers towards completing our development tasks more quickly, which is why we’re all adopting AI in the first place. After all, if it ends up taking just as long to review the AI-generated code as it would have for the human reviewer to write it from scratch, then that defeats the purpose of automating the development task to begin with.

Maybe at first we’re skeptical and we spend more time reviewing the agent’s code. But, as we get better at working with the agents, and as the AI models themselves get better over time, we’ll figure out where the trouble spots of AI-generated code tend to pop up, and we’ll focus our code review effort accordingly. In essence, we’re riding the ETTO trade-off curve by figuring out how much review effort we should be putting in and where that effort should go.

Eventually, though, a problem with AI-generated code will slip through this human review process and will contribute to an incident. In the wake of this incident, the software engineers will be reminded that AI agents can make mistakes, and that they need to carefully review the generated code. But, as always, such reminders will do nothing to improve reliability. Because, while AI agents change the way that software developers work, they don’t eliminate the efficiency-thoroughness trade-off.

A statistic is as a statistic does

(With apologies to the screenwriters of Forrest Gump)

I’m going to use this post to pull together some related threads from different sources I’ve been reading lately.

Rationalization as discarding information

The first thread is from The Control Revolution by the late American historian and sociologist James Beniger, which was published back in the 1980s: I discovered this book because it was referenced in Neil Postman’s Technopoly.

Beniger references Max Weber’s concept of rationalization, which I had never heard of before. I’m used to the term “rationalization” as a pejorative term meaning something like “convincing yourself that your emotionally preferred option is the most rational option”, but that’s not how Weber meant it. Here’s Beniger, emphasis mine (from p15):

Although [rationalization] has a variety of meanings … most definitions are subsumed by one essential idea: control can be increased not only by increasing the capacity to process information but also by decreasing the amount of information to be processed.

In short, rationalization might be defined as the destruction or ignoring of information in order to facilitate its processing.

This idea of rationalization feels very close to James Scott’s idea of legibility, where organizations depend on simplified models of the system in order to manage it.

Decision making: humans versus statistical models

The second thread is from Benjamin Recht, a professor of computer science at UC Berkeley who does research in machine learning. Recht wrote a blog post recently called The Actuary’s Final Word about the performance of algorithms versus human experts on performing tasks such as medical diagnosis. The late American psychology professor Paul Meehl argued back in the 1950s that the research literature showed that statistical models outperformed human doctors when it came to diagnosing medical conditions. Meehl’s work even inspired the psychologist Daniel Kahneman, who famously studied heuristics and biases.

In his post, Recht asks, “what gives?” If we have known since the 1950s that statistical models do better than human experts, why do we still rely on human experts? Recht’s answer is that Meehl is cheating: he’s framing diagnostic problems as statistical ones.

Meehl’s argument is a trick. He builds a rigorous theory scaffolding to define a decision problem, but this deceptively makes the problem one where the actuarial tables will always be better. He first insists the decision problem be explicitly machine-legible. It must have a small number of precisely defined actions or outcomes. The actuarial method must be able to process the same data as the clinician. This narrows down the set of problems to those that are computable. We box people into working in the world of machines.

This trick fixes the game: if all that matters is statistical outcomes, then you’d better make decisions using statistical methods.

Once you frame a problem as being statistical in nature, then a statistical solution will be the optimal one, by definition. But, Recht argues, it’s not obvious that we should be using the average of the machine-legible outcomes in order to do our evaluation. As Recht puts it:

How we evaluate decisions determines which methods are best. That we should be trying to maximize the mean value of some clunky, quantized, performance indicator is not normatively determined. We don’t have to evaluate individual decisions by crude artificial averages. But if we do, the actuary will indeed, as Meehl dourly insists, have the final word.

Statistical averages and safe self-driving cars

I had Recht’s post in mind when reading Philip Koopman’s new book Embodied AI Safety. Koopman is Professor Emeritus of Electrical Engineering at Carnegie-Mellon University; he’s a safety researcher who specializes in automotive safety. (I first learned about him from his work on the Toyota unintended acceleration cases from about ten years ago.)

I’ve just started his book, but these lines from the preface jumped out at me (emphasis mine):

In this book, I consider what happens once you … come to realize there is a lot more to safety than low enough statistical rates of harm.

[W]e have seen numerous incidents and even some loss events take place that illustrate “safer than human” as a statistical average does not provide everything that stakeholders will expect from an acceptably safe system. From blocking firetrucks, to a robotaxi tragically “forgetting” that it had just run over a pedestrian, to rashes of problems at emergency response scenes, real-world incidents have illustrated that a claim of significantly fewer crashes than human drivers does not put the safety question to rest.

More numbers than you can count

I’m also reading The Annotated Turing by Charles Petzold. I had tried to read Alan Turing’s original paper where he introduced the Turing machine, but found it difficult to understand, and Petzold provides a guided tour through the paper, which is exactly what I was looking for.

I’m currently in Chapter 2, where Petzold discusses the German mathematician Georg Cantor’s famous result that the real numbers are not countable: the size of the set of real numbers is larger than the size of the set of natural numbers. (In particular, it’s the set of transcendental numbers, which includes π and e, that isn’t countable: the algebraic real numbers, like √2, can actually be counted.)
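To restate the argument compactly (my own paraphrase of the result Petzold walks through, where A denotes the set of algebraic real numbers):

```latex
% The algebraic reals are countable, the reals are not, and a union of two
% countable sets is countable, so the transcendentals must be uncountable:
|A| = \aleph_0, \qquad |\mathbb{R}| = 2^{\aleph_0} > \aleph_0,
\qquad \mathbb{R} = A \cup (\mathbb{R} \setminus A)
\;\Longrightarrow\; |\mathbb{R} \setminus A| = 2^{\aleph_0}
```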

To tie this back to the original thread: rationalization feels to me like the process of focusing on only the algebraic numbers (which include the integers and rational numbers), even though most of the real numbers are transcendental.

Ignoring the messy stuff is tempting because it makes analyzing what’s left much easier. But we can’t forget that our end goal isn’t to simplify analysis, it’s to achieve insight. And that’s exactly why you don’t want to throw away the messy stuff.

Fixation: the ever-present risk during incident handling

Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.

Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, when in fact the shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.

The search for the keys

During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).

Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.

The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p13, PDF p48):

By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.

If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 PM, all of the way until 12:45 PM.

The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:

At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.

Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):

Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)

Fixation

In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communication-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):

The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.

More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.

The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but searching for keys to the exclusion of other activities.

During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.

Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.

Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.

The hidden trade-offs of fine-grained progressive rollouts

A progressive rollout refers to the act of rolling out some new functionality gradually rather than all at once. This means that, when you initially deploy it, the change only impacts a fraction of your users. The idea behind a progressive rollout is to reduce the risk of a deployment by reducing the blast radius: if something goes wrong with the new thing during deployment, then the impact is much smaller than if you had deployed it all-at-once, to all of the traffic.

The impact of a bad rollout is shown in red

There are two general strategies for doing a progressive rollout. One strategy is coarse grained, where you stage your deploys across domains. For example, deploying the new functionality to one geographic region at a time. The second strategy is more fine-grained, where you define a ramp up schedule (e.g., 1% of traffic to the new thing, then 5%, then 10%, etc.).

Note that the two strategies aren’t mutually exclusive: you can stage your deploy across regions, and within each region you can do a fine-grained ramp-up. And you can also think of it as a spectrum rather than two separate categories, since you can control the granularity. But I make the distinction here because I want to talk specifically about the fine-grained approach, where we use a ramp.
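For concreteness, here is a minimal sketch of one common way to implement a fine-grained ramp (my own illustration, with made-up names, not any particular vendor’s mechanism): hash each user into a stable bucket and compare the bucket against the current rollout percentage.

```rust
// Hypothetical sketch of fine-grained ramp-up via deterministic bucketing.
// Each user hashes to a bucket in [0, 100); raising the rollout percentage
// only ever adds users, so the same user stays in the new-code cohort as
// the ramp proceeds.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn bucket_for(user_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    hasher.finish() % 100
}

fn in_rollout(user_id: &str, rollout_percent: u64) -> bool {
    bucket_for(user_id) < rollout_percent
}

fn main() {
    // A ramp schedule: 1% -> 5% -> 10% -> 50% -> 100% of traffic.
    let schedule = [1u64, 5, 10, 50, 100];
    let user = "user-12345";
    for pct in schedule {
        // Tagging requests and metrics with the cohort ("new" vs "old") is what
        // later lets you slice dashboards by the rollout dimension.
        let cohort = if in_rollout(user, pct) { "new" } else { "old" };
        println!("at {pct}% rollout, {user} is in the {cohort} cohort");
    }
}
```

Tagging telemetry with the resulting cohort is also what makes it possible to slice dashboards by the rollout dimension afterwards, a point I come back to below.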

The ramp is clearly superior if you’re able to detect a problem during deployment, as shown in the diagram above. It’s a real win if you have automation that can automatically detect problems based on a metric like error rate. The problem with the ramp is the scenario where you don’t detect that there’s a problem with the deployment.

My claim in this post is that if you don’t detect a problem with a fine-grained progressive rollout until after the rollout has completed, then it will tend to take you longer to diagnose what the problem is:

Paradoxically, progressive rollout can increase the blast radius by making after-the-fact diagnosis harder

Here’s my argument: once you know something is wrong with your system, but you don’t know what it is that has gone wrong, one of the things you’ll do is look at dashboard graphs for a signal that identifies when the problem started, such as an increase in error rate or request latency. When you do a fine-grained progressive rollout, if something has gone wrong, then the impact will get smeared out over time, and it will be harder to identify the rollout as the relevant change by looking at a dashboard. If you’re lucky, your observability tools will let you slice on the rollout dimension. This is why I like coarse-grained rollouts: if you have explicit deployment domains like geographical regions, then your observability tools will almost certainly let you slice the data based on those. Heck, you should have existing dashboards that already slice on them. But for fine-grained rollouts, you may not think to slice on a particular rollout dimension (especially if you’re rolling out a bunch of things at once, all of them doing fine-grained deployments), and you might not even be able to.

Whether fine-grained rollouts are a net win depends on a number of factors whose values are not obvious, including:

  • the probability you detect a problem during the rollout vs after the rollout
  • how much longer it takes to diagnose the problem if not caught during rollout
  • your cost model for an incident

On the third bullet: the above diagram implicitly assumes that impact to the business is linear with respect to time. However, it might be non-linear: an hour-long incident may turn out to be more than twice as expensive as a half-hour-long incident.

As someone who works in the reliability space, I’m acutely aware of the pain of incidents that take a long time to mitigate because they are difficult to diagnose. But I think the trade-offs of fine-grained progressive rollouts are generally not recognized as such: it’s easy to imagine the benefits when problems are caught earlier, and harder to imagine the scenarios where the problem isn’t caught until later, and how much harder things get because of it.

Nothing fails like a history of success

The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking

Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multi-hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):

Specifically, network engineers were performing a cutover to a new network switch at Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.

This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):

Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.

It also reminded me about the 2022 Rogers Telecommunications outage in Canada (emphasis mine, [redacted] comments in the original):

Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.

Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.
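To illustrate how that heuristic can drift, here is a toy sketch of a risk-assessment rule that discounts risk based on prior successes. The formula and thresholds are invented for illustration; the Rogers report describes such an algorithm but doesn’t publish it, so this is not their actual method.

```rust
// Hypothetical sketch: each prior successful execution of a similar change
// discounts the assessed risk, so a long streak of successes can drive a
// "High" risk change down to "Low" even though the underlying change is
// just as capable of failing as it ever was.

#[derive(Debug)]
enum RiskLevel {
    High,
    Medium,
    Low,
}

fn assessed_risk(base_score: f64, prior_successes: u32) -> RiskLevel {
    // Assumed discounting rule: 20% off per prior success, floored at 25% of base.
    let discount = 0.8f64.powi(prior_successes as i32);
    let score = (base_score * discount).max(base_score * 0.25);
    if score >= 7.0 {
        RiskLevel::High
    } else if score >= 4.0 {
        RiskLevel::Medium
    } else {
        RiskLevel::Low
    }
}

fn main() {
    // The same change, assessed before each phase of a multi-phase rollout.
    for successes in 0..8 {
        let level = assessed_risk(9.0, successes);
        println!("after {successes} successful cutovers: {level:?}");
    }
}
```

After a handful of successful cutovers, the same change scores “Low” even though nothing about the change itself has become any safer.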

So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.

Easy will always trump simple

One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increased fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”

One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.

Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea of DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.

We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, whereas easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, while easy is relative to what the software designer already knows.

I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilt the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.

Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at-hand. Richard Hipp implemented SQLite v1 as a compiler that would translate SQL into byte code, because he had previous experience building compilers but not database engines. The second example is about an exaptation, leveraging an implementation that was at-hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5.)

Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.

But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.

Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that individuals don’t have to understand the system for them to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.

Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on our own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.

To circle back to biology: evolution is terrible at designing simple systems: I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable; they’re just baked into the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked into how we build and evolve systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at hand. We’re going to need them.

Cloudflare and the infinite sadness of migrations

(With apologies to The Smashing Pumpkins)

A few weeks ago, Cloudflare experienced a major outage of their popular 1.1.1.1 public DNS resolver.

On July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.

Cloudflare (@cloudflare.social) 2025-07-16T03:45:10.209Z

Technically, the DNS resolver itself was working just fine: it was (as far as I’m aware) up and running the whole time. The problem was that nobody on the Internet could actually reach it. The Cloudflare public write-up is quite detailed, and I’m not going to summarize it here. I do want to bring up one aspect of their incident, because it’s something I worry about a lot from a reliability perspective: migrations.

Cloudflare’s migration

When this incident struck, Cloudflare supported two different ways of managing what they call service topologies. There was a newer system that supported progressive rollout, and an older system where the changes occurred globally. The Cloudflare incident involved the legacy system, which makes changes globally, and that is why the blast radius of this incident was so large.

Source: https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/

Cloudflare engineers were clearly aware that these sorts of global changes are dangerous. After all, I’m sure that’s one of the reasons why they built their new system in the first place. But migrating all of the way to the new thing takes time.

Migrations and why I worry about them

If you’ve ever worked at any sort of company that isn’t a startup, you’ve had to deal with a migration. Sometimes a migration impacts only a single team that owns the system in question, but often migrations are changes that are large in scope (typically touching many teams) which, while providing new capabilities to the organization as a whole, don’t provide much short-term benefit to the teams who have to make a change to accommodate the migration.

A migration is a kind of change that, almost by definition, the system wasn’t originally designed to accommodate. We build our systems to support making certain types of future changes, and migrations are exactly not these kinds of changes. Each migration is typically a one-off type of change. While you’ll see many migrations if you work at a more mature tech company, each one will be different enough that you won’t be able to leverage common tooling from one migration to help make the next one easier.

All of this adds up to reliability risk. While a migration-related change wasn’t a factor in the Cloudflare incident, I believe that such changes are inherently risky, because you’re making a one-off change to the way that your system works. Developers generally have a sense that these sorts of changes are risky. As a consequence, for an individual on a team who has to do work to support somebody else’s migration, all of the incentives push them towards dragging their feet: making the migration-related change takes time away from their normal work, and increases the risk they break something. On the other hand, completing the migration generally doesn’t provide them short-term benefit. The costs typically outweigh the benefits. And so all of the forces push towards migrations taking a long time.

But a delay in implementing a migration is also a reliability risk, since migrations are often used to improve the reliability of the system. The Cloudflare incident is a perfect example of this: the newer system was safer than the old one, because it supported staged rollout. And while they ran the new system, they had to run the old one as well.

Why run one system when you can run two?

The scariest type of migration to me is the big bang migration, where you cut over all at once from the old system to the new one. Sometimes you have no choice, but it’s an approach that I personally would avoid whenever possible. The alternative is to do incremental migration, migrating parts of the system over time. To do incremental migration, you need to run the old system and the new system concurrently, until you’ve completely finished the migration and can shut the old system down. When I worked at Netflix, people used the term Roman riding to refer to running the old and new system in parallel, in reference to a style of horseback riding.

What actual Roman riding looks like

The problem with Roman riding is that it’s risky as well. While incremental is safer than big bang, running two systems concurrently increases the complexity of the system. There are many, many opportunities for incidents while you’re in the midst of a migration running the two systems in parallel.

What is to be done?

I wish I had a simple answer here. But my unsatisfying one is that engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores. I frequently joke that platform engineering should really be called migration engineering, because any org large enough to do platform engineering is going to be spending a lot of its cycles doing migrations.

Migrations are also unglamorous work: nobody’s clamoring for the title of migration engineer. People want to work on greenfield projects, not deal with the toil of a one-off effort to move the legacy thing onto the new thing. There’s also not a ton written on doing migrations. A notable exception is (fellow TLA+ enthusiast) Marianne Bellotti’s book Kill It With Fire, which sits on my bookshelf, and which I really should re-read.

I’ll end this post with some text from the “Remediation and follow-up steps” of the Cloudflare writeup:

We are implementing the following plan as a result of this incident:

Staging Addressing Deployments: Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

Deprecating Legacy Systems: We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.

I’m sure they’ll prioritize this particular migration because of the attention garnered on it from this incident. But I also bet there are a whole lot more in-flight migrations at Cloudflare, as well as at other companies, that increase complexity through maintaining two systems and delaying moving to the safer thing. What are they actually going to do in order to complete those other migrations more quickly? If it was easy, it would already be done.