Another way to rate incidents

Every organization I’m aware of that does incident management has some sort of severity rating system. The highest severity is often referred to as either a SEV1 or a SEV0 depending on the organization. (As is our wont, we software types love arguing about whether indexing should begin at zero or at one).

Severity can be a useful shorthand for communicating to an organization during incident response, although it’s a stickier concept than most people realize (for example, see Em Ruppe’s SRECon ’24 talk What Is Incident Severity, but a Lie Agreed Upon? and Dan Slimmon’s blog post Incident SEV scales are a waste of time). However, after the incident has been resolved, severity serves a different purpose: the higher the severity, the more attention the incident will get in the post-incident activities. I was reminded of this by John Allspaw’s Fix-mas Pep Talk, which is part of Uptime Labs’s Fix-mas Countdown, a sort of Advent of Incidents. In the short video, John argues for the value in spending time analyzing lower severity incidents, instead of only analyzing the higher severity ones.

Even if you think John’s idea a good one (and I do!), lower severity incidents happen more often than higher severity ones, and you probably don’t have the resources to analyze every single lower severity incident that comes along. And that got me thinking: what if, in addition to rating an incident by severity, we also gave each incident a separate rating on its learning potential? This would be a judgment on how much insight we think we would get if we did a post-incident analysis, which will help us decide whether we should spend the time actually doing that investigation.

Now, there’s a paradox here, because we have to make this call before we’ve done an actual post-incident investigation, which means we don’t yet know what we’ll learn! And, so often, what appears on the surface to be an uninteresting incident is actually much more complex once we start delving into the details.

However, we all have a finite number of cycles. And so, like it or not, we always have to make a judgment about which incidents we’re going to spend our engineering resources on in doing an analysis. The reason I like the idea of having a learning potential assessment is that it forces us to put initial investment into looking for those interesting threads that we could pull on. And it also makes explicit that severity and learning potential or two different concerns. And, as software engineers, we know that separation of concerns is a good idea!

Quick takes on the Triple Zero Outage at Optus – the Schott Review

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.
— @norootcause.surfingcomplexity.com on Bluesky (@norootcause) July 19, 2024

On September 18, 2025, the Australian telecom company Optus experienced an incident where many users were unable to make emergency service calls from their cell phones. For almost 14 hours, about 75% of calls made to 000 (the Australian version of 911) failed to go through, when from South Australia, Western Australia, the Northern Territory, and parts of New South Wales.

The Optus Board of Directors commissioned an independent review of the incident, led by Dr. Kerry Schott. On Thursday, Optus released Dr. Schott’s report, which the press release refers to as the Schott Review. This post contains my quick thoughts on the report.

As always, I recommend that you read the report yourself first before reading this post. Note that all quotes are from the Schott Review unless indicated otherwise.

The failure mode: a brief summary

I’ll briefly recap my understanding of the failure mode, based on my reading of the report. (I’m not a network engineer, so there’s a good chance I’ll get some details wrong here).

Optus contracts with Nokia to do network operations work, and the problem occurred while Nokia network engineers were carrying out a planned software upgrade of multiple firewalls at the request of Optus.

There were eighteen firewalls that required upgrading. The first fifteen upgrades were successful: the problem occurred when upgrading the sixteenth firewall. Before performing the upgrade, the network engineer isolated the firewall so that it would not serve traffic while it was being upgraded. However, the traffic that would normally pass through this firewall was not rerouted to another active firewall. The resulting failure mode only affected emergency calls: regular phone calls that would normally traverse the offline firewall were automatically rerouted, but the emergency calls were not. As a result, 000 calls were blocked for the customers whose calls would normally traverse this firewall.

Mistakes were made

These mistakes can only be explained by a lack of care about a critical service and a lack of disciplined adherence to procedure. Processes and controls were in place, but the correct process was not followed and actions to implement the controls were not done or not done properly. – Independent Report – The Triple Zero Outage at Optus: 18 September 2025

One positive thing I’ll say about the tech industry: everybody at least pays lip service to the idea of blameless incident reviews. The Schott Review, on the other hand, does not. The review identifies ten mistakes that engineers made:

Some engineers failed to attend project meetings to assess impact of planned work.
The pre-work plan did not clearly include a traffic rerouting before the firewall was isolated.
Nokia engineers chose the wrong method of procedure to implement the firewall upgrade change request (missing traffic rerouting).
The three Nokia engineers who reviewed the chosen method of procedure did not notice the missing traffic rerouting step.
The risk of the change was incorrectly classified as ‘no impact’, and the urgency of the firewall upgrade was incorrectly classified as ‘urgent’.
The firewall was incorrectly classified as ‘low risk’ in the configuration management database (CMDB) asset inventory
The upgrade was performed using the wrong method of procedure
Both a Nokia and an Optus alert fired during the upgrade. Nokia engineers assumed the alert was a false alarm due to the upgrade. Optus engineers opened an incident and reached out to Nokia command centre to ask if it was related to the upgrade. No additional investigation was done.
The network engineers who did the checks to ensure that the upgrade was successful did not notice that the call failure rates were increasing.
Triple Zero call monitoring was aggregated at the national level, so it could not identify region-specific failures.

Man, if I ever end up teaching a course in incident investigations, I’m going to use this report as an example of what not to do. This is hindsight-bias-palooza, with language suffused with human error.

What’s painful to me reading this is the acute absence of curiosity about how it was that these mistakes came to happen. For example, mistakes 5 and 6 involve classifications. The explicit decision to make those classifications must have made sense to the person in the moment, otherwise they would not have made those decisions. And yet, the author conveys zero interest in that question at all.

In other cases, the mistakes are things that were not noticed. For example, mistake 4 involves three (!) reviewers not noticing an issue with a procedure, and mistake 9 involves not noticing that call failure rates were increasing. But it’s easy to point at a graph after the incident and say “this graph indicates a problem”. If you want to actually improve, you need to understand what conditions led to that not being noticed when the actual check happened. What about the way the work is done made it less likely that this would not be seen? For example, is there a UX issue in the standard dashboards that makes this hard to see? I don’t know the answer to that reading the report, and I suspect Dr. Schott doesn’t either.

What’s most damning, though, is the lack of investigation into what are labeled as mistakes 2 and 3:

The strange thing about this specific Change Request was that in all past six IX firewall upgrades in this program – one as recent as 4 September – the equipment was not isolated and no lock on any gateway was made.

For reasons that are unclear, Nokia selected a 2022 Method of Procedure which did not automatically include a traffic diversion before the gateway was locked.

How did this work actually get done? The report doesn’t say.

Now, you might say, “Hey, Lorin, you don’t know what constraints that Dr. Schott was working under. Maybe Dr. Schott couldn’t get access to the people who knew the answers to these questions?” And that would be an absolutely fair rejoinder: it would be inappropriate for me to pass judgment here without having any knowledge about how the investigation was actually done. But this is the same critique I’m leveling at this report: it passes judgment on the engineers who made decisions without actually looking at how real work normally gets done in this environment.

I also want to circle back to this line:

These mistakes can only be explained by a lack of care about a critical service and a lack of disciplined adherence to procedure. Processes and controls were in place, but the correct process was not followed and actions to implement the controls were not done or not done properly (emphasis mine).

None of the ten listed mistakes are about a process not being followed! In fact, the process was followed exactly as specified. This makes the thing labeled mistake 7 particularly egregious, because the mistake was the engineer corrrectly carrying out the selected and peer-reviewed process!

No acknowledgment of ambiguity of real work

The fact that alerts can be overlooked because they are related to ongoing equipment upgrade work is astounding, when the reason for those alerts may be unanticipated problems caused by that work.

The report calls out the Nokia and Optus engineers for not investigating the alerts that fired during the upgrade, describing this as astounding. Anybody who has done operational work can tell you that the signals that you receive are frequently ambiguous. Was this one such case? We can’t tell from the report.

In the words of the late Dr. Richard Cook:

11) Actions at the sharp end resolve all ambiguity.
Organizations are ambiguous, often intentionally, about the relationship between
production targets, efficient use of resources, economy and costs of operations, and
acceptable risks of low and high consequence accidents. All ambiguity is resolved by
actions of practitioners at the sharp end of the system. After an accident, practitioner actions may be regarded as ‘errors’ or ‘violations’ but these evaluations are heavily
biased by hindsight and ignore the other driving forces, especially production pressure.
— Richard Cook, How Complex Systems Fail

I personally find it astounding that somebody conducting an incident investigation would not delve deeper into how a decision that appears to be astounding would have made sense in the moment.

Some unrecognized ironies

There are several ironies in the report that I thought were worth calling out (emphasis mine in each case).

The firewall upgrade in this case was an IX firewall and no traffic diversion or isolation was required or performed, consistent with previous (successful) IX firewall upgrades.
Nevertheless, having expressed doubts about the procedure, it was decided by the network engineer to isolate the firewall to err on the side of caution. The problem with this decision was that the equipment was isolated, but traffic was not diverted.

Note how a judgment call intended to reduce risk actually increased it!

At the time of the incident, the only difference between the Optus network and other
telecommunication networks was the different ‘emergency time-out’ setting. This setting (timer) controls how long an emergency call remains actively attempting to connect. The emergency inactivity timer at Optus was set at 10 seconds, down from the previous 600 seconds, following regulator concerns that Google Pixel 6A devices running Android 12 were unable to reconnect to Triple Zero after failed emergency calls. Google advised users to upgrade to Android 13, which resolved the issue, but Optus also reduced their emergency inactivity timer to 10 seconds to enable faster retries after call failures.

We understand that other carriers have longer time-out sets that may range from 150 to 600 seconds. It appears that the 10 second timing setting in the Optus network was the only significant difference between Optus’ network and other Australian networks for Triple Zero behaviour. Since this incident this time setting has now been changed by Optus to 600 seconds.

One of the contributors was a timeout, which had been previously reduced from 600 seconds to 10 seconds in order to address a previous problem failed emergency calls on specific devices.

Customers should be encouraged⁷ to test their own devices for Triple Zero calls and, if in doubt, raise the matter with Optus.

⁷ This advice is contrary to Section 31 of the ECS Determination to ‘take steps to minimise non-genuine calls’. It is noted, however, that all major carriers appear to be pursuing this course of action.

In one breath, the report attributes the incident to a lack of disciplined adherence to procedure. In another, the report explicitly recommends that customers test that 000 is working, noting in passing that this is advice contradicts Australian government policy.

Why a ‘human error’ perspective is dangerous

The reason a ‘human error’ perspective like this report is dangerous is because it hides away the systemic factors that led to those errors in the first place. By identifying the problem as engineers who failed to follow the procedure or were careless (what I call sloppy devs), we learn nothing about how real work in the system actually happens. And if you don’t understand how the real work happens, you won’t understand how the incident happens.

Two bright spots

Despite these criticisms, there are two sections that I wanted to call out as positive examples, as the kind of content I’d like to see more of in these sorts of documents.

Section 4.3: Contract Management addresses a systemic issue, the relationship between Optus and a vendor of theirs, Nokia. This incident involved an interaction between the two organizations. Anyone who has been involved in an incident that involves a vendor can tell you that coordinating across organizations is always more difficult than coordinating within an organization. The report notes that Optus’s relationship with Nokia has historically been transactional, and suggests that Optus might consider whether it would see benefits moving to [a partnership] style of contract management for complex matters.

Section 5.1: A General Note on Incident Management discusses how it is more effective to have on hand a small team of trained staff who have the capacity to adapt to the circumstances as they unfold over having a large document that describes how to handle different types of emergency scenarios. If this observation gets internalized by Optus, then I think the report is actually net positive, despite my other criticisms.

What could have been

Another irony here is that Australia has a deep well of safety experts to draw from, thanks to the Griffith University Safety Science Innovation Lab. I wish a current or former associate of that lab had been contacted to do this investigation. Off the top of my head, I can name Sidney Dekker, David Provan, Drew Rae, Ben Hutchinson, and Georgina Poole. Any of them would have done a much better job.

In particular, The Schott Review is a great example of why Dekker’s The Field Guide to Understanding ‘Human Error’ remains such an important book. I presume the author of the report has never read it.

The Australian Communications and Media Authority (ACMA) is also performing an investigation into the Optus Triple Zero outage. I’m looking forward to seeing how their report compares to the Schott Review. Fingers crossed that they do a better job.

Why I don’t like “Correction of Error”

Like many companies, AWS has a defined process for reviewing incidents. They call their process Correction of Error. For example, there’s a page on Correction of Error in their Well-Architected Framework docs, and there’s an AWS blog entry titled Why you should develop a correction of error (COE).

In my previous blog post on the AWS re:Invent talk about the us-east-1 incident that happened back in October, I wrote the following:

Finally, I still grit my teeth whenever I hear the Amazonian term for their post-incident review process: Correction of Error.

On LinkedIn, Michael Fisher asked me: “Why does the name Correction of Error make you grit your teeth?” Rather than just reply in a LinkedIn comment, I thought it would be better if I wrote a blog post instead. Nothing in this post will be new to regular readers of this blog.

I hate the term “Correction of Error” because it implies that incidents occur as a result of errors. As a consequence, it suggests that the function of a post-incident review process is to identify the errors that occurred and to fix them. I think this view of incidents is wrong, and dangerously so: It limits the benefits we can get out of an incident review process.

Now, I will agree that, in just about every incident that happens, during the post-incident review work, you can identify errors that happened. This almost always involves identifying one or more bugs in code (I’ll call these software defects here). You can also identify process errors: things that people did that, in hindsight, we can recognize as having contributed to an incident. (As an aside, see Scott Snook’s book Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq for a case study of an incident where there were no such errors even identified. But I’ll still assume here that we can always identify errors after an incident).

So, if I agree that there are always errors involved in incidents, what’s my problem with “Correction of Error”? In short, my problem with this view is that it fundamentally misunderstands the nature the role of both software defects and human work in incidents.

More software defects than you can count

Wall Street indexes predicted nine out of the last five recessions! And its mistakes were beauties. – Paul Samuelson

Yes, you can always find software defects in the wake of an incident. The problem with attributing an incident to a software defect is that modern software systems running in production are absolutely riddled with defects. I cannot count the times I’ve read about an incident that involved a software defect that had been present for months or even years, and had gone completely undetected until conditions arose where it contributed to an outage. My claim is that there are many, many, many such defects in your system, that have yet to manifest as outages. Indeed, I think that most of this defects will never manifest as outages.

If your system is currently up (which I bet it is), and if your system currently has multiple undetected defects in it (which I also bet it does), then it cannot be the case that defects are a sufficient condition for incidents to occur. In other words, defects alone can’t explain incidents. Yes, they are part of the story of incidents, but only a part of it. By focusing solely on the defects, the errors, means that you won’t look at the systemic nature of system failures. You will not see how the existing defects interact with other behaviors in the system to enable the incident. You’re looking for errors, and unexpected interactions aren’t “errors”.

For more of my thoughts on this point, see my previous posts The problem with a root cause is that it explains too much, Component defects: RCA vs RE and Not causal chains, but interactions and adaptations.

On human work and correcting errors

Another concept that error invokes in my mind is what I called process error above. In my experience, this typically refers to either insufficient validation during development work that led to a defect making it into production, or an operational action that led to the system getting into a bad state (for example, a change to network configuration that results in services accidentally becoming inaccessible). In these cases, correction of error implies making a change to development or operational processes in order to prevent similar errors happening in the future. That sounds good, right? What’s the problem with looking at how the current process led to an error, and changing the process to prevent future errors?

There are two problems here. One problem is that there might not actually be any kind of “error” in the normal work processes, that these processes are actually successful in virtually every circumstance. Imagine if I asked, “let’s tally up the number of times the current process did not lead to an incident, versus the number of times that the current process did lead to an incident, and use that to score how effective the process is.” Correction of error implies to me that you’re looking to identify an error in the work process, it does not imply “let’s understand how the normal work actually gets done, and how it was a reasonable thing for people to typically work that way.” In fact, changing the process may actually increase the risk of future incidents! You could be adding constraints, which could potentially lead to new dangerous workarounds. What I want is a focus on understanding how the work normally gets done. Correction of error implies the focus is specifically on identifying the error and correcting it, not on understanding the nature of the work and how decisions made sense in the moment.

Now, sometimes people need to use workarounds in order to get their work done because there are constraints that are preventing them from doing the work the way they are supposed to do it, and the workaround is dangerous in some way, and that contributes to the incident. And this is an important insight to take away from an incident! But this type of workaround isn’t an error, it’s an adaptation. Correction of error to me implies changing work practices identified as erroneous. Changing work practices is typically done by adding constraints on the way work is done. And it’s these exact type of constraints that can increase risk!

Remember, errors happen every single day, but incidents don’t. Correction of error evokes the idea that incidents are caused by errors. But until you internalize that errors aren’t enough to explain incidents, you won’t understand how incidents actually happen in complex systems. And that lack of understanding will limit how much you can genuinely improve the reliability of your system.

Finally, I think that correcting errors is such a feeble goal for a post-incident process. We can get so much more out of post-incident work. Let’s set our sights higher.

AWS re:Invent talk on their Oct ’25 incident

Last month, I made the following remark on LinkedIn about the incident that AWS experienced back in October.

To Amazon’s credit, there was a deep dive talk on the incident at re:Invent! OK, it wasn’t the keynote, but I’m still thrilled that somebody from AWS gave that talk. Kudos to Amazon leadership for green-lighting a detailed talk on the failure mode, and to Craig Howard in particular for giving this talk.

In my opinion, this talk is the most insightful post-incident public artifact that AWS has ever produced, and I really hope they continue to have these sorts of talks after significant outages in the future. It’s a long talk, but it’s worth it. In particular, it goes into more detail about the failure mode than the original write-up.

Tech that improves reliability bit them in this case

This is yet another example of a reliable system that fails through unexpected behavior of a subsystem whose primary purpose was to improve reliability. In particular, this incident involved the unexpected interaction of the following mechanisms, all of which are there to improve reliability.

Multiple enactor instances – to protect against individual enactor instances failing
Locking mechanism – to make it easier for engineers to reason about the system behavior
Cleanup mechanism – to protect against saturating Route 53 by using up all of the records
Transactional mechanism – to protect against the system getting into a bad state after a partial failure (this is “all succeeds or none succeeds”)
Rollback mechanism – to be able to recover quickly if a bad plan is deployed

These all sound like good design decisions to me! But in this case, they contributed to an incident, because of an unanticipated interaction with a race condition. Note that many of these are anticipating specific types of failures, but we can never imagine all types of failures, and the ones that we couldn’t imagine are the ones that bite us.

Things that made the incident hard

This talk not only discusses the failure itself, but also the incident response, and what made the incident response more difficult. This was my favorite part of the talk, and it’s the first time I can remember anybody from Amazon talking about the details of incident response like this.

Some of the issues that issues that Howard brought up:

They used UUIDs as identifiers for plans, which was difficult for the human operators to work with as compared to more human-readable identifiers
There were so many alerts firing that it took them fifteen minutes to look through all of them and find the one that told them what the underlying issue is
the logs that were outputted did not make it easy to identify the sequence of events that led to the incidents.

He noted that this illustrates how the “let’s add an alert” approach to dealing with previous incidents can actually hurt you, and that you should think about what will happen in a large future incident, rather than simply reacting to the last one.

Formal modeling and drift

This incident was triggered by a race condition, and race conditions are generally very difficult to identify in development without formal modeling. They had not initialized modeling this aspect of DynamoDB beforehand. When they did build a formal model (using TLA+) after the incident, they discovered that the original design didn’t have this race condition, but later incremental changes to the system did introduce it. This means that, at design time, if they had formally modeled the system, they wouldn’t have caught it then either, because it wasn’t there at design time.

Interestingly, they were able to use AI (Amazon Q, of course) to check correspondence between the model and the code. This gives me some hope that AI might make it a lot easier to keep models and implementation in sync over time, which would increase the value of maintaining these models.

Fumbling towards resilience engineering

Amazon is, well, not well known for embracing resilience engineering concepts.

Listening to this talk, there were elements of it that gestured in the direction of resilience engineering, which is why I enjoyed it so much. I already wrote about how Howard called out elements that made the incident harder. He also talked about how post-incident analysis can take significant time, and it’s a very different type of work than the heat-of-the-moment diagnostic work. In addition, there were some good discussion in the talk about tradeoffs. For example, he talked about caching tradeoffs in the context of negative DNS caching and how that behavior exacerbated this particular incident. He also spoke about how there are broader lessons that others can learn from this incident, even though you will never experience the specific race condition that they did. These are the kinds of topics that the resilience in software community has been going on about for years now. Hopefully, Amazon will get there.

And while I was happy that this talk spent time on the work of incident response, I wish it had gone farther. Despite the recognition earlier in the talk about how incident response was made more difficult by technical decisions, in the lessons learned section at the end, there was no discussion about “how do we design our system to make it easier for responders to diagnose and mitigate when the next big incident happens?”.

Finally, I still grit my teeth whenever I hear the Amazonian term for their post-incident review process: Correction of Error.

Quick takes on the Dec 5 Cloudflare outage

Poor Cloudflare! It was less than a month ago that they suffered a major outage (I blogged about that here), and then, yesterday, they had another outage. This one was much shorter (~25 minutes versus ~190 minutes), but was significant enough for them to do a public write-up. Let’s dive in!

Trying to make it better made it worse

As part of our ongoing work to protect customers who use React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.

The work that triggered this incident was a change they were deploying to protect users against a recently disclosed vulnerability. This is exactly the type of change you would hope your protection-as-a-service vendor would make!

Burned by a system designed to improve reliability

We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly…

We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.

However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied … an error was … encountered while processing the overall results of evaluating the ruleset… Lua returned an error due to attempting to look up a value in a nil value.

(emphasis added)

I am a huge fan of killswitches as a reliability feature. Systems are always going to eventually behave in ways you didn’t expect, and when that happens, the more knobs you have available to you to alter the behavior of the system at runtime, the better positioned that you will be to deal with an unexpected situation. This blog post calls out how the killswitch has helped them in the past.

But all reliability features add complexity, and that additional complexity can create new and completely unforeseen failure modes. Unfortunately, such is the nature of complex systems.

We don’t eliminate the risks, we trade them off

We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base. (emphasis added)

In all of our decisions and actions, we’re making risk tradeoffs, even if we’re not explicitly aware of it. We’ve all faced this scenario when applying a security patch for an upstream dependency. This sort of take-action-to-block-a-vulnerability is a good example of a security vs. reliability trade-off.

“Fail-Open” Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests.

There are many examples of the security vs. reliability trade-off, such as “do we fail closed or fail open?” (explicitly called out in this post!) or “who gets access by default to the systems that we might need to make changes to during an incident”?

Even more generally, there are the twin risks of “taking action” and “not taking action”. People like to compare doing software development on a running system to rebuilding the plane while we’re flying it, a simile which infuriates my colleague (and amateur pilot) J. Paul Reed. I think performing surgery is actually a better analogy. We’re making changes to a complex system, but the reason we’re making changes is that we believe the risk of not making changes is even higher.

I don’t think it’s a coincidence that the famous treatise How complex systems fail was written by an anesthesiologist, and that one of his observations in that paper was that all practitioner actions are gambles.

The routine as risky

The triggering change here was an update to the ruleset in their Web Application Firewall (WAF) product. Now, we can say with the benefit hindsight that there was risk in making this change, because it led to an incident! But we can also say that there was risk in making this change, because there is always risk with making any change. As I’ve written previously, any change can break us, but we can’t treat every change the same.

How risky was this change understood to be in advance? There isn’t enough information in the writeup to determine this; I don’t know how frequently they make these sorts of rule changes to their WAF product. But, given the domain that they work in, I suspect rule changes happen on a regular basis, and presumably this was seen as a routine sort of ruleset change. We all do our best to assess the risk of a given change, but our own internal models of risk are always imperfect. We can update them over time, but there will always be gaps, and those gaps will occasionally bite us.

Saturation rears its ugly head once again

Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB...

…we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.

During rollout, we noticed that our internal WAF testing tool did not support the increased buffer size. As this internal test tool was not needed at that time and had no effect on customer traffic, we made a second change to turn it off...

Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.

(emphasis added)

I’ve written frequently about saturation as a failure mode in complex systems, where some part of the system gets overloaded or otherwise reaches some sort of limit. Here, the failure mode was not saturation: it was a logic error in a corner case that led to the Lua equivalent of null pointer exceptions, resulting in 500 errors being thrown.

Fascinatingly, saturation still managed to play a role in this incident. Here, there was an internal testing tool that became, in some sense, saturated: it couldn’t handle the 1MB buffer size that was needed for this analysis.

As a consequence, Cloudflare engineers turned off the tool, and it was the turning off of their WAF rule testing tool that triggered the error state.

Addressing known reliability risks takes time

In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur…

We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet. (emphasis added)

I wrote above about that our risk models are never perfectly accurate. However, even in the areas where our risk models are accurate, that still isn’t sufficient for mitigating risk! Addressing known risks can involve substantial engineering effort. Rolling out the new tech takes time, and, as I’ve pointed out here, the rollout itself also carries risk of triggering an outage!

This means that we should expect to frequently encounter incidents that involve a known risk where there was work in-flight to address this risk, but that this work had not yet landed before the risk manifested once again as an incident.

I wrote about this previously in a post titled Cloudflare and the infinite sadness of migrations.

There’s nothing more random than clusters

This is a straightforward error in the code, which had existed undetected for many years. (emphasis added)

Imagine I told you that there was an average of 12 plane crashes a year, and that crash events were independent, and that they were uniformly distributed across the year (i.e., they were just as likely to happen on one day as any other day). What is the probability of there being exactly one plane crash each month?

Let’s simplify the math by assuming each month is of equal length. It turns out that this probability is

P = \frac{12!}{12^{12}} \approx 0.0000537

The numerator is the number of different ways you can order 12 crashes so that there’s one in each month, and the denominator is all possible ways you can distribute 12 crashes across 12 months.

This means that the probability that there’s at least one month with multiple crashes in 99.95%. You’ll almost never get a year with exactly one crash a month, even though one crash a month is the average. Now, we humans will tend to look at one of these months with multiple crashes and assume that something has gotten worse, but having multiple crashes in at least one month in a year is actually the overwhelmingly likely thing to happen.

I bring this up because Cloudflare just had two major incidents within about a month of each other. This will almost certainly lead people to speculate about “what’s going on with reliability at Cloudflare???” Now, I don’t work at Cloudflare, and so I have no insight into whether there are genuine reliability issues there or not. But I do know that clusters of incidents are very likely to happen by random chance alone, and I think it would be a mistake to read too much into the fact that they had back-to-back incidents.

Incidents: the exceptional as routine

In yesterday’s post, I was looking at the Cloudflare’s public incident data to see if the time-to-resolve was under statistical control. Today I want to look at just the raw counts.

Here’s a graph that shows a count of incidents reported per day, color-coded by impact.

Cloudflare is reporting just under two incidents per day for the time period I looked at (2025-01-01 to 2025-11-27), for minor, major, and critical incidents that are not billing-related.

I spot checked the data to verify I wasn’t making any obvious mistakes. For example, there were, indeed, eight reported incidents on November 12, 2025:

(Now, you might be wondering: are these all “distinct” incidents, or are they related? I can’t tell from the information provided on the Cloudflare status pages. Also, the question itself illustrates the folly of counting incidents. A discrete incident is not a well-defined thing, and you might want to call something “one incident” for one purpose but “multiple incidents” for a different purpose).

Two incidents per day sounds like a lot, doesn’t it? Contrast this with AWS, which reports significantly fewer incidents than Cloudflare, despite offering a broader array of services: you can see on the AWS service health page (click on “List of events” and set “Locales” to “all locales”, or you can look at the Google sheet I copy-pasted this data into) that they reported only 36 events in that same time period, giving them an average of about one incident every nine days.

(AWS doesn’t classify impact, so I just marked the Oct 20 incident as critical and marked the others as minor, in order to make the visualization consistent with the Cloudflare graph).

But don’t let the difference in reported incidents fool you into thinking that Cloudflare deals with many more incidents than AWS does. Instead, what’s almost certainly going on is that Cloudflare is more open about reporting incidents than AWS is. I am convinced that Cloudflare’s incident reporting is much closer to reality than AWS’s. In fact, if you walked into any large tech company on any day of the week, I have high confidence that someone would be working on resolving an ongoing incident.

Incidents are always exceptional, by definition: they are negative-impacting events which we did not expect to happen. But the thing is, they’re also normal, in the sense that they happen all of the time. Now, most of these incidents are minor, which is why you aren’t constantly reading about them in the press: it’s only the large-scale conflagrations that you’ll hear about. But there are always small fires burning, along with engineers who are in the process of fighting these fires. This is the ongoing reliability work that is all-too-often invisible.

Fun with incident data and statistical process control

Last year, I wrote a post called TTR: the out-of-control metric. In that post, I argued that the incident response process(in particular, the time-to-resolution metric for incidents) will never be under statistical control. I showed two notional graphs. The first one was indicative of a process that was under statistical control:

The second graph showed a process that was not under statistical control:

And here’s what I said about those graphs:

Now, I’m willing to bet that if you were to draw a control chart for the time-to-resolve (TTR) metric for your incidents, it would look a lot more like the second control chart than the first one, that you’d have a number of incidents whose TTRs are well outside of the upper control limit.

I thought it would be fun to take a look at some actual publicly available incident data to see what a control chart with incident data actually looked like. Cloudflare’s been on my mind these days because of their recent outage so I thought “hey, why don’t I take a look at Cloudflare’s data?” They use Atlassian Statuspage to host their status, which includes a history of their incidents. The nice thing about Statuspage is that if you pass the Accept: application/json header to the /history URL, you’ll get back JSON instead of HTML, which is convenient for analysis.

So, let’s take a look at a control chart of Cloudflare’s incident TTR data to see if it’s under statistical control. I’m going into this knowing that my results are likely to be extremely unreliable: because I have no first-hand knowldge of this data, I have no idea what the relationship is between the time an incident was marked as resolved in Cloudflare’s status page and the time that customers were no longer impacted. And, in general, this timing will vary by customer, yet another reason why using a single number is dangerous. Finally, I have no experience with using statistical process control techniques, so I’ll just be plugging the data into a library that generates control charts and seeing what comes out. But data is data, and this is just a blog post, so let’s have some fun!

Filtering the data

Before the analysis, I did some filtering of their incident data.

Cloudflare categorizes each incident as one of critical, major, minor, none, maintenance. I only considered incidents that were classified as either critical, major, or minor; I filtered out the ones labeled none and maintenance.

Some incidents had extremely large TTRs. The four longest ones were 223 days, 58 days, 57 days, and 22 days, respectively. They were also all billing-related issues. Based on this, I decided to filter out any billing-related incidents.

There were a number of incidents where I couldn’t automatically determine the TTR from the JSON: These are cases where Cloudflare has a single update on the status page, for example Cloudflare D1 – API Availability Issues. The duration is mentioned in the resolve message, but I didn’t go through the additional work of trying to parse out the duration from the natural language messages (I didn’t use an AI doing any of this, although that would be a good use case!). Note that these aren’t always short incidents: Issues with Dynamic Steering Load Balancers says The unexpected behaviour was noted between January 13th 23:00 UTC and January 14th 15:45 UTC, but I can’t tell if they mean “the incident lasted for 16 hours and 45 minutes” or they are simply referring to when they detected the problem. At any rate, I simply ignored these data points.

Finally, I looked at just the 2025 incident data. That left me with 591 data points, which is a surprisingly rich data set!

The control chart

I used the pyshewhart Python package to generate the control charts. Here’s what they look like for the Cloudflare incidents in 2025:

As you can see, this is a process that is not under statistical control: there are multiple points outside of the upper control limit (UCL). I particularly enjoy how the pyshewhart package superimposes the “Not In Control” text over the graphs.

If you’re curious, the longest incident of 2025 was AWS S3 SDK compatibility inconsistencies with R2, a minor incident which lasted about 18 days. The longest major incident of 2025 was Network Connectivity Issues in Brazil, which lasted about 6 days. The longest critical incident was the one that happened back on Nov 18, Cloudflare Global Network experiencing issues, clocking in at about 7 hours and 40 minutes.

Most of their incidents are significantly shorter than these long ones. And that’s exactly the point: most of the incidents are brief, but every once in a while there is an incident that’s much longer.

Incident response will never be under statistical control

As we can see from the control chart, the Cloudflare TTR data is not under statistical control, we see clear instances of what the statisticians Donald Wheeler and David Chambers call exceptional variation in their book Understanding Statistical Process Control.

For a process that’s not under statistical control, a sample mean like MTTR isn’t informative: it has no predictive power, because the process itself is fundamentally unpredictable. Most incidents might be short, but then you hit a really tough one, that just takes you much longer to mitigate.

Advocates of statistical process control would tell you that the first thing you need to in order to improve the system is to get the process under statistical control. The grandfather of statistical process control, the American statistician Walter Shewhart, argued that you had to identify what he called Assignable Causes of exceptional variation and address those first in order to eliminate that exceptional variation, bringing the process under statistical control. Once you did that, then you could then address the Chance Causes in order to reduce the routine variation of the system.

I think we should take the lesson from statistical process control that a process which is not under statistical control is fundamentally unpredictable, and that we should reject the use of metrics like MTTR precisely because you can’t characterize a system out of statistical control with a sample mean.

However, I don’t think Shewhart’s proposed approach to bringing a system under statistical control would work for incidents. As I wrote in TTR: the out-of-control metric, an incident is an event that occurs, by definition, when our systems have themselves gone out of control. While incident response may frequently feel like it’s routine (detect a deploy was bad and roll it back!), we’re dealing with complex systems, and complex systems will occasionally fail in complex and confusing ways. There are a lot more ways that systems break, and the difference between an incident that lasts, say, 20 minutes and one that lasts four hours can come down to whether someone with a relevant bit of knowledge happens to be around and can bring that knowledge to bear.

This actually gets worse for more mature engineering organizations: the more reliable a system is, the more complex its failure modes are going to be when it actually does fail. If you reach a state where all of your failure modes are novel, then each incident will present a set of unique challenges. This means that the response will involve improvisation, and the time will depend on how well positioned the responders are to deal with this unforeseen situation.

That being said, we should always be striving to improve our incident response performance! But no matter how much better we do, we need to recognize that we’ll never be able to bring TTR under statistical control. And so a metric like MTTR will forever be useless.

Brief thoughts on the recent Cloudflare outage

I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their public writeup and capture my thoughts until now. As always, I recommend you read through the writeup first before you read my take.

All quotes are from the writeup unless indicated otherwise.

Hello saturation my old friend

The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

One thing I hope readers take away from this blog post is the complex systems failure mode pattern that resilience engineering researchers call saturation. Every complex system out there has limits, no matter how robust that system is. And the systems we deal with have many, many different kinds of limits, some of which you might only learn about once you’ve breached that limit. How well a system is able to perform as it approaches one of its limits is what resilience engineering is all about.

Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.

In this particular case, the limit was set explicitly.

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

As sparse as the panic message is, it does explicitly tell you that the problematic call site was an unwrap call. And this is one of the reasons I’m a fan of explicit limits over implicit limits: you tend to get better error messages than when breaching an implicit limit (e.g., of your language runtime, the operating system, the hardware).

A subsystem designed to protect surprisingly inflicts harm

Identify and mitigate automated traffic to protect your domain from bad bots. – Cloudflare Docs

The problematic behavior was in the Cloudflare Bot Management system. Specifically, it was in the bot scoring functionality, which estimates the likelihood that a request came from a bot rather than a human.

This is a system that is designed to help protect their customer from malicious bots, and yet it ended up hurting their customers in this case rather than helping them.

As I’ve mentioned previously, once your system achieves a certain level of reliability, it’s the protective subsystems that end up being things that bite you! These subsystems are a net positive, they help much more than they hurt. But they also add complexity, and complexity introduces new, confusing failure modes into the system.

The Cloudflare case is a more interesting one than the typical instances of this behavior I’ve seen, because Cloudflare’s whole business model is to offer different kinds of protection, as products for their customers. It’s protection-as-a-service, not an internal system for self-protection. But even though their customers are purchasing this from a vendor rather than building it in-house, it’s still an auxiliary system intended to improve reliability and security.

Confusion in the moment

What impressed me the most about this writeup is that they documented some aspects of what it was like responding to this incident: what they were seeing, and how they tried to made sense of it.

In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:

Man, if I had a nickel every time I saw someone Slack “Is it DDOS?” in response to a surprising surge of errors returned by the system, I could probably retire at this point.

The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.

We humans are excellent at recognizing patterns based on our experience, and that generally serves us well during incidents. Someone who is really good at operations can frequently diagnose the problem very quickly just by, say, the shape of a particular graph on a dashboard, or by seeing a specific symptom and recalling similar failures that happened recently.

However, sometimes we encounter a failure mode that we haven’t seen before, which means that we don’t recognize the signals. Or we might have seen a cluster of problems recently that followed a certain pattern, and assume that the latest one looks like the last one. And these are the hard ones.

This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack.

This incident was one of those hard ones: the symptoms were confusing. The “problem went away, then came back, then went away again, then came back again” type of unstable incident behavior is generally much harder to diagnose than one where the symptoms are stable.

Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page.

Here they got bit by a co-incident, an unrelated failure of their status page that led them to believe (reasonably!) that the problem must have been external.

I’m still curious as to what happened with their status page. The error message they were getting mentions CloudFront, so I assume they were hosting their status page on AWS. But their writeup doesn’t go into any additional detail on what the status page failure mode was.

But the general takeaway here is that even the most experienced operators are going to take longer to deal with a complex, novel failure mode, precisely because it is complex and novel! As the resilience engineering folks say, prepare to be surprised! (Because I promise, it’s going to happen).

A plea: assume local rationality

The writeup included a screenshot of the code that had an unhandled error. Unfortunately, there’s nothing in the writeup that tells us what the programmer was thinking when they wrote that code.

In the absence of any additional information, a natural human reaction is to just assume that the programmer was sloppy. But if you want to actually understand how these sorts of incidents actually happen, you have to fight this reaction.

People always make decisions that make sense to them in the moment, based on what they know and what constraints they are operating under. After all, if that wasn’t true, then they wouldn’t have made that decision. The only we can actually understand the conditions that enable incidents, we need to try as hard as we can to put ourselves into the shoes of the person who made that call, to understand what their frame of mind was at the moment.

If we don’t do that, we risk the problem of distancing through differencing. We say, “oh, those devs were bozos, I would never have made that kind of mistake”. This is a great way to limit how much you can learn from an incident.

Detailed public writeups as evidence of good engineering

The writeup produced by Cloudflare (signed by the CEO, no less!) was impressively detailed. It even includes a screenshot of a snippet of code that contributed to the incident! I can’t recall ever reading another public writeup with that level of detail.

Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare online using the details they provided as the basis for their criticism.

Now, I think it would advance our industry if people held the opposite view: the more details that are provided an incident writeup, the higher esteem we should hold that organization. I respect Cloudflare is an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there, I want to hear us hold other companies up to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.

You’ll never see attrition referenced in an RCA

In the wake of the recent AWS us-east-1 outage, I saw speculation online about how the departure of experienced engineers played a role in the outage. The most notable one was from the acerbic cloud economist Corey Quinn, in a column he wrote for The Register: Amazon brain drain finally sent AWS down the spout. Amazon’s recent announcement that it will be laying off about 14,000 employees, which includes cuts to AWS, has added fuel to that fire, as I saw in a LinkedIn post by Java luminary and former AWS-er James Gosling that referenced another speculative column on the subject Amazon Just Proved AI Isn’t The Answer Yet Again. I’m not going to comment on the accuracy of these assessments, or more broadly the role that attrition played on this particular incident, because I don’t have any special knowledge here. Instead, I want to use this as an opportunity to talk about the relationship between attrition and incidents, and how that relationship is captured in incident write-ups, both public and internal.

In a public incident write-up, or an RCA provided by a vendor to a customer, you’re never going to see any discussion of the role of attrition. This is because, as noted by John Allspaw in his post What makes public posts about incidents different from analysis write-ups, the purpose of a public write-up is to reassure the audience that the problem that caused the incident is being addressed. This means that the write-up will focus on describing a technical problem and alluding to the technical solution that is being addressed to fix the problem. Attrition isn’t a technical problem, it’s a completely different type of phenomenon. And, as we’ve seen with the recent Amazon layoff announcement, attrition is sometimes an explicit business decision. If a company like Amazon mentioned attrition in a public write-up, it would be much more difficult to answer a question like “how will your upcoming layoff increase the risk of incidents?” There’s no plausible deniability (“it won’t increase the risk of incidents”) if you’ve previously talked about attrition in a public write-up. Because talking about attrition doesn’t fulfill the confidence-building role of the write-up, it’s not going to ever find its way into a document intended for outsiders.

Internal incident write-ups serve a different purpose, and so they don’t have this problem. Indeed, in my own career, I have seen references to the departure of expertise in internal incident write-ups. The first example that comes to mind is the hot potato scenario where there’s a critical service where the original authors are no longer at the company, and the team that originally owned it no longer exists, and so another team becomes responsible for operating that service, even though they don’t have deep knowledge of how the service actually works, and it is so reliable that the team that now owns it doesn’t accumulate operational experience with it. I would wager that every tech company of a certain size has seen this pattern. I’ve also frequently heard discussion of bus factor, which is an explicit reference to attrition risk.

Still, while referencing attrition isn’t a taboo in an internal incident write-up the way it is in a public incident write-up, you’re still not likely to see the topic discussed there. Internal incident write-ups take a narrow view of system failures, focusing on technical details. I wrote a blog post several years ago titled What’s allowed to count as a cause?, and attrition is an example of an issue that falls squarely in the “not allowed to count” category.

Now, you might say, “Lorin, this is exactly why five whys is good, so we can zoom out to identify systemic issues.” My response would be, “attrition is never going to be the sole reason for a failure in a complex system, and identifying only attrition as a factor is just as bad as identifying a different factor and neglecting attrition, because you’re missing so much.” I think of the role of attrition as a contributor to incidents the way that smoking is a contributor to lung cancer, or that climate change is a contributor to severe weather events. It isn’t possible to attribute a particular incidence of lung cancer to smoking, or a particular severe storm to climate changes: smoking is neither necessary nor sufficient for lung cancer, and climate change is neither necessary nor sufficient for a particular storm to be severe. But as with attrition, smoking and climate changes are factors that increase risk. If you use a root cause analysis approach to understanding incidents, you’ll miss the role of contributing factors like attrition.

I would go so far to say that organizational factors play a role in every major incident, where attrition is just one example of an organizational factor. The fact that these don’t appear in the write-up says more about the questions that people didn’t ask than it does about the nature of the incident.

Quick thoughts on the recent AWS outage

AWS recently posted a public write-up of the us-east-1 incident that hit them this past Monday. Here are a couple of quick thoughts on it.

Reliability → Automation → Complexity → New failure modes

Our industry addresses reliability problems by adding automation so that the system can handle faults automatically. But here’s the thing: adding this sort of automation increases the complexity in the system. This increase in complexity due to more sophisticated automation brings two costs along with it. One cost is that the behavior of the system becomes more difficult to reason about. This is the “what is it currently doing, and why is it doing that?” problem that we operators face. The second cost of the increased complexity is that, while this automation eliminates a known class of failure modes, it simultaneously introduces a new class of failure modes. These new failure modes occur much less frequently than the class of failure modes that were eliminated, but when they do occur, they are potentially much more severe.

According to Amazon’s write-up, the triggering event was the unintentional deletion of DNS records related to the DynamoDB service due to a race condition. Even though DNS records were fully restored by 2:25 AM PDT, it wasn’t until 3:01 PM, over twelve and a half hours later, that Amazon declared that all AWS services had been fully restored.

There were multiple issues that complicated the restoration of different AWS services, but the one I want to call out here involved the Network Load Balancer (NLB) service. Delays in the propagation of network state information led to false health check failures: there were EC2 instances that were healthy, but that the NLB categorized as unhealthy because of the network state issue. From the report:

During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated. This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.

This pathological health check behavior led to availability zone DNS failovers, which reduced capacity and led to connection errors.

The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.

Health checks are a classic example of an automation system that is designed to improve reliability. It’s not uncommon for an instance to go unhealthy for some reason, and being able to automatically detect when that happens and take the instance out of the load balancer means that your system can automatically handle failures in individual instances. But, as we see in this case, the presence of this reliability-improving automation made a particular problem (delay in network propagation state) even worse.

As a result of this incident, Amazon is going to change the behavior of the NLB logic in the case of health check failures.

For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover.

Note that this is yet another increase in automation complexity with the goal of improving reliability! That doesn’t mean that this is a bad corrective action, or that health checks are bad. Instead, my point here is that adding automation complexity to improve reliability always involves a trade-off. It’s very easy to forget about that trade-off if you focus only on the existing reliability problem you’re trying to tackle, and not even consider what new reliability problems you are introducing. Even if those new problems are rare, they can be extremely painful, as AWS can attest to.

I’ve written previously about failures due to reliability-improving automation. The other examples from my linked post are also from AWS incidents, but this phenomenon is in no way specific to AWS.

Surprise should not be surprising

Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with [the DropletWorkflow Manager] without causing further issues.

The Amazon engineers didn’t have a runbook to handle this failure scenario, which meant that they had to improvise a recovery strategy during incident response. This is a recurring theme in large-scale incidents: they involve failures that nobody had previously anticipated. The only thing we can really predict about future high-severity incidents is that they are going to surprise us. We are going to keep encountering failure modes we never anticipated, over and over again.

It’s tempting to focus your reliability engineering resources on reducing the risk of known failure modes. But if you only prepare for the failure scenarios that you can think of, then you aren’t putting yourself in a better position to deal with the inevitable situation that you never imagined would ever happen. And the fact that you’re investing in reliability-improving-but-complexity-increasing automation means that you are planting the seeds of those future surprising failure modes.

This means that if you want to improve reliability, you need to invest in both the complexity-increasing reliability automation (robustness), and also in the capacity to be able to better deal with future surprises (resilience). The resilience engineering researcher David Woods uses the term net adaptive value to describe the ability of a system to deal with both predicted failure modes, and to adapt to effectively unpredicted failure modes.

Part of investing in resilience means building human-controllable leverage points so that engineers have a broad range of mitigation actions available to them during future incidents. That could mean having additional capacity on hand that you can throw at the problem, as well as having built in various knobs and switches. As an example from this AWS incident, part of the engineers’ response was to manually disable the health check behavior.

At 9:36 AM, engineers disabled automatic health check failovers for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors to affected load balancers.

But having these sorts of knobs available isn’t enough. You need your responders to have the operational expertise necessary to know when to use it. More generally, if you want to get better at dealing with unforeseen failure mode, you need to invest in improving operational expertise, so that your incident responders are best positioned to make sense of the system behavior when faced with a completely novel situation.

The AWS write-up focuses on the robustness improvements, the work they are going to do to be better prepared to prevent a similar failure mode from happening in the future. But I can confidently predict that the next large-scale AWS outage is going to look very different from this one (although it will probably involve us-east-1). It’s not clear to me from the write-up that Amazon has learned the lesson of how it important is to prepare to be surprised.