Lots of AI SRE, no AI incident management

With the value of AI coding tools now firmly established in the software industry, the next frontier is AI SRE tools. There are a number of AI SRE vendors. In some cases, vendors are adding AI SRE functionality to extend their existing product lineup; a quick online search reveals examples such as PagerDuty’s SRE Agents, Datadog’s Bits AI SRE, incident.io’s AI SRE, Microsoft’s Azure SRE Agent, and Rootly’s AI SRE. There are also a number of pure-play AI SRE startups: the ones I’ve heard of are Cleric, Resolve.ai, Anyshift.io, and RunWhen. My sense of the industry is that AI SRE is currently in the evaluation phase, whereas the coding tools are in the adoption phase.

What I want to write about today is not so much what these AI tools do contribute to resolving incidents, but rather what they don’t contribute. These tools are focused on diagnostic and mitigation work. The idea is to try to automate as much as possible the work of figuring out what the current problem is, and then resolving it. I think most of the focus is, rightly, on the diagnostic side at this stage, although I’m sure automated resolution is also something being pursued. But what none of these tools try to do, as far as I can tell, is incident management.

The work of incident response always involves a group of engineers: some of them are officially on-call, and others are just jumping in to help. Incident management is the coordination work that helps this ad-hoc team of responders work together effectively to get the diagnostic and remediation work done. Because of this, we often say that incident response is a team sport. Incidents involve some sort of problem with the system as a whole, and because everybody in the organization only has partial knowledge of the whole system, we typically need to pool that knowledge together to make sense of what’s actually happening right now in the system. For example, if a database is currently being overloaded, the folks who own the database could tell you that there’s been a change in query pattern, but they wouldn’t be able to tell you why that change happened. For that, you’d need to talk to the team that owns the system that makes those queries.

Fixation: the single-agent problem

Down the rabbit hole. Source: Sincerely Media

Another reason why we need multiple people responding to incidents is that humans are prone to a problem known as fixation. You might know it by the more colloquial term tunnel vision. A person will look at a problem from a particular perspective, which becomes a liability when that perspective is not well matched to the problem at hand. You can even see fixation behavior in the current crop of LLM coding tools: they will sometimes keep going down an unproductive path while trying to implement a feature or resolve an error. While I expect that future coding agents will suffer less from fixation, given that genuinely intelligent humans frequently suffer from this problem, I don’t think that we’ll ever see an individual coding agent get to the point where it completely avoids fixation traps.

One solution to the problem of fixation is to intentionally inject a diversity of perspectives by having multiple individuals attack the problem. In the case of AI coding tools, we deal with the problem of fixation by having a human supervise the work of the coding agent. The human spots when the agent falls down a fixation rabbit hole, and prompts the agent to pursue a different strategy in order to get it back on track. Another way to leverage multiple individuals is to deliberately have them pursue different strategies. For example, in the early oughts, there was a lot of empirical software engineering research into an approach called perspective-based reading for reviewing software artifacts like requirements or design documents. The idea is that you would have multiple reviewers, and you would explicitly assign each reviewer a particular perspective. For example, let’s say you wanted to get a requirements document reviewed. You could have one reviewer read it from the perspective of a user, another from the perspective of a designer, and a third from the perspective of a tester. The idea here is that reading from a different perspective would help identify different kinds of defects in the artifact.

Getting back to incidents, the problem of fixation arises when a responder latches on to one particular hypothesis about what’s wrong with the system and continues following that particular line of investigation, even though it doesn’t bear fruit. As discussed above, having responders with a diverse set of perspectives provides a defense against fixation. This may take the form of pursuing multiple lines of investigation, or even just somebody in the response asking a question like, “How do we know the problem isn’t Y rather than X?”

I’m convinced that an individual AI SRE agent will never be able to escape the problem of fixation, and so incident response will necessarily involve multiple agents. Yes, there will be some incidents where a single AI agent is sufficient. But incident response is a 100% game: you need to recover from every incident. That means that eventually you’ll need to deploy a team of agents, whether they’re humans, AIs, or a mix. And that means incident response will require coordination: in particular, maintaining common ground.

Maintaining common ground is active work

During an incident, many different things are happening at once. There are multiple signals that you need to keep track of, like “what’s the current customer impact?”, “is the problem getting better, worse, or staying the same?”, “what are the current hypotheses?”, “which graphs support or contradict those hypotheses?” The responders will be doing diagnostic work, and they’ll be performing interventions on the system, sometimes to try to mitigate (e.g., “roll back that feature flag change that aligns in time”), and other times to support the diagnostic work (e.g., “we need to make a change to figure out if hypothesis X is actually correct.”)

The incident manager helps to maintain common ground: they make sure that everybody is on the same page, by doing things like helping bring people up to speed on what’s currently going on, and ensuring people know which lines of investigation are currently being pursued and who (if anyone) is currently pursuing them.

If a responder is just joining an incident, an AI SRE agent is extremely useful as a summary machine. You can ask it the question, “what’s going on?”, and it can give you a concise summary of the state of play. But this is a passive use case: you prompt it, and it gives a response. Because the state of the world is changing rapidly during the incident, the accuracy of that answer will decay rapidly with time. Keeping the current state of things up to date in the minds of the responders is an active struggle against entropy.

An effective AI incident manager would have to be able to identify what type of coordination help people need, and then provide that assistance. For example, the agent would have to be able to identify when the responders (be they human or agent) were struggling and then proactively take action to assist. It would need a model of the mental models of the responders to know when to act and what action to take in order to re-establish common ground.

Perhaps there is work in the AI SRE space to automate this sort of coordination work. But if there is, I haven’t heard of it yet. The focus today is on creating individual responder agents. I think these agents will be an effective addition to an incident response team. I’d love it if somebody built an effective incident management AI bot. But it’s a big leap from AI SRE agent to AI incident management agent. And it’s not clear to me how well the coordination problem is understood by vendors today.

On variability

I was listening to Todd Conklin’s Pre-Accident Investigation Podcast the other day, to the episode titled When Normal Variability Breaks: The ReDonda Story. The name ReDonda in the title refers to ReDonda Vaught, an American registered nurse. In 2017, she was working at the Vanderbilt University Medical Center in Nashville when she unintentionally administered the wrong drug to a patient under her care, a patient who later died. Vaught was fired, then convicted by the state of Tennessee of criminally negligent homicide and abuse of an impaired adult. It’s a terrifying story, really a modern tale of witch-burning, but it’s not what this post is about. Instead, I want to home in on a term from the podcast title: normal variability.

In the context of the field of safety, the term variability refers to how human performance is, well, variable. We don’t always do the work the exact same way. This variation happens between humans: different people will do work in different ways. And the variation also happens within humans: the same person will perform a task differently over time. The sources of variation in human performance are themselves varied: level of experience, external pressures being faced by the person, number of hours of sleep the night before, and so on.

In the old view of safety, there is an explicitly safe way to perform the work, as specified in documented procedures. Follow the procedures, and incidents won’t happen. In the software world, these procedures might be: write unit tests for new code, have the change reviewed by a peer, run end-to-end tests in staging, and so on. Under this view of the world, variability is necessarily a bad thing. Since variability means people do work differently, and since safety requires doing work the prescribed way, human variability is a source of incidents. Traditional automation doesn’t have this variability problem: it always does the work the same way. Hence you get the old joke:

The factory of the future will have only two employees: a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.

In the new view of safety, normal variability is viewed as an asset rather than a liability. In this view, the documented procedures for doing the work are always inadequate: they can never capture all of the messy details of real work. It is the human ability to adapt, to change the way that they do the work based on circumstances, that creates safety. That’s why you’ll hear resilience engineering folks use the (positive) term adaptive capacity rather than the (more neutral) human variability, to emphasize that human variability is, quite literally, adaptive. This is why tech companies still staff on-call rotations even though they have complex automation that is supposed to keep things up and running. It’s because the automation can never handle all of the cases that the universe will throw at it. Even sophisticated automation always eventually proves too rigid to handle some particular circumstance that was never foreseen by the designers. This is the perfect-storm, weird-edge-case stuff that post-incident write-ups are made of.

This, again, brings us back to AI.

My own field of software development is being roiled by the adoption of AI-based coding tools like Anthropic’s Claude Code, OpenAI’s Codex, and Google’s Gemini Code Assist. These AI tools are rapidly changing the way that software is being developed, and you can read many blog posts of early adopters who are describing their experiences using these new tools. Just this week, there was a big drop in the market value of multiple software companies; I’ve already seen references to the beginning of the SaaS-Pocalypse, the idea being that companies will write bespoke tools using AI rather than purchasing software from vendors. The field of software development has seen a lot of change in terms of tooling in my own career, but one thing that is genuinely different about these AI-based tools is that they are inherently non-deterministic. You interact with these tools by prompting them, but the same prompt yields different results.

Non-determinism in software development tools is seen as a bad thing. The classic example of non-determinism-as-bad is flaky tests. A flaky test is non-deterministic: the same input may lead to a pass or a fail. Nobody wants non-determinism like this in their test suite. On the build side of things, we hope that our compiler emits the same instructions given the same source file and arguments. There’s even a whole movement around reproducible builds, the goal of which is to stamp out all of the non-determinism in the process of producing binaries from the original source code, where the ideal is achieving bit-for-bit identical binaries. Unsurprisingly, then, the non-determinism of the current breed of AI coding tools is seen as a problem. Here’s a quote from a recent Wall Street Journal article by Chip Cutter and Sebastian Herrera, Here’s Where AI Is Tearing Through Corporate America:

Satheesh Ravala is chief technology officer of Candescent, which makes digital technology used by banks and credit unions. He has fielded questions from employees about what innovations like Anthropic’s new features mean for the company, and responded by telling them banks rely on the company for software that does exactly what it’s supposed to every time—something AI struggles with. 

“If I want to transfer $10,” he said, “it better be $10 not $9.99.”

I believe the AI coding tools are only going to improve with time, though I don’t feel confident in predicting whether future improvements will be orders-of-magnitude or merely incremental. What I do feel confident in predicting is that the non-determinism in these tools isn’t going away.

At their heart, these tools are sophisticated statistical models: they are prediction machines. When you’re chatting with one, it is predicting the next word to say; then it feeds back the entire conversation so far, predicts the next word to say again, and so on. Because they are statistical models, there is some probability distribution over the next word to predict. You could build the system to always choose the most likely next word. Statistical models aren’t just an AI thing, and many statistical models do use such a maximum likelihood approach. But that’s not what LLMs do in general. Instead, some randomness is intentionally injected into the system: rather than always picking the most likely next word, the model does a biased random selection of the next word, based on the statistical model of what’s most likely to come next and on a parameter called temperature (the name is an analogy to physics). If the temperature is zero, then the system always outputs the most likely next word. The higher the temperature, the more random the selection is.
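To make the temperature knob concrete, here’s a minimal sketch in Python of temperature-scaled sampling. The vocabulary, the scores, and the function name are all invented for illustration; real models work over tens of thousands of tokens, but they apply the temperature inside the softmax in much the same way.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick the next token from a dict of raw model scores.

    temperature == 0 means greedy decoding (always the top-scoring token);
    higher temperatures make the selection increasingly random.
    """
    if temperature == 0:
        # Deterministic: always return the most likely token.
        return max(logits, key=logits.get)

    # Softmax with temperature: scale the scores, exponentiate, then
    # sample in proportion to the resulting weights.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(weights.values())
    threshold = random.uniform(0, total)
    cumulative = 0.0
    for tok, weight in weights.items():
        cumulative += weight
        if threshold <= cumulative:
            return tok
    return tok  # floating-point edge case: fall back to the last token

# Hypothetical scores for the word that follows "The database is"
logits = {"overloaded": 2.1, "healthy": 1.7, "on": 0.3}
print(sample_next_token(logits, temperature=0))    # always "overloaded"
print(sample_next_token(logits, temperature=1.2))  # varies from run to run
```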

What’s fascinating to me about this is that the deliberate injection of randomness improved the output of the models, as judged qualitatively by humans. In other words, increasing the variability of the system improved outcomes.

Now, these LLMs haven’t achieved the level of adaptability that humans possess, though they can certainly perform some impressive cognitive tasks. I wouldn’t say they have adaptive capacity, and I firmly believe that humans will still need to be on-call for software systems for the remainder of my career, despite the proliferation of AI SRE solutions. What I am saying is that the ability of LLMs to perform cognitive tasks well depends upon them being able to leverage variability. And my prediction is that this dependence on variability isn’t going to go away. LLMs will get better, and they might even get much better, but I don’t think they’ll ever be deterministic. I think variability is an essential ingredient for a system to be able to perform these sorts of complex cognitive tasks.

On intuition and anxiety

Over at Aeon, there’s a thoughtful essay written by the American anesthesiologist Ronald Dworkin about how he unexpectedly began suffering from anxiety after returning to work from a long vacation. During surgeries he became plagued with doubt, experiencing difficulty making decisions in scenarios that had never been a problem for him before.

Dworkin doesn’t characterize his anxiety as the addition of something new to his state of being. Instead, he interprets becoming anxious as having something taken away from him, as summed up by the title of his essay: When I lost my intuition. To Dworkin, anxiety is the absence of intuition, its opposite.

To compensate for his newfound challenges in decision-making, Dworkin adopts an evidence-based strategy, but the strategy doesn’t work. He struggles with a case that involves a woman who had chewed gum before her scheduled procedure. Gum chewing increases gastric juice in the stomach, which raises the risk of choking while under anesthetic. Should he delay the procedure? He looks to medical journals for guidance, but the anesthesiology studies he finds on the effect of chewing gum were conducted in different contexts from his situation, and their results conflict with each other. This decision cannot be outsourced to previous scientific research: studies can provide context, but he must make the judgment call.

Dworkin looks to psychology for insight into the nature of intuition, so he can make sense of what he has lost. He name-checks the big ideas from both academic psychology and pop psychology about intuition, including Herb Simon’s bounded rationality, Daniel Kahneman’s System 1 and System 2, Roger Sperry’s concept of the analytic left brain and intuitive right brain, and the Myers-Briggs personality test’s notion of intuitive vs analytical. My personal favorite, the psychologist Gary Klein, receives only a single sentence in the essay:

In The Power of Intuition (2003), the research psychologist Gary Klein says the intuitive method can be rationally communicated to others, and enhanced through conscious effort.

In addition, Klein’s naturalistic decision-making model is not even mentioned explicitly. Instead, it’s the neuroscientist Joel Pearson’s SMILE framework that Dworkin connects with the most. SMILE stands for self-awareness, mastery, impulse control, low probability, and environment. It’s through the lens of SMILE that Dworkin makes sense of how his anxiety has robbed him of his intuition: he lost awareness of his own emotional state (self-awareness), he overestimated the likelihood of complications during surgery (low probability), and his long vacation made the hospital feel like an unfamiliar place (environment). I hadn’t heard of Pearson before this essay, but I have to admit that his website gives off the sort of celebrity-academic vibe that arouses my skepticism.

While the essay focuses on the intuition-anxiety dichotomy, Dworkin touches briefly on another dichotomy, between intuition and science. Intuition is a threat to science, because science is about using logic, observation, and measurement to find truth, and intuition is not. Dworkin mentions the incompatibility of science and intuition only in passing before turning back to the role that intuition plays in the work of the professional. The implication here is that professionals face different sorts of problems than scientists do. But I suspect the practice of real science involves a lot more intuition than this stereotyped view of it. I could not help thinking of the “Feynman Problem Solving Algorithm”, so named because it is attributed to the American physicist Richard Feynman.

  1. Write down the problem
  2. Think real hard
  3. Write down the solution

Intuition certainly plays a role in step 2!

Eventually, Dworkin became comfortable again making the sort of high-consequence decisions under uncertainty that are required of a practicing anesthesiologist. As he saw it, his intuition returned. And, though he still experienced some level of doubt about his decisions, he came to realize that there was never a time when his medical decisions had been completely free of doubt: that was an illusion.

In the software operations world, we are often faced with these sorts of potentially high-consequence decisions under uncertainty, especially during incident response. Fortunately for us, the stakes are lower: lives are rarely on the line in the way that they are for doctors, especially when it comes to surgical procedures. But it’s no coincidence that How Complex Systems Fail was also written by an anesthesiologist. As Dr. Richard Cook reminds us in that short paper: all practitioner actions are gambles.

AWS re:Invent talk on their Oct ’25 incident

Last month, I made the following remark on LinkedIn about the incident that AWS experienced back in October.

To Amazon’s credit, there was a deep dive talk on the incident at re:Invent! OK, it wasn’t the keynote, but I’m still thrilled that somebody from AWS gave that talk. Kudos to Amazon leadership for green-lighting a detailed talk on the failure mode, and to Craig Howard in particular for giving this talk.

In my opinion, this talk is the most insightful post-incident public artifact that AWS has ever produced, and I really hope they continue to have these sorts of talks after significant outages in the future. It’s a long talk, but it’s worth it. In particular, it goes into more detail about the failure mode than the original write-up.

Tech that improves reliability bit them in this case

This is yet another example of a reliable system that fails through unexpected behavior of a subsystem whose primary purpose was to improve reliability. In particular, this incident involved the unexpected interaction of the following mechanisms, all of which are there to improve reliability.

  • Multiple enactor instances – to protect against individual enactor instances failing
  • Locking mechanism – to make it easier for engineers to reason about the system behavior
  • Cleanup mechanism – to protect against saturating Route 53 by using up all of the records
  • Transactional mechanism – to protect against the system getting into a bad state after a partial failure (this is “all succeeds or none succeeds”)
  • Rollback mechanism – to be able to recover quickly if a bad plan is deployed

These all sound like good design decisions to me! But in this case, they contributed to an incident, because of an unanticipated interaction with a race condition. Note that many of these are anticipating specific types of failures, but we can never imagine all types of failures, and the ones that we couldn’t imagine are the ones that bite us.

Things that made the incident hard

This talk not only discusses the failure itself, but also the incident response, and what made the incident response more difficult. This was my favorite part of the talk, and it’s the first time I can remember anybody from Amazon talking about the details of incident response like this.

Some of the issues that Howard brought up:

  • They used UUIDs as identifiers for plans, which were difficult for the human operators to work with compared to more human-readable identifiers
  • There were so many alerts firing that it took them fifteen minutes to look through all of them and find the one that told them what the underlying issue was
  • The logs that were output did not make it easy to identify the sequence of events that led to the incident

He noted that this illustrates how the “let’s add an alert” approach to dealing with previous incidents can actually hurt you, and that you should think about what will happen in a large future incident, rather than simply reacting to the last one.

Formal modeling and drift

This incident was triggered by a race condition, and race conditions are generally very difficult to identify in development without formal modeling. They had not formally modeled this aspect of DynamoDB beforehand. When they did build a formal model (using TLA+) after the incident, they discovered that the original design didn’t have this race condition, but later incremental changes to the system did introduce it. This means that even if they had formally modeled the system at design time, they wouldn’t have caught the race condition, because it didn’t exist yet.
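The talk has the real details of the DynamoDB race; the sketch below is just a generic Python illustration of how an innocuous-looking incremental change can introduce a check-then-act race that a model of the original design would never reveal. All of the names here are made up.

```python
import threading

# Shared state: a toy table of plans, keyed by plan id.
records = {"plan-7": "stale"}
lock = threading.Lock()

def cleanup_original(plan_id):
    # Original design: the status check and the delete happen under the
    # same lock, so no other worker can change the status in between.
    with lock:
        if records.get(plan_id) == "stale":
            del records[plan_id]

def cleanup_after_refactor(plan_id):
    # Later incremental change: the check was moved outside the lock
    # (say, to avoid holding the lock during a slow lookup). Now another
    # worker can mark the plan active between the check and the delete,
    # and we delete a record that is still in use.
    if records.get(plan_id) == "stale":
        with lock:
            records.pop(plan_id, None)

def reactivate(plan_id):
    # A second worker that can run concurrently with either cleanup.
    with lock:
        records[plan_id] = "active"

# The losing interleaving only happens if reactivate() runs in the tiny
# window between the check and the delete, which is why ordinary testing
# almost never catches it.
```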

Interestingly, they were able to use AI (Amazon Q, of course) to check correspondence between the model and the code. This gives me some hope that AI might make it a lot easier to keep models and implementation in sync over time, which would increase the value of maintaining these models.

Fumbling towards resilience engineering

Amazon is, well, not well known for embracing resilience engineering concepts.

Listening to this talk, there were elements of it that gestured in the direction of resilience engineering, which is why I enjoyed it so much. I already wrote about how Howard called out elements that made the incident harder. He also talked about how post-incident analysis can take significant time, and how it’s a very different type of work than the heat-of-the-moment diagnostic work. In addition, there was some good discussion in the talk about tradeoffs. For example, he talked about caching tradeoffs in the context of negative DNS caching and how that behavior exacerbated this particular incident. He also spoke about how there are broader lessons that others can learn from this incident, even though you will never experience the specific race condition that they did. These are the kinds of topics that the resilience in software community has been going on about for years now. Hopefully, Amazon will get there.

And while I was happy that this talk spent time on the work of incident response, I wish it had gone farther. Despite the recognition earlier in the talk about how incident response was made more difficult by technical decisions, in the lessons learned section at the end, there was no discussion about “how do we design our system to make it easier for responders to diagnose and mitigate when the next big incident happens?”.

Finally, I still grit my teeth whenever I hear the Amazonian term for their post-incident review process: Correction of Error.

Brief thoughts on the recent Cloudflare outage

I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their public writeup and capture my thoughts until now. As always, I recommend you read through the writeup first before you read my take.

All quotes are from the writeup unless indicated otherwise.

Hello saturation my old friend

The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

One thing I hope readers take away from this blog post is the complex systems failure mode pattern that resilience engineering researchers call saturation. Every complex system out there has limits, no matter how robust that system is. And the systems we deal with have many, many different kinds of limits, some of which you might only learn about once you’ve breached that limit. How well a system is able to perform as it approaches one of its limits is what resilience engineering is all about.

Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.

In this particular case, the limit was set explicitly.

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

As sparse as the panic message is, it does explicitly tell you that the problematic call site was an unwrap call. And this is one of the reasons I’m a fan of explicit limits over implicit limits: you tend to get better error messages than when breaching an implicit limit (e.g., of your language runtime, the operating system, the hardware).
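The code in Cloudflare’s writeup is Rust, and I don’t know what their handling logic looks like beyond the quoted snippet, so the Python sketch below is not their code. It’s just an illustration of the general principle: when an explicitly checked limit is breached and handled deliberately, the resulting error can name the limit, the actual value, and the offending input. The 200-feature limit comes from the quoted passage above; everything else is invented.

```python
MAX_FEATURES = 200  # explicit limit, as described in the quoted passage above

class FeatureLimitExceeded(Exception):
    pass

def load_feature_config(features, source_file):
    """Refuse an oversized feature file with an error that names the limit."""
    if len(features) > MAX_FEATURES:
        # The breach of an explicit limit can be reported with the limit,
        # the observed size, and the source of the bad input, which is far
        # more useful to a responder than a bare panic message.
        raise FeatureLimitExceeded(
            f"{source_file}: {len(features)} features exceeds limit of {MAX_FEATURES}"
        )
    return list(features)
```

The explicit check doesn’t prevent the failure by itself, but it determines how legible the failure is to the people who have to diagnose it at 3am.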

A subsystem designed to protect surprisingly inflicts harm

Identify and mitigate automated traffic to protect your domain from bad bots. – Cloudflare Docs

The problematic behavior was in the Cloudflare Bot Management system. Specifically, it was in the bot scoring functionality, which estimates the likelihood that a request came from a bot rather than a human.

This is a system that is designed to help protect their customers from malicious bots, and yet it ended up hurting their customers in this case rather than helping them.

As I’ve mentioned previously, once your system achieves a certain level of reliability, it’s the protective subsystems that end up being the things that bite you! These subsystems are a net positive: they help much more than they hurt. But they also add complexity, and complexity introduces new, confusing failure modes into the system.

The Cloudflare case is a more interesting one than the typical instances of this behavior I’ve seen, because Cloudflare’s whole business model is to offer different kinds of protection, as products for their customers. It’s protection-as-a-service, not an internal system for self-protection. But even though their customers are purchasing this from a vendor rather than building it in-house, it’s still an auxiliary system intended to improve reliability and security.

Confusion in the moment

What impressed me the most about this writeup is that they documented some aspects of what it was like responding to this incident: what they were seeing, and how they tried to make sense of it.

In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:

Man, if I had a nickel for every time I saw someone Slack “Is it DDOS?” in response to a surprising surge of errors returned by the system, I could probably retire at this point.

The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.

We humans are excellent at recognizing patterns based on our experience, and that generally serves us well during incidents. Someone who is really good at operations can frequently diagnose the problem very quickly just by, say, the shape of a particular graph on a dashboard, or by seeing a specific symptom and recalling similar failures that happened recently.

However, sometimes we encounter a failure mode that we haven’t seen before, which means that we don’t recognize the signals. Or we might have seen a cluster of problems recently that followed a certain pattern, and assume that the latest one looks like the last one. And these are the hard ones.

This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack. 

This incident was one of those hard ones: the symptoms were confusing. The “problem went away, then came back, then went away again, then came back again” type of unstable incident behavior is generally much harder to diagnose than one where the symptoms are stable.

Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page.

Here they got bit by a co-incident, an unrelated failure of their status page that led them to believe (reasonably!) that the problem must have been external.

I’m still curious as to what happened with their status page. The error message they were getting mentions CloudFront, so I assume they were hosting their status page on AWS. But their writeup doesn’t go into any additional detail on what the status page failure mode was.

But the general takeaway here is that even the most experienced operators are going to take longer to deal with a complex, novel failure mode, precisely because it is complex and novel! As the resilience engineering folks say, prepare to be surprised! (Because I promise, it’s going to happen).

A plea: assume local rationality

The writeup included a screenshot of the code that had an unhandled error. Unfortunately, there’s nothing in the writeup that tells us what the programmer was thinking when they wrote that code.

In the absence of any additional information, a natural human reaction is to just assume that the programmer was sloppy. But if you want to actually understand how these sorts of incidents happen, you have to fight this reaction.

People always make decisions that make sense to them in the moment, based on what they know and what constraints they are operating under. After all, if that wasn’t true, then they wouldn’t have made that decision. The only way we can actually understand the conditions that enable incidents is to try as hard as we can to put ourselves into the shoes of the person who made that call, to understand what their frame of mind was in the moment.

If we don’t do that, we risk the problem of distancing through differencing. We say, “oh, those devs were bozos, I would never have made that kind of mistake”. This is a great way to limit how much you can learn from an incident.

Detailed public writeups as evidence of good engineering

The writeup produced by Cloudflare (signed by the CEO, no less!) was impressively detailed. It even includes a screenshot of a snippet of code that contributed to the incident! I can’t recall ever reading another public writeup with that level of detail.

Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare using the details they provided as the basis for their criticism.

Now, I think it would advance our industry if people held the opposite view: the more details that are provided in an incident writeup, the higher esteem we should hold that organization in. I respect Cloudflare as an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there; I want to see us hold other companies up to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.

Quick thoughts on the recent AWS outage

AWS recently posted a public write-up of the us-east-1 incident that hit them this past Monday. Here are a couple of quick thoughts on it.

Reliability → Automation → Complexity → New failure modes

Our industry addresses reliability problems by adding automation so that the system can handle faults automatically. But here’s the thing: adding this sort of automation increases the complexity in the system. This increase in complexity due to more sophisticated automation brings two costs along with it. One cost is that the behavior of the system becomes more difficult to reason about. This is the “what is it currently doing, and why is it doing that?” problem that we operators face. The second cost of the increased complexity is that, while this automation eliminates a known class of failure modes, it simultaneously introduces a new class of failure modes. These new failure modes occur much less frequently than the class of failure modes that were eliminated, but when they do occur, they are potentially much more severe.

According to Amazon’s write-up, the triggering event was the unintentional deletion of DNS records related to the DynamoDB service due to a race condition. Even though DNS records were fully restored by 2:25 AM PDT, it wasn’t until 3:01 PM, over twelve and a half hours later, that Amazon declared that all AWS services had been fully restored.

There were multiple issues that complicated the restoration of different AWS services, but the one I want to call out here involved the Network Load Balancer (NLB) service. Delays in the propagation of network state information led to false health check failures: there were EC2 instances that were healthy, but that the NLB categorized as unhealthy because of the network state issue. From the report:

During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated. This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. This resulted in health checks alternating between failing and healthy. This caused NLB nodes and backend targets to be removed from DNS, only to be returned to service when the next health check succeeded.

This pathological health check behavior led to availability zone DNS failovers, which reduced capacity and led to connection errors.

The alternating health check results increased the load on the health check subsystem, causing it to degrade, resulting in delays in health checks and triggering automatic AZ DNS failover to occur. For multi-AZ load balancers, this resulted in capacity being taken out of service. In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.

Health checks are a classic example of an automation system that is designed to improve reliability. It’s not uncommon for an instance to go unhealthy for some reason, and being able to automatically detect when that happens and take the instance out of the load balancer means that your system can automatically handle failures in individual instances. But, as we see in this case, the presence of this reliability-improving automation made a particular problem (delays in network state propagation) even worse.

As a result of this incident, Amazon is going to change the behavior of the NLB logic in the case of health check failures.

For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover.

Note that this is yet another increase in automation complexity with the goal of improving reliability! That doesn’t mean that this is a bad corrective action, or that health checks are bad. Instead, my point here is that adding automation complexity to improve reliability always involves a trade-off. It’s very easy to forget about that trade-off if you focus only on the existing reliability problem you’re trying to tackle, and don’t even consider what new reliability problems you are introducing. Even if those new problems are rare, they can be extremely painful, as AWS can attest.
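AWS hasn’t published what their velocity control will look like, so here’s a deliberately generic Python sketch of the idea, with an invented 20% threshold, just to make the trade-off concrete: it bounds how much capacity flapping health checks can remove, at the cost of yet more logic whose behavior responders will someday have to reason about.

```python
class FailoverVelocityControl:
    """Limit how much capacity health-check failures can remove at once.

    A generic sketch of a velocity control, not AWS's actual NLB
    implementation; the 20% threshold is an invented example.
    """

    def __init__(self, total_capacity, max_removed_fraction=0.2):
        self.total_capacity = total_capacity
        self.max_removed_fraction = max_removed_fraction
        self.removed = set()

    def request_removal(self, node_id):
        """Return True if the unhealthy node may be taken out of service."""
        proposed = len(self.removed) + 1
        if proposed / self.total_capacity > self.max_removed_fraction:
            # Refuse to remove any more capacity; leave the node in service
            # and let humans (or slower automation) decide what to do next.
            return False
        self.removed.add(node_id)
        return True

    def restore(self, node_id):
        self.removed.discard(node_id)

# Example: a 10-node load balancer stops removing nodes after the first 2.
control = FailoverVelocityControl(total_capacity=10)
print([control.request_removal(f"node-{i}") for i in range(4)])
# [True, True, False, False]
```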

I’ve written previously about failures due to reliability-improving automation. The other examples from my linked post are also from AWS incidents, but this phenomenon is in no way specific to AWS.

Surprise should not be surprising

Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with [the DropletWorkflow Manager] without causing further issues.

The Amazon engineers didn’t have a runbook to handle this failure scenario, which meant that they had to improvise a recovery strategy during incident response. This is a recurring theme in large-scale incidents: they involve failures that nobody had previously anticipated. The only thing we can really predict about future high-severity incidents is that they are going to surprise us. We are going to keep encountering failure modes we never anticipated, over and over again.

It’s tempting to focus your reliability engineering resources on reducing the risk of known failure modes. But if you only prepare for the failure scenarios that you can think of, then you aren’t putting yourself in a better position to deal with the inevitable situation that you never imagined would ever happen. And the fact that you’re investing in reliability-improving-but-complexity-increasing automation means that you are planting the seeds of those future surprising failure modes.

This means that if you want to improve reliability, you need to invest in both the complexity-increasing reliability automation (robustness), and also in the capacity to be able to better deal with future surprises (resilience). The resilience engineering researcher David Woods uses the term net adaptive value to describe the ability of a system to deal with both predicted failure modes, and to adapt to effectively unpredicted failure modes.

Part of investing in resilience means building human-controllable leverage points so that engineers have a broad range of mitigation actions available to them during future incidents. That could mean having additional capacity on hand that you can throw at the problem, as well as having built in various knobs and switches. As an example from this AWS incident, part of the engineers’ response was to manually disable the health check behavior.

At 9:36 AM, engineers disabled automatic health check failovers for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors to affected load balancers.

But having these sorts of knobs available isn’t enough. You need your responders to have the operational expertise necessary to know when to use them. More generally, if you want to get better at dealing with unforeseen failure modes, you need to invest in improving operational expertise, so that your incident responders are best positioned to make sense of the system behavior when faced with a completely novel situation.

The AWS write-up focuses on the robustness improvements, the work they are going to do to be better prepared to prevent a similar failure mode from happening in the future. But I can confidently predict that the next large-scale AWS outage is going to look very different from this one (although it will probably involve us-east-1). It’s not clear to me from the write-up that Amazon has learned the lesson of how important it is to prepare to be surprised.

Caveat promptor

In the wake of a major incident, you’ll occasionally hear a leader admonish the engineering organization that we need to be more careful in order to prevent such incidents from happening in the future. Ultimately, these sorts of admonishments don’t help improve reliability, because they miss an essential truth about the nature of work in organizations.

One of the big ideas from resilience engineering is the efficiency-thoroughness trade-off, also known as the ETTO Principle. The ETTO principle was first articulated by Erik Hollnagel, one of the founders of the field. The idea is that there’s a fundamental trade-off between how quickly we can complete tasks, and how thorough we can be when working on each individual task. Let’s consider the work of doing software development using AI agents through the lens of the ETTO principle.

Coding agents like Claude Code and OpenAI’s Codex are capable of automatically generating significant amounts of code. Honestly, it’s astonishing what these tools are capable of today. But like all LLMs, while they will always generate plausible-looking output, they do not always generate correct output. This means that a human needs to check an AI agent’s work to ensure that it’s generating code that’s up to snuff: a human has to review the code generated by the agent.

Screenshot of asking Claude about coding mistakes. Note the permanent warning at the bottom.

As any human software engineer will tell you, reviewing code is hard. It takes effort to understand code that you didn’t write. And larger changes are harder to review, which means that the more work that the agent does, the more work the human in the loop has to do to verify it.

If the code compiles and runs and all tests pass, how much time should the human spend on reviewing it? The ETTO principle tells us there’s a trade-off here: the incentives push us software engineers towards completing our development tasks more quickly, which is why we’re all adopting AI in the first place. After all, if it ends up taking just as long to review the AI-generated code as it would have for the human reviewer to write it from scratch, then that defeats the purpose of automating the development task to begin with.

Maybe at first we’re skeptical and we spend more time reviewing the agent’s code. But, as we get better at working with the agents, and as the AI models themselves get better over time, we’ll figure out where the trouble spots of AI-generated code tend to pop up, and we’ll focus our code review effort accordingly. In essence, we’re riding the ETTO trade-off curve by figuring out how much review effort we should be putting in and where that effort should go.

Eventually, though, a problem with AI-generated code will slip through this human review process and will contribute to an incident. In the wake of this incident, the software engineers will be reminded that AI agents can make mistakes, and that they need to carefully review the generated code. But, as always, such reminders will do nothing to improve reliability. Because, while AI agents change the way that software developers work, they don’t eliminate the efficiency-thoroughness trade-off.

Two thought experiments

Here’s a thought experiment that John Allspaw related to me, in paraphrased form (John tells me that he will eventually capture this in a blog post of his own, at which time I’ll put a proper link).

Consider a small-ish tech company that has four engineering teams (A,B,C,D), where an engineer from Team A was involved in an incident (In John’s telling, the incident involves the Norway problem). In the wake of this incident, a post-incident write-up is completed, and the write-up does a good job of describing what happened. Next, imagine that the write-up is made available to teams A,B, and C, but not to team D. Nobody on team D is allowed to read the write-up, and nobody from the other teams is permitted to speak to team D about the details of the incident. The question is: are the members of team D at a disadvantage compared to the other teams?

The point of this scenario is to convey the intuition that, even though team D wasn’t involved in the incident, its members can still learn something from its details that makes them better engineers.

Switching gears for a moment, let’s talk about the new tools that are emerging under the label AI SRE. We’re now starting to see more tools that leverage LLMs to try to automate incident diagnosis and remediation, such as incident.io’s AI SRE product, Datadog’s Bits AI SRE, Resolve.ai (tagline: Your always-on AI SRE), and Cleric (tagline: AI SRE teammate). These tools work by reading in signals from your organization such as alerts, metrics, Slack messages, and source code repositories.

To effectively diagnose what’s happening in your system, you don’t just want to know what’s happening right now; you also want access to historical data, since maybe there was a similar problem that happened, say, a year ago. While an LLM will have been trained with a lot of general knowledge about software systems, it won’t have been trained on the specific details of your system, and your system will fail in system-specific ways, which means that (I assume!) these AI SRE systems will work better if they have access to historical data about your system.

Here’s a second thought experiment, this one my own: Imagine that you’ve adopted one of these AI SRE tools, but the only historical data about the system that you can feed the tool is the collection of your company’s post-incident write-ups. What kinds of details would be useful to an AI SRE tool in helping to troubleshoot future incidents? Perhaps we should encourage people to write their incident reports as if they will be consumed by an AI SRE tool that will use them to learn as much as possible about the work involved in diagnosing and remediating incidents in your company. I bet the humans who read them would learn more that way too.

Fixation: the ever-present risk during incident handling

Recent U.S. headlines have been dominated by school shootings. The bulk of the stories have been about the assassination of Charlie Kirk on the campus of Utah Valley University and the corresponding political fallout. On the same day, there was also a shooting at Evergreen High School in Colorado, where a student shot and injured two of his peers. This post isn’t about those school shootings, but rather, one that happened three years ago. On May 24, 2022, at Robb Elementary School in Uvalde, Texas, 19 students and 2 teachers were killed by a shooter who managed to make his way onto the campus.

Law enforcement were excoriated for how they responded to the Uvalde shooting incident: several were fired, and two were indicted on charges of child endangerment. On January 18, 2024, the Department of Justice released the report on their investigation of the shooting: Critical Incident Review: Active Shooter at Robb Elementary School. According to the report, there were multiple things that went wrong during the incident. Most significantly, the police originally believed that the shooter had barricaded himself in an empty classroom, when in fact the shooter was in a classroom with students. There were also communication issues that resulted in a common ground breakdown during the response. But what I want to talk about in this post is the keys.

The search for the keys

During the response to the Uvalde shooting, there was significant effort by the police on the scene to locate master keys to unlock rooms 111/112 (numbered p14, PDF p48, emphasis mine).

Phase III of the timeline begins at 12:22 p.m., immediately following four shots fired inside classrooms 111 and 112, and continues through the entry and ensuing gunfight at 12:49 p.m. During this time frame, officers on the north side of the hallway approach the classroom doors and stop short, presuming the doors are locked and that master keys are necessary.

The search for keys started before this, because room 109 was locked, and had children in it, and the police wanted to evacuate those children (numbered p 13, PDF p48):

By approximately 12:09 p.m., all classrooms in the hallways have been evacuated and/or cleared except rooms 111/112, where the subject is, and room 109. Room 109 is found to be locked and believed to have children inside.

If you look at the Minute-by-Minute timeline section of the report (numbered p17, PDF p50) you’ll see the text “Events: Search for Keys” appear starting at 12:12 PM, all of the way until 12:45 PM.

The irony here is that the door to room 111/112 may have never been locked to begin with, as suggested by the following quote (numbered p15, PDF p48), emphasis mine:

At around 12:48 p.m., the entry team enters the room. Though the entry team puts the key in the door, turns the key, and opens it, pulling the door toward them, the [Critical Incident Review] Team concludes that the door is likely already unlocked, as the shooter gained entry through the door and it is unlikely that he locked it thereafter.

Ultimately, the report explicitly calls out how the search for the keys led to delays in response (numbered p xxviii, PDF p30):

Law enforcement arriving on scene searched for keys to open interior doors for more than 40 minutes. This was partly the cause of the significant delay in entering to eliminate the threat and stop the killing and dying inside classrooms 111 and 112. (Observation 10)

Fixation

In hindsight, we can see that the responders got something very important wrong in the moment: they were searching for keys for a door that probably wasn’t even locked. In this specific case, there appears to have been some communication-related confusion about the status of the door, as shown by the following (numbered p53, PDF p86):

The BORTAC [U.S. Border Patrol Tactical Unit] commander is on the phone, while simultaneously asking officers in the hallway about the status of the door to classrooms 111/112. UPD Sgt. 2 responds that they do not know if the door is locked. The BORTAC commander seems to hear that the door is locked, as they say on the phone, “They’re saying the door is locked.” UPD Sgt. 2 repeats that they do not know the status of the door.

More generally, this sort of problem is always going to happen during incidents: we are forever going to come to conclusions during an incident about what’s happening that turn out to be wrong in hindsight. We simply can’t avoid that, no matter how hard we try.

The problem I want to focus on here is not the unavoidable getting it wrong in the moment, but the actually-preventable problem of fixation. We “fixate” when we focus solely on one specific aspect of the situation. The problem here is not searching for keys, but searching for keys to the exclusion of other activities.

During complex incidents, the underlying problem is frequently not well understood, and so the success of a proposed mitigation strategy is almost never guaranteed. Maybe a rollback will fix things, but maybe it won’t! The way to overcome this problem is to pursue multiple strategies in parallel. One person or group focuses on rolling back a deployment that aligns in time, another looks for other types of changes that occurred around the same time, yet another investigates the logs, another looks into scaling up the amount of memory, someone else investigates traffic pattern changes, and so on. By pursuing multiple diagnostic and mitigation strategies in parallel, we reduce the risk of delaying the mitigation of the incident by blocking on the investigation of one avenue that may turn out to not be fruitful.

Doing this well requires diversity of perspectives and effective coordination. You’re more likely to come up with a broader set of options to pursue if your responders have a broader range of experiences. And the more avenues that you pursue, the more the coordination overhead increases, as you now need to keep the responders up to date about what’s going on in the different threads without overwhelming them with details.

Fixation is a pernicious risk because we’re more likely to fixate when we’re under stress. Since incidents are stressful by nature, they are effectively incubators of fixation. In the heat of the moment, it’s hard to take a breath, step back for a moment, understand what’s been tried already, and calmly ask about what the different possible options are. But the alternative is to tumble down the rabbit hole, searching for keys to a door that is already unlocked.

Nothing fails like a history of success

The Axiom of Experience: the future will be like the past, because, in the past, the future was like the past. – Gerald M. Weinberg, An Introduction to General Systems Thinking

Last Friday, the San Francisco Bay Area Rapid Transit system (known as BART) experienced a multiple-hour outage. Later that day, the BART Deputy General Manager released a memo about the outage with some technical details. The memo is brief, but I was honestly surprised to see this amount of detail in a public document that was released so quickly after an incident, especially from a public agency. What I want to focus on in this post is this line (emphasis mine):

Specifically, network engineers were performing a cutover to a new network switch at Montgomery St. Station… The team had already successfully performed eight similar cutovers earlier this year.

This reminded me of something I read in the Buildkite writeup from an incident that happened back in January of this year (emphasis mine):

Given the confidence gained by initial load testing and the migrations already performed over the past year, we wanted to allow customers to take advantage of their seasonal low periods to perform shard migrations, as a win-win. This caused us to discount the risk of performing migrations during a seasonal low period and what impacts might emerge when regular peak traffic returned.

It also reminded me about the 2022 Rogers Telecommunications outage in Canada (emphasis mine, [redacted] comments in the original):

Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.

Whenever we make any sort of operational change, we have a mental model of the risk associated with the change. We view novel changes (I’ve never done something like this before!) as riskier than changes we’ve performed successfully multiple times in the past (I’ve done this plenty of times). I don’t think this sort of thinking is a fallacy: rather, it’s a heuristic, and it’s generally a pretty effective one! But, like all heuristics, it isn’t perfect. As shown in the examples above, the application of this heuristic can result in a miscalibrated mental model of the risk associated with a change.
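The Rogers report doesn’t describe the algorithm beyond saying that prior success feeds into the risk value, so here is a deliberately toy Python sketch of how that kind of heuristic drifts; every number, threshold, and name in it is invented for illustration.

```python
def assessed_risk(base_risk, prior_successes, discount_per_success=0.15):
    """Toy risk score that discounts risk based on prior successful changes.

    base_risk: initial score in [0, 1], e.g. 0.9 for a "High" risk change.
    Each prior success shrinks the assessed risk multiplicatively. The
    actual risk of the change, of course, is unaffected by this arithmetic.
    """
    return base_risk * (1 - discount_per_success) ** prior_successes

def category(score):
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Medium"
    return "Low"

# A change that starts out "High" drifts to "Low" after a handful of
# successful repetitions, even though nothing about the change itself
# has gotten any safer.
for n in range(0, 9, 2):
    score = assessed_risk(0.9, n)
    print(n, round(score, 2), category(score))
```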

So, what’s the broader lesson? In practice, our risk models (implicit or otherwise) are always miscalibrated: a history of past successes is just one of multiple avenues that can lead us astray. Trying to achieve a perfect risk model is like trying to deploy software that is guaranteed to have zero bugs: it’s never going to happen. Instead, we need to accept the reality that, like our code, our models of risk will always have defects that are hidden from us until it’s too late. So we’d better get damned good at recovery.