On work processes and outcomes

Here’s a stylized model of work processes and outcomes. I’m going to call it “Model I”.

If you do work the right way, that is, follow the proper processes, then good things will happen. And, when we don’t, bad things happen. I work in the software world, so by “bad outcome” a mean an incident, and by “doing the right thing”, the work processes typically refer to software validation activities, such as reviewing pull requests, writing unit tests, manually testing in a staging environment. But it also includes work like adding checks in the code for unexpected inputs, ensuring you have an alert defined to catch problems, having someone else watching over your shoulder when you’re making a risky operational change, not deploying your production changes on a Friday, and so on. Do this stuff, and bad things won’t happen. Don’t do this stuff, and bad things will.

If you push someone who believes in this model, you can get them to concede that sometimes nothing bad happens even though someone didn’t do everything can quite right, the amended model looks like this:

Inevitably, an incident happens. At that point, we focus the post-incident efforts on identifying what went wrong with the work. What was the thing that was done wrong? Sometimes, this is individuals who weren’t following the process (deployed on a Friday afternoon!). Other times, the outcome of the incident investigation is a change in our work processes, because the incident has revealed a gap between “doing the right thing” and “our standard work processes”, so we adjust our work processes to close the gap. For example, maybe we now add an additional level of review and approval for certain types of changes.

Here’s an alternative stylized model of work processes and outcomes. I’m going to call it “Model II”.

Like our first model, this second model contains two categories of work processes. But the categories here are different. They are:

What people are officially supposed to
What people actually do

The first categorization is an idealized view of how the organization thinks that people should do their work. But people don’t actually do their work their way. The second category captures what the real work actually is.

This second model of work and outcomes has been embraced by a number of safety researchers. I deliberately called my models as Model I and Model II as a reference to Safety-I and Safety-II. Safety-II is a concept developed by the resilience engineering researcher Dr. Erik Hollnagel. The human factor experts Dr. Todd Conklin and Bob Edwards describe this alternate model using a black-line/blue-line diagram. Dr. Steven Shorrock refers to the first category as work-as-prescribed, and the second category as work-as-done. In our stylized model, all outcomes come from this second category of work, because it’s the only one that captures the actual work that leads to any of the outcomes. (In Shorrock’s more accurate model, the two categories of work overlap, but bear with me here).

This model makes some very different assumptions about the nature of how incidents happen! In particular, it leads to very different sorts of questions.

The first model is more popular because it’s more intuitive: when bad things happen, it’s because we did things the wrong way, and that’s when we look back in hindsight to identify what those wrong ways were. The second model requires us to think more about the more common case when incidents don’t happen. After all, we measure our availability in 9s, which means the overwhelming majority of the time, bad outcomes aren’t happening. Hence, Hollnagel encourages us to spend more time examining the common case of things going right.

Because our second model assumes that what people actually do usually leads to good outcomes, it will lead to different sorts of questions after an incident, such as:

What does normal work look like?
How is it that this normal work typically leads to successful outcomes?
What was different in this case (the incident) compared to typical cases?

Note that this second model doesn’t imply that we should always just keep doing things the same way we always do. But it does imply that we should be humble in enforcing changes to the way work is done, because the way that work is done today actually leads to good outcomes most of the time. If you don’t understand how things normally work well, you won’t see how your intervention might make things worse. Just because your last incident was triggered by a Friday deploy doesn’t mean that banning Friday deploys will lead to better outcomes. You might actually end up making things worse.

Good models protect us from bad models

One of the criticisms leveled at resilience engineering is that the insights that the field generates aren’t actionable: “OK, let’s say you’re right, that complex systems are never perfectly understood, they’re always changing, they generate unexpected interactions, and that these properties explain why incidents happen. That doesn’t tell me what I should do about it!”

And it’s true; I can talk generally about the value of improving expertise so that we’re better able to handle incidents. But I can’t take the model of incidents that I’ve built based on my knowledge of resilience engineering and turn that into a specific software project that you can build and deploy that will eliminate a class of incidents.

But even if these insights aren’t actionable, that they don’t tell us about a single thing we can do or build to help improve reliability, my claim here is that these insights still have value. That’s because we as humans need models to make sense of the world, and if we don’t use good-but-not-actionable models, we can end up with actionable-but-not-good models. Or, as the statistics professor Andrew Gelman put it in his post The social sciences are useless. So why do we study them? Here’s a good reason back in 2021:

The baseball analyst Bill James once said that the alternative to good statistics is not no statistics, it’s bad statistics. Similarly, the alternative to good social science is not no social science, it’s bad social science.

The reason we do social science is because bad social science is being promulgated 24/7, all year long, all over the world. And bad social science can do damage.

Because we humans need models to make sense of the world, incidents models are inevitable. A good-but-not-actionable incident model will feel unsatisfying to people who are looking to leverage these models to take clear action. And it’s all too easy to build not-good-but-actionable models of how incidents happen. Just pick something that you can measure and that you theoretically have control over. The most common example of such a model is the one I’ll call “incidents happen because people don’t follow the processes that they are supposed to.” It’s easy to call out process violations in incident writeups, and it’s easy to define interventions to more strictly enforce processes, such as through automation.

In other words, good-but-not-actionable models protect us from the actionable-but-not-good models. They serve as a kind of vaccine, inoculating us from the neat, plausible, and wrong solutions that H.L. Mencken warned us about.

Model error

One of the topics I wrote about in my last post was about using formal methods to build a model of how our software behaves. In this post, I want to explore how the software we write itself contains models: models of how the world behaves.

The most obvious area is in our database schemas. These schemas enable us to digitally encode information about some aspect of the world that our software cares about. Heck, we even used to refer to this encoding of information into schemas as data models. Relational modeling is extremely flexible: in principle, we can represent just about any aspect of the world into it, if we put enough effort in. The challenge is that the world is messy, and this messiness significantly increases the effort required to build more complete models. Because we often don’t even recognize the degree of messiness the real world contains, we build over-simplified models that are too neat. This is how we end up with issues like the ones captured in Patrick McKenzie’s essay Falsehoods Programmers Believe About Names. There’s a whole book-length meditation on the messiness of real data and how it poses challenges for database modeling: Data and Reality by William Kent, which is highly recommended by Hillel Wayne, in his post Why You Should Read “Data and Reality”.

The problem of missing the messiness of the real world is not at all unique to software engineers. For example, see Christopher Alexander’s A City Is Not a Tree for a critique of urban planners’ overly simplified view of human interactions in urban environments. For a more expansive lament, check out James C. Scott’s excellent book Seeing Like a State. But, since I’m a software engineer and not an urban planner or a civil servant, I’m going to stick to the software side of things here.

Models in the back, models in the front

In particularly, my own software background is in the back-end/platform/infrastructure space. In this space, the software we write frequently implement control systems. It’s no coincidence that both cybernetics and kubernetes derived their names from the same ancient Greek word: κυβερνήτης. Every control system must contain within it a model of the system that it controls. Or, as Roger C. Conant and W. Ross Ashby put it, every good regulator of a system must be a model of that system.

Things get even more complex on the front-end side of the software world. This world must bridge the software world with the human world. In the context of Richard Cook’s framing in Above the Line, Below the Line, the front-end is the line that bridges the two world. As a consequence, the front-end’s responsibility is to expose a model of the software’s internal state to the user. This means that the front-end also has an implicit model of the users themselves. In the paper Cognitive Systems Engineering: New wine in new bottles, Erik Hollnagel and David Woods referred to this model as the image of the operator.

The dangers of the wrongness of models

There’s an oft-repeated quote by the statistician George E.P. Box: “All models are wrong but some are useful”. It’s a true statement, but one that focuses only on the upside of wrong models, the fact that some of them are useful. There’s also a downside to the fact that all models are wrong: the wrongness of these models can have drastic consequences.

And, while It’s a true statement, but what it fails to hint at how bad the consequences can be when a model is wrong. One of my favorite examples involves the 2008 financial crisis, as detailed by the journalist Felix Salmon’s 2009 Wired Magazine article Recipe for Disaster: The Formula that Killed Wall Street. The article described how Wall Street quants used a mathematical model known as the Gaussian copula function to estimate risk. It was a useful model that ultimately led to disaster.

Here’s A ripped-from-the-headlines example of image of the operator model error, how the U.S. national security advisor Mike Waltz accidentally saved the phone number of Jeffrey Goldberg, editor of the Atlantic magazine, to the contact information of White House spokesman Brian Hughes. The source is the recent Guardian story How the Atlantic’s Jeffrey Goldberg got added to the White House Signal group chat:

According to three people briefed on the internal investigation, Goldberg had emailed the campaign about a story that criticized Trump for his attitude towards wounded service members. To push back against the story, the campaign enlisted the help of Waltz, their national security surrogate.

Goldberg’s email was forwarded to then Trump spokesperson Brian Hughes, who then copied and pasted the content of the email – including the signature block with Goldberg’s phone number – into a text message that he sent to Waltz, so that he could be briefed on the forthcoming story.

Waltz did not ultimately call Goldberg, the people said, but in an extraordinary twist, inadvertently ended up saving Goldberg’s number in his iPhone – under the contact card for Hughes, now the spokesperson for the national security council.

…

According to the White House, the number was erroneously saved during a “contact suggestion update” by Waltz’s iPhone, which one person described as the function where an iPhone algorithm adds a previously unknown number to an existing contact that it detects may be related.

The software assumed that, when you receive a text from someone with a phone number and email address, that the phone number and email address belong to the sender. This is a model of the user that turned out to be very, very wrong.

Nobody expects model error

Software incidents involve model errors in one way or another, whether it’s an incorrect model of the system being controlled, an incorrect image of the operator, or a combination of the two.

And, yet, despite us all intoning “all models are wrong, some models are useful”, we don’t internalize that our systems our built on top of imperfect models. This is one of the ironies of AI: we are now all aware of the risks associated with model error with LLMs. We’ve even come up with a separate term for it: hallucinations. But traditional software is just as vulnerable to model error as LLMs are, because our software is always built on top of models that are guaranteed to be incomplete.

You’re probably familiar with the term black swan, popularized by the acerbic public intellectual Nassim Nicholas Taleb. While his first book, Fooled by Randomness, was a success, it was the publication of The Black Swan that made Taleb a household name, and introduced the public to the concept of black swans. While the term black swan was novel, the idea it referred to was not. Back in the 1980s, the researcher Zvi Lanir used a different term: fundamental surprise. Here’s an excerpt of a Richard Cook lecture on the 1999 Tokaimura nuclear accident where he talks about this sort of surprise (skip to the 45 minute mark).

And this Tokaimura accident was an impossible accident.

There’s an old joke about the creator of the first English American dictionary, Noah Webster … coming home to his house and finding his wife in bed with another man. And she says to him, as he walks in the door, she says, “You’ve surprised me”. And he says, “Madam, you have astonished me”.

The difference was that she of course knew what was going on, and so she could be surprised by him. But he was astonished. He had never considered this as a possibility.

And the Tokaimura was an astonishment or what some, what Zev Lanir and others have called a fundamental surprise which means a surprise that is fundamental in the sense that until you actually see it, you cannot believe that it is possible. It’s one of those “I can’t believe this has happened”. Not, “Oh, I always knew this was a possibility and I’ve never seen it before” like your first case of malignant hyperthermia, if you’re a an anesthesiologist or something like that. It’s where you see something that you just didn’t believe was possible. Some people would call it the Black Swan.

Black swans, astonishment, fundamental surprise, these are all synonyms for model error.

And these sorts of surprises are going to keep happening to us, because our models are always wrong. The question is: in the wake of the next incident, will we learn to recognize that fundamental surprises will keep happening to us in the future? Or will we simply patch up the exposed problems in our existing models and move on?

Models, models every where, so let’s have a think

If you’re a regular reader of this blog, you’ll have noticed that I tend to write about two topics in particular:

Resilience engineering
Formal methods

I haven’t found many people who share both of these interests.

At one level, this isn’t surprising. Formal methods people tend to have an analytic outlook, and resilience engineering people tend to have a synthetic outlook. You can see the clear distinction between these two perspectives in the transcript of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology. Lamport is clearly on the side of the logic, so much so that he ridicules the very idea of taking a biological perspective on software systems. By contrast, resilience engineering types actively look to biology for inspiration on understanding resilience in complex adaptive systems. A great example of this is the late Richard Cook’s talk on The Resilience of Bone.

And yet, the two fields both have something in common: they both recognize the value of creating explicit models of aspects of systems that are not typically modeled.

You use formal methods to build a model of some aspect of your software system, in order to help you reason about its behavior. A formal model of a software system is a partial one, typically only a very small part of the system. That’s because it takes effort to build and validate these models: the larger the model, the more effort it takes. We typically focus our models on a part of the system that humans aren’t particularly good at reasoning about unaided, such as concurrent or distributed algorithms.

The act of creating and explicit model and observing its behavior with a model checker gives you a new perspective on the system being modeled, because the explicit modeling forces you to think about aspects that you likely wouldn’t have considered. You won’t say “I never imagined X could happen” when building this type of formal model, because it forces you to explicitly think about what would happen in situations that you can gloss over when writing a program in a traditional programming language. While the scope of a formal model is small, you have to exhaustively specify the thing within the scope you’ve defined: there’s no place to hide.

Resilience engineering is also concerned with explicit models, in two different ways. In one way, resilience engineering stresses the inherent limits of models for reasoning about complex systems (c.f., itsonlyamodel.com). Every model is incomplete in potentially dangerous ways, and every incident can be seen through the lens of model error: some model that we had about the behavior of the system turned out to be incorrect in a dangerous way.

But beyond the limits of models, what I find fascinating about resilience engineering is the emphasis on explicitly modeling aspects of the system that are frequently ignored by traditional analytic perspectives. Two kinds of models that come up frequently in resilience engineering are mental models and models of work.

A resilience engineering perspective on an incident will look to make explicit aspects of the practitioners’ mental models, both in the events that led up to that incident, and in the response to the incident. When we ask “How did the decision make sense at the time?“, we’re trying to build a deeper understanding of someone else’s state of mind. We’re explicitly trying to build a descriptive model of how people made decisions, based on what information they had access to, their beliefs about the world, and the constraints that they were under. This is a meta sort of model, a model of a mental model, because we’re trying to reason about how somebody else reasoned about events that occurred in the past.

A resilience engineering perspective on incidents will also try to build an explicit model of how work happens in an organization. You’ll often heard the short-hand phrase work-as-imagined vs work-as-done to get at this modeling, where it’s the work-as-done that is the model that we’re after. The resilience engineering perspective asserts that the documented processes of how work is supposed to happen is not an accurate model of how work actually happens, and that the deviation between the two is generally successful, which is why it persists. From resilience engineering types, you’ll hear questions in incident reviews that try to elicit some more details about how the work really happens.

Like in formal methods, resilience engineering models only get at a small part of the overall system. There’s no way we can build complete models of people’s mental models, or generate complete descriptions of how they do their work. But that’s ok. Because, like the models in formal methods, the goal is not completeness, but insight. Whether we’re building a formal model of a software system, or participating in a post-incident review meeting, we’re trying to get the maximum amount of insight for the modeling effort that we put in.

Resilience: some key ingredients

Brian Marick posted on Mastodon the other day about resilience in the context of governmental efficiency. Reading that inspired me to write about some more general observations about resilience.

Now, people use the term resilience in different ways. I’m using resilience here in the following sense: how well a system is able to cope when it is pushed beyond its limits. Or, to borrow a term from safety researcher David Woods, when the system is pushed outside of its competence envelope. The technical term for this sense of the word resilience is graceful extensibility, which also comes from Woods. This term is a marriage of two other terms: graceful degradation, and software extensibility.

The term graceful degradation refers to the behavior of a system which, when it experiences partial failures, can still provide some functionality, even though it’s at a reduced fidelity. For example, for a web app, this might mean that some particular features are unavailable, or that some percentage of users are not able to access the site. Contrast this with a system that just returns 500 errors for everyone whenever something goes wrong.

We talk about extensible software systems as ones that have been designed to make it easy to add new features in the future that were not originally anticipated. A simple example of software extensibility is the ability for old code to call new code, with dynamic binding being one way to accomplish this.

Now, putting those two concepts together, if a system encounters some sort of shock that it can’t handle, and the system has the ability to extend itself so that it can now handle the shock, and it can make these changes to itself quickly enough that it minimizes the harms resulting from the shock, then we say the system exhibits graceful extensibility. And if it can keep extending itself each time it encounters a novel shock, then we say that the system exhibits sustained adaptability.

The rest of this post is about the preconditions for resilience. I’m going to talk about resilience in the context of dealing with incidents. Note that all of the topics described below come from the resilience engineering literature, although I may not always use the same terminology.

Resources

As Brian Marick observed in his toot:

As we discovered with Covid, efficiency is inversely correlated with resilience.

Here’s a question you can ask anyone who works in the compute infrastructure space: “How hot do you run your servers?” Or, even more meaningfully, “How much headroom do your servers have?”

Running your servers “hotter” means running at a higher CPU utilization. This means that you pack more load on fewer servers, which is more efficient. The problem is that the load is variable, which means that the hotter you run the servers, the more likely your server will get overloaded if there is a spike in utilization. An overloaded server can lead to an incident, and incidents are expensive! Running your servers at maximum utilization is running with zero headroom. We deliberately run our servers with some headroom to be able to handle variation in load.

We also see the idea of spare resources in what we call failover scenarios, where there’s a failure in one resource so we switch to using a different resource, such as failing over a database from primary to secondary, or even failing out of a geographical region.

The idea of spare resources is more general than hardware. It applies to people as well. The equivalent of headroom for humans is what Tom DeMarco refers to as slack. The more loaded humans are, the less well positioned they are to handle spikes in their workload. Stuff falls through the cracks when you’ve got too much load, and some of that stuff contributes to incidents. We can also even keep people in reserve for dealing with shocks, such as when an organization staffs a dedicated incident management team.

A common term that the safety people use for spare resources is capacity. I really like the way Todd Conklin put it on his Pre-Accident Investigation Podcast: “You don’t manage risk. You manage the capacity to absorb risk.” Another way he put it is “Accidents manage you, so what you really manage is the capacity for the organization to fail safely.”

Flexibility

Here’s a rough and ready definition of an incident: the system has gotten itself into a bad state, and it’s not going to return to a good state unless somebody does something about it.

Now, by this definition, for the system to become healthy again something about how the system works has to change. This means we need to change the way we do things. The easier it is to make changes to the system, the easier it will be to resolve the incident.

We can think of two different senses of changing the work of the system: the human side and the the software side.

Humans in a system are constrained by a set of rules that exist to reduce risk. We don’t let people YOLO code from their laptops into production, because of a number of risks that would expose us to. But incidents create scenarios where the risks associated with breaking these rules are lower than the risks associated with prolonging the incident. As a consequence, people in the system need the flexibility to be able to break the standard rules of work during an incident. One way to do this is to grant incident responders autonomy, let them make judgments about when they are able to break the rules that govern normal work, in scenarios where breaking the rule is less risky than following it.

Things look different on the software side, where all of the rules are mechanically enforced. For flexibility in software, we need to build into the software functionality in advance that will let us change the way the system behaves. My friend Aaron Blohowiak uses the term Jefferies tubes from Star Trek to describe features that support making operational changes to a system. These were service crawlways that made it easier for engineers to do work on the ship.

A simple example of this type of operational flexibility is putting in feature flags that can be toggled dynamically in order to change system behavior. At the other extreme is the ability to bring up a REPL on a production system in order to make changes. I’ve seen this multiple times in my career, including watching someone use the rails console command of a Ruby on Rails app to resolve an issue.

The technical term in resilience engineering for systems that possess this type of flexibility is adaptive capacity: the system has built up the ability to be able to dynamically reconfigure itself, to adapt, in order to meet novel challenges. This is where the name Adaptive Capacity Labs comes from.

Expertise

In general, organizations push against flexibility because it brings risk. In the case where I saw someone bring up a Ruby on Rails console, I was simultaneously impressed and terrified: that’s so dangerous!

Because flexibility carries risk, we need to rely on judgment as to whether the risk of leveraging the flexibility outweighs the risk of not using the flexibility to mitigate the incident. Granting people the autonomy to make those judgment calls isn’t enough: the people making the calls need to be able to make good judgment calls. And for that, you need expertise.

The people making these calls are having to make decisions balancing competing risks while under uncertainty and time pressure. In addition, how fluent they are with the tools is a key factor. I would never trust a novice with access to a REPL in production. But an expert? By definition, they know what they’re doing.

Diversity

Incidents in complex systems involve interactions between multiple parts of the system, and there’s no one person in your organization who understands the whole thing. To be able to effectively know what to do during an incident, you need to bring in different people who understand different parts of the system in order to help figure out what happens. You need diversity in your responders, people with different perspective on the problem at hand.

You also want diversity in diagnostic and mitigation strategy. Some people might think about recent changes, others might think about traffic pattern changes, others might dive into the codebase looking for clues, and yet others might look to see if there’s another problem going on right now that seems to be related. In addition, it’s often not obvious what the best course of action is to mitigate an incident. Responders often pursue multiple courses of action in parallel, hoping that at least one of them will bring the system healthy again. A diversity of perspectives can help generate more potential interventions, reducing the time to resolve.

Coordination

Having a group of experts with a diverse set of perspectives by itself isn’t enough to deal with an incident. For a system to be resilient, the people within the system need to be able to coordinate, to work together effectively.

If you’ve ever dealt with a complex incident, you know how challenging coordination can be. Things get even hairier in our distributed world. Whether you’re physically located with all of the responders, you’re on a Zoom call (a bridge, as we still say), you’re messaging over Slack, or some hybrid combination of all three, each type of communication channel has its benefits and drawbacks.

There are prescriptive approaches to improving coordination during incidents, such as the Incident Command System (ICS). However, Laura Maguire’s research has shown that, in practice, incident responders intentionally deviate from ICS to better manage coordination costs. This is yet another example of flexibility and expertise being employed to deal with an incident.

The next time you observe an incident, or you reflect on an incident where you were one of the responders, think back on to what extent these ingredients were present or absent. Were you able to leverage spare resources, or did you suffer from not being to? Were there operational changes that people wanted to be able to make during the incident, and were they actually able to make them? Were the responders experienced with the sub-systems they were dealing with, and how did that shape their responses? Did different people come up with different hypotheses and strategies? What is it clear to you what the different responders were doing during the incident? These issues are easy to miss if you’re not looking for them. But, once you internalize them, you’ll never be able to unsee them.

The Canva outage: another tale of saturation and resilience

Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humpries’s write-up, and I recommend you read it directly in addition to this post.

What happened at Canva

Trigger: deploying a new version of a page

The trigger for this incident was Canva deploying a new version of their editor page. It’s notable that there was nothing wrong with this new version. The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.

A stale traffic rule

Canva uses Cloudflare as their CDN. Being a CDN, Cloudflare has datacenters all over the world., which are interconnected by a private backbone. Now, I’m not a networking person, but my basic understanding of private backbones is that CDNs lease fibre-optic lines from telecom companies and use these leased lines to ensure that they have dedicated network connectivity and bandwidth between their sites.

Unfortunately for Canva, there was a previously unknown issue on Cloudflare’s side: Cloudflare Wasn’t using their dedicated fibre-optic lines to route traffic between their Northern Virginia and Singapore datacenters. That traffic was instead, unintentionally, going over the public internet.

[A] stale rule in Cloudflare’s traffic management system [that] was sending user IPv6 traffic over public transit between Ashburn and Singapore instead of its default route over the private backbone.

Traffic between Northern Virginia (IAD) and Singapore (SIN) was incorrectly routed over the public network

The routes that this traffic took suffered from considerable packet loss. For Canva users in Asia, this meant that they experienced massive increases in latency when their web browsers attempted to fetch the javascript static assets from the CDN.

A stale rule like this is the kind of issue that the safety researcher James Reason calls a latent pathogen. It’s a problem that remains unnoticed until it emerges as a contributor to an incident.

High latency synchronizes the callers

Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000+ user requests for the JavaScript file waited on the same cache stream. This created a backlog of requests from users in Southeast Asia.

The first client attempts to fetch the new Javascript files from the CDN, but the files aren’t there yet, the CDN must fetch the files from the origin. Because of the added latency, this takes a long time.

During this time, other clients connect, and attempt to fetch the javascript from the CDN. But the CDN has not yet been populated with the files from the origin, that transfer is still in progress.

As Cloudflare notes in this blog post, when all subsequent clients request access to a file that is in the process of being populated in the cache, they must wait until the file has been cached before they can download the file. Except that Cloudflare has implemented functionality called Concurrent Streaming Acceleration which permits multiple clients to simultaneously download a file that is still in the process of being downloaded from the origin server.

The resulting behavior is that the CDN now behaves effectively as a barrier, with all of the clients slowly but simultaneously downloading the assets. With a traditional barrier, the processes who are waiting can proceed once all processes have entered in the barrier. This isn’t quite the same, as the clients who are waiting can all proceed once the CDN completes downloading the asset from the origin.

The transfer completes, the herd thunders

At 9:07 AM UTC, the asset fetch completed, and all 270,000+ pending requests were completed simultaneously.

20 minutes after Canva deployed the new Javascript assets to the origin server, the clients completed fetching them. The next action the clients take is to call Canva’s API service.

With the JavaScript file now accessible, client devices resumed loading the editor, including the previously blocked object panel. The object panel loaded simultaneously across all waiting devices, resulting in a thundering herd of 1.5 million requests per second to the API Gateway — 3x the typical peak load.

There’s one more issue that made this situation even worse: a known performance issue in the API gateway that was slated to be fixed.

A problematic call pattern to a library reduces service throughput

The API Gateways use an event loop model, where code running on event loop threads must not perform any blocking operations.

Two common threading models for request-response services are thread-per-request and async. For services that are I/O-bound (i.e., most of the time servicing each request is spent waiting for I/O operations to complete, typically networking operations), the async model has the potential to achieve better throughput. That’s because the concurrency of the thread-per-request model is limited by the number of operating-system threads. The async model services multiple requests per thread, and so it doesn’t suffer from the thread bottleneck. Canva’s API gateway implements the async model using the popular Netty library.

One of the drawbacks of the async model is the risk associated with the active thread getting blocked, because this can result in a significant performance penalty. The async model multiplexes multiple requests across an individual thread, and none of those requests can make progress when that thread is blocked. Programmers writing code in a service that uses the async model need to take care to minimize the number of blocking calls.

Prior to this incident, we’d made changes to our telemetry library code, inadvertently introducing a performance regression. The change caused certain metrics to be re-registered each time a new value was recorded. This re-registration occurred under a lock within a third-party library.

In Canva’s case, the API gateway logic was making calls to a third-party telemetry library. They were calling the library in such a way that it took a lock, which is a blocking call. This reduced the effective throughput that the API gateway could handle.

Calls to the library led to excessive thread locking

Although the issue had already been identified and a fix had entered our release process the day of the incident, we’d underestimated the impact of the bug and didn’t expedite deploying the fix. This meant it wasn’t deployed before the incident occurred.

Ironically, they were aware of this problematic call pattern, and they were planning on deploying a fix the day of the incident(!).

As an aside, it’s worth noting the role of telemetry logic behavior in the recent OpenAI incident, and in the locking behavior of tracing library in a complex performance issue that Netflix experienced. Observability giveth reliability, and observability taketh reliability away.

Canva is now in a situation where the API gateway is receiving much more traffic than it was provisioned to handle, is also suffering from a performance regression that reduces its ability to handle traffic even more.

Now let’s look at how the system behaved under these conditions.

The load balancer turns into an overload balancer

Because the API Gateway tasks were failing to handle the requests in a timely manner, the load balancers started opening new connections to the already overloaded tasks, further increasing memory pressure.

A load balancer sits in front of a service and distributes the incoming requests across the units of compute. Canva runs atop ECS, so the individual units are called tasks, and the group is called a cluster (you can think of these as being equivalent to pods and replicasets in Kubernetes-land).

The load balancer will only send requests to a task that is healthy. If a task is unhealthy, then it stops being considered as a candidate target destination for the load balancer. This yields good results if the overall cluster is provisioned to handle the load: the traffic gets redirected away from the unhealthy tasks and onto the healthy ones.

Load balancer only sells traffic to the healthy tasks

But now consider the scenario where all of the tasks are operating close to capacity. As tasks go unhealthy, the load balancer will redistribute the load to the remaining “healthy” tasks, which increases the likelihood those tasks gets pushed into an unhealthy state.

Redirecting traffic to the almost-overloaded healthy nodes will push them over

This is a classic example of a positive feedback loop: the more tasks go unhealthy, the more traffic the healthy nodes received, the more likely those tasks will go unhealthy as well.

Autoscaling can’t keep pace

So, now the system is saturated, and the load balancer is effectively making things worse. Instead of shedding load, it’s concentrating load on the tasks that aren’t overloaded yet.

Now, this is the cloud, and the cloud is elastic, and we have a wonderful automation system called the autoscaler that can help us in situations of overload by automating provisioning new capacity.

Only, there’s a problem here, and that’s that the autoscaler simply can’t scale up fast enough. And the reason it can’t scale up fast enough is because of another automation system that’s intended to help in times of overload: Linux’s OOM killer.

The growth of off-heap memory caused the Linux Out Of Memory Killer to terminate all of the running containers in the first 2 minutes, causing a cascading failure across all API Gateway tasks. This outpaced our autoscaling capability, ultimately leading to all requests to canva.com failing.

Operating systems need access to free memory in order to function properly. When all of the memory is consumed by running processes, the operating system runs into trouble. To guard against this, Linux has a feature called the OOM killer which will automatically terminate a process when the operating system is running too low on memory. This frees up memory, enabling the OS to keep functioning.

So, you have the autoscaler which is adding new tasks, and the OOM killer which is quickly destroying existing tasks that have become overloaded.

It’s notable that Humphries uses the term outpaced. This sort of scenario is a common failure mode in complex system failures, where the system gets into a state where it can’t keep up. This phenomenon is called decompensation. Here’s resilience engineering pioneer David Woods describing decompensation on John Willis’s Profound Podcast:

And lag is really saturation in time. That’s what we call decompensation, right? I can’t keep pace, right? Events are moving forward faster. Trouble is building and compounding faster than I, than the team, than the response system can decide on and deploy actions to affect. So I can’t keep pace. – David Woods

Adapting the system to bring it back up

At this point, the API gateway cluster is completely overwhelmed. From the timeline:

9:07 AM UTC – Network issue resolved, but the backlog of queued requests result in a spike of 1.5 million requests per second to the API gateway.

9:08 AM UTC – API Gateway tasks begin failing due to memory exhaustion, leading to a full collapse.

When your system is suffering from overload, there are basically two strategies:

increase the capacity
reduce the load

Wisely, the Canva engineers pursued both strategies in parallel.

Max capacity, but it still isn’t enough

Montgomery Scott, my nominee for patron saint of resilience engineering

We attempted to work around this issue by significantly increasing the desired task count manually. Unfortunately, it didn’t mitigate the issue of tasks being quickly terminated.

The engineers tried to increase capacity manually, but even with the manual scaling, the load was too much: the OOM killer was taking the tasks down too quickly for the system to get back to a healthy state.

Load shedding, human operator edition

The engineers had to improvise a load shedding solution in the moment. The approach they took was to block traffic the CDN layer, using Cloudflare.

At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.

It’s worth noting here that while Cloudflare contributed to this incident with the stale rule, the fact that they could dynamically configure Cloudflare firewall rules meant that Cloudflare also contributed to the mitigation of this incident.

Ramping the traffic back up

Here they turned off all of their traffic to give their system a chance to go back to healthy. But a healthy system under zero load behaves differently from a healthy system under typical load. If you just go back from zero to typical, there’s a risk that you push the system back into an unhealthy state. (One common problem is that autoscaling will have scaled down multiple services due when there’s no load).

Once the number of healthy API Gateway tasks stabilized to a level we were comfortable with, we incrementally restored traffic to canva.com. Starting with Australian users under strict rate limits, we gradually increased the traffic flow to ensure stability before scaling further.

The Canva engineers had the good judgment to ramp up the traffic incrementally rather than turn it back on all at once. They started restoring at 9:45 AM UTC, and were back to taking full traffic at 10:04 AM.

Some general observations

All functional requirements met

I always like to call out situations where, from a functional point of view, everything was actually working fine. In this case, even though there was a stale rule in the Cloudflare traffic management system, and there was a performance regression in the API gateway, everything was working correctly from a functional perspective: packets were still being routed between Singapore and Northern Virginia, and the API gateway was still returning the proper responses for individual requests before it got overloaded.

Rather, these two issues were both performance problems. Performance problems are much harder to spot, and the worst are the ones that you don’t notice until you’re under heavy load.

The irony is that, as an organization gets better at catching functional bugs before they hit production, more and more of the production incidents they face will be related to these more difficult-to-detect-early performance issues.

Automated systems made the problem worse

There were a number of automated systems in play whose behavior made this incident more difficult to deal with.

The Concurrent Streaming Acceleration functionality synchronized the requests from the clients. The OOM killer reduced the time it took for a task to be seen as unhealthy by the load balancer, and the load balancer in turn increased the rate at which tasks went unhealthy.

None of these systems were designed to handle this sort of situation, so they could not automatically change their behavior.

The human operators changed the way the system behaved

It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. They were able to leverage an existing resource, Cloudflare’s firewall functionality, to accomplish this. Based on the description of the action items, I suspect they had never used Cloudflare’s firewall to do this type of load shedding before. But it worked! They successfully adapted the system behavior.

We’re building a detailed internal runbook to make sure we can granularly reroute, block, and then progressively scale up traffic. We’ll use this runbook to quickly mitigate any similar incidents in the future.

This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.

As I’ve written about previously, Woods talks about the idea of a competence envelope. The competence envelope is sort of a conceptual space of the types of inputs that your system can handle. Incidents occur when your system is pushed to operate outside of its competence envelope, such as when it gets more load than it is provisioned to handle:

The competence envelope is a good way to think about the difference between robustness and resilience. You can think of robustness as describing the competence envelope itself: a more robust system may have a larger competence envelope, it is designed to handle a broader range of problems.

However, every system has a finite competence envelope. The difference between a resilient and a brittle system is how that system behaves when it is pushed just outside of its competence envelope.

Incidents happen when the system is pushed outside of its competence envelope

A resilient system can change the way it behaves when pushed outside of the competence envelope due to an incident in order to extend the competence envelope so that it can handle the incident. That’s why we say it has adaptive capacity. On the other hand, a brittle system is one that cannot adapt effectively when it exceeds its competence envelope. A system can be very robust, but also brittle: it may be able to handle a very wide range of problems, but when it faces a scenario it wasn’t designed to handle, it can fall over hard.

The sort of adaptation that resilience demands requires human operators: our automation simply doesn’t have a sophisticated enough model of the world to be able to handle situations like the one that Canva found itself in.

In general, action items after an incident focus on expanding the competence envelope: making changes to the system to handle the scenario that just happened. Improving adaptive capacity involves different kind of work than improving system robustness.

We need to build in the ability to reconfigure our systems in advance, without knowing exactly what sorts of changes we’ll need to make. The Canva engineers had some powerful operational knobs at their disposal through the Cloudflare firewall configuration. This allowed them to make changes. The more powerful and generic these sorts of dynamic configuration features are, the more room for maneuver we have. Of course, dynamic configuration is also dangerous, and is itself a contributor to incidents. Too often we focus solely on the dangers of such functionality in creating incidents, without seeing its ability to help us reconfigure the system to mitigate incidents.

Finally, these sorts of operator interfaces are of no use if the responders aren’t familiar with them. Ultimately, the more your responders know about the system, the better position they’ll be in to implement these adaptations. Changing an unhealthy system is dangerous: no matter how bad things are, you can always accidentally make things worse. The more knowledge about the system you can bring to bear during an incident, the better position you’ll be in to adaptive your system to extend that competence envelope.

TTR: the out-of-control metric

I’m currently reading The Machine That Changed The World. This is a book written back in 1990 comparing Toyota’s approach to automobile manufacturing to the approach used by American car manufacturers. It’s one of the earlier books that popularized the concept of lean manufacturing in the United States.

The software world has drawn a lot of inspiration from lean manufacturing over the past two decades, as is clear from the titles of influential software books such as Implementing Lean Software Development by Tom Poppendieck and Mary Poppendieck (2006), The Principles of Product Development Flow: Second Generation Lean Product Development by Don Reinersten (2009), The Lean Startup by Eric Ries (2011), Lean UX by Jeff Gothelf and Josh Sieden (first published in 2013), and Accelerate: The Science of Lean Software by Nicole Forsgren PhD, Jez Humble, and Gene Kim (2018). Another signal is the proliferation of Kanban boards, which are a concept taken from Toyota. I’ve also seen continuous delivery compared to single-piece flow from lean manufacturing, although I suspect that’s more a case of convergent evolution than borrowing.

In The Machine That Changed The World, the authors mention in passing how Toyota uses the five-whys problem identification technique. I had forgotten that five whys has its origins in manufacturing. This post isn’t about five whys, but it is about how applying concepts from manufacturing to incidents can lead us astray, because of assumptions that turn out to be invalid. For that, I’m going to turn to W. Edwards Deming and the idea of statistical control.

Deming & Statistical control

Deming is the famous American statistician who had enormous influence on the Japanese manufacturing industry in the second half of the twentieth century. My favorite book of his is Out of the Crisis, originally published in 1982, which I highly recommend.

One of the topics Deming wrote on was about a process being under statistical control, with the focus of his book being on manufacturing processes in particular. For example, imagine you’re tracking some metric of interest (e.g., defect rate) for a manufacturing process.

(Note: I have no experience in the manufacturing domain, so you should treat this is as a stylized, cartoon-ish view of things).

Deming argued that when a process is under statistical control, focusing on individual defects, or even days, where the defects are higher than average, is a mistake. To make this more concrete, you can compute an upper control limit and lower control limit based on the statistics of the observed data. There is variation inherent in the process, and focusing on the individual data points that happen to be higher than the average won’t lead to actual improvements.

The process with computed upper and lower control limits. This graph is sometimes called as a control chart.

Instead, in order to make an improvement, you need to make a change to the overall system. This is where Toyota’s five-whys would come in, where you’d identify a root cause, a systemic issue behind why the average rate is as high at is. Once you identified a root cause, you’d apply what Deming called the Plan-Do-Check-Act cycle, where you’d come up with an intervention, apply it, observe whether the intervention has actually achieved the desired improvement, and then react accordingly.

I think people have attempted to apply these concepts to improving availability, where time-to-resolve (TTR) is the control metric. But it doesn’t work the way it does in manufacturing. And the reason it doesn’t has everything to do with the idea of statistical control.

Out of control

Now, let’s imagine a control chart that looks a little different.

In the chart above, there are multiple points that are well outside the control limits. This is a process that is not under statistical control.

Deming notes that, when a process is not under statistical control, statistics associated with the process are meaningless:

Students are not warned in classes nor in the books that for analytic purposes (such as to improve a process), distributions and calculations of mean, mode, standard deviation, chi-square, t-test, etc. serve no useful purpose for improvement of a process unless the data were produced in a state of statistical control. – W. Edwards Deming, Out of the Crisis

Now, I’m willing to bet that if you were to draw a control chart for the time-to-resolve (TTR) metric for your incidents, it would look a lot more like the second control chart than the first one, that you’d have a number of incidents whose TTRs are well outside of the upper control limit.

The reason I feel confident saying this is because when an incident is happening, your system is out of control. This actually is a decent rough-and-ready definition of an incident: an event when your system goes out of control.

Time-to-resolve is a measure of how long your system was out of control. But because your system was out of control, then it isn’t a meaningful metric to perform statistical analysis on. As per the Deming quote above, mean-time-to-resolve (MTTR) serves no useful purpose for improvement.

Anyone who does operations work will surely sympathize with the concept that “a system during an incident is not under statistical control”. Incidents are often chaotic affairs, and individual events (a chance ordering by the thread scheduler, a responder who happens to remember a recent Slack message, someone with important knowledge happens to be on PTO) can mean the difference between a diagnosis that takes minutes versus one that takes hours.

As John Allspaw likes to say, a large TTR cannot distinguish between a complex incident handled well and a simple incident handled poorly. There are too many factors that can influence TTR to conclude anything useful from the metric alone.

Conclusion

To recap:

When a system is out of control, statistical analysis on system metrics are useless as a signal for improving the system.
Incidents are, by definition, events when the system is out control.

TTR, in particular, is a metric that only applies when the system is out of control. It’s really just a measure of how long the system was out of control.

Now, this doesn’t mean that we should throw up our hands and say “we can’t do anything to improve our ability to resolve incidents.” It just means that we need to let go of a metrics-based approach.

Think back to Allspaw’s observation: was your recent long incident a complex one handled well or a simple one handled poorly? How would you determine that? What questions would you ask?

If you don’t examine what worked, how will you know what works?

This is one of my favorite bits from fellow anglophone Québécois Norm McDonald:

Norm: not a lung expert

One of the goals I believe that we all share for post-incident work is to improve the system. For example, when I wrote the post Why I don’t like discussing action items during incident reviews, I understood why people would want to focus on action items: precisely because they share this goal of wanting to improve the system (As a side note, Chris Evans of incident.io wrote a response: Why I like discussing actions items in incident reviews). However, what I want to write about here is not the discussion of action items, but focusing on what went wrong versus what went right.

“How did things go right?”

How did things go right is a question originally posed by the safety researcher Erik Hollnagel, in his the safety paradigm that he calls Safety-II. The central idea is that things actually go right most of the time, and if you want to actually improve the system, you need to get a better understanding of how the system functions, which means you need to broaden your focus beyond the things that broke.

You can find an approachable introduction to Safety-II concepts in the EUROCONTROL white paper From Safety-I to Safety-II. Hollnagel’s ideas have been very influential in the resilience engineering community. As an example, check out my my former colleague Ryan Kitchens’s talk at SREcon Americas 2019: How Did Things Go Right? Learning More from Incidents.

It’s with this how did things go right lens that I want to talk a little bit about incident review.

Beyond “what went well”

Now, in most incident writeups that I’ve read, there is a “what went well” section. However, it’s typically the smallest section in the writeup, with maybe a few bullet points: there’s never any real detail there.

Personally, I’m looking for details like how an experienced engineer recognized the symptoms enough to get a hunch about where to look next, reducing the diagnostic time by hours. Or how engineers leveraged an operational knob that was originally designed for a different purpose. I want to understand how experts are able to do the work of effectively diagnosing problems, mitigating impact, and remediating the problem.

Narrowly, I want to learn this because I want to get this sort of working knowledge into other people’s heads. More broadly, I want to bring to light the actual work that gets done.

We don’t know how the system works

Humans adapt to the constraints they face in order to get their work done. Look for these adaptations if you want to understand the work better.
— @norootcause.surfingcomplexity.com on Bluesky (@norootcause) October 13, 2024

Safety researchers make a distinction between work-as-imagined and work-as-done. We think we understand how the day-to-day work gets done, but we actually don’t. Not really. To take an example from software, we don’t actually know how people really use the tooling to get their work done, and I can confirm this by being on-call for internal support for development tools in previous jobs. (“You’re using our tool to do what?” is not an uncommon reaction from the on-call person). People do things we never imagined, in both wonderful and horrifying ways (sometimes at the same time!).

We also don’t see all of the ways that people coordinate to get their work done. There are the meetings, the slack messages, the comments on the pull requests, but there’s also the shared understanding, the common knowledge, the stuff that everybody knows that everybody else knows, that enables people to get this work done, while reducing the amount of explicit communication that has to happen.

What’s remarkable is that these work patterns, well, they work. These people in your org are able to get their stuff done, almost all of the time. Some of them may exhibit mastery of the tooling, and others may use the tooling in ways even it was never intended that are fundamentally unsafe. But we’re never going to actually know unless we actually look at how they’re doing their work.

Because how people do their work is how the system works. And if we’re going to propose and implement interventions, it’s very likely that the outcomes of the interventions will surprise us, because these changes might disrupt effective ways of doing work, and people will adapt to those interventions in ways we never anticipated, and in ways we may never even know if we don’t take a look.

Then why use incidents to look at things that go right?

At first glance, it does seem odd to use incidents as the place to examine where work goes well: given that incidents are times where something unquestionably went wrong. It would be wonderful if we could study how work happens when things are going well. Heck, I’d love to see companies have sociologists or anthropologists on staff to study how the work happens at the company. Regrettably, though, incidents are one of the only times when the organization is actually willing to devote resources (specifically, time) on examining work in fine-grained detail.

We can use incidents to study how things go well, but we have to keep a couple of things in mind. One, we need to recognize that adaptations that fail led to an incident are usually successful, which is why people developed those adaptations. Note that because an adaptation usually works, doesn’t mean that it’s a good thing to keep doing: an adaptation could be a dangerous workaround to a constraint like a third-party system that can’t be changed directly and so must be awkwardly worked around.

Second, we need to look in more detail, to remark, at incident response that is remarkable. When incident response goes well, there is impressive diagnostic, coordination, and improvisation work to get the system back to healthy. These are the kinds of skills you want to foster across your organization. If you want to build tools to make this work even better, you should take the time to understand just how this work is done today. Keep this in mind when you’re proposing new interventions. After all, if you don’t examine what worked, how will you know what works?

Safety first!

I’m sure you’ve heard the slogan “safety first”. It is a statement of values for an organization, but let’s think about how to define what it should mean explicitly. Here’s how I propose to define safety first, in the context of a company. I’ll assume the company is in the tech (software) industry, since that’s the one I know best. So, in this context, you can think of “safety” as being about avoiding system outages, rather than about, say, avoiding injuries on a work site.

Here we go:

A tech company is a safety first company if any engineer has the ability to extend a project deadline, provided that the engineer judges in the moment that they need additional time in order to accomplish the work more safely (e.g., by following an onerous procedure for making a change, or doing additional validation work that is particular time-intensive).

This ability to extend the deadline must be:

automatic
unquestioned
consequence-free

Automatic. The engineer does not to explicitly ask someone else for permission before extending the deadline.

Unquestioned. Nobody is permitted to ask the engineer “why did you extend the deadline?” after-the-fact.

Consequence-free. This action cannot be held against the engineer. For example, it cannot be a factor in a performance review.

Now, anyone who has worked in management would say to me, “Lorin, this is ridiculous. If you give people the ability to extend deadlines without consequence, then they’re just going to use this constantly, even if there isn’t any benefit to safety. It’s going to drastically harm the organization’s ability to actually get anything done”.

And, the truth is, they’re absolutely right. We all work under deadlines, and we all know that if there was a magical “extend deadline” button that anyone could press, that button would be pressed a lot, and not always for the purpose of improving safety. Organizations need to execute, and if anybody could introduce delays, this would cripple execution.

But this response is exactly the reason why safety first will always be a lie. Production pressure is an unavoidable reality for all organizations. Because of this, the system will always push back against delays, and that includes delays for the benefit of safety. This means engineers will always face double binds, where they will feel pressure to execute on schedule, but will be punished if they make decisions that facilitate execution but reduce safety.

Safety is never first in organization: it’s always one of a number of factors that trade off against each other. And those sorts of tradeoff decisions happen day-to-day and moment-to-moment.

Remember that the next time someone is criticized for “not being careful enough” after a change brings down production.

The “CrowdStrike” approach to reliability work

There’s a lot we simply don’t know about how reliability work was prioritized inside of CrowdStrike, but I’m going to propose a little thought experiment about the incident where I make some assumptions.

First, let’s assume that the CrowdStrike incident was the first time they had an incident that was triggered by a Rapid Response Content update, which is a config update. We’ll assume that previous sensor issues that led to Linux crashes were related to a sensor release, which is a code update.

Next, let’s assume that CrowdStrike focuses their reliability work on addressing the identified root cause of previous incidents.

Finally, let’s assume that none of the mitigations documented in their RCA were identified as action items that addressed the root cause of any incidents they experienced before this big one.

If these three assumptions are true, then it explains why these mitigations weren’t done previously. Because they didn’t address the root causes of previous incidents, and they focus their post-incident work on identifying the root cause of previous incidents. Now, I have no idea if any of these assumptions are actually true, but they sound plausible enough for this thought experiment to hold.

This thought experiment demonstrates the danger of focusing post-incident work on addressing the root causes of previous incidents: it acts to obscure other risks in the system that don’t happen to fit into the root cause analysis. After all, issues around validation of the channel files or staging of deploys were not really the root cause of any of the incidents before this one. The risks that are still in your system don’t care about what you have labeled “the real root cause” of the previous incident, and there’s no reason to believe that whatever gets this label is the thing that is most likely to bite you in the future.

I propose (cheekily) to refer to the prioritize-identifying-and-addressing-the-root-cause–of-previous-incidents thinking as the “CrowdStrike” approach to reliability work.

I put “CrowdStrike” in quotes because, in a sense, this really isn’t about them at all: I have no idea if the assumptions in this thought experiment are true. But my motivation for using this phrase is more about using CrowdStrike as a symbol that’s become salient to our industry then about the particular details of that company.

Are you on the lookout for the many different signals of risk in your system, or are you taking the “CrowdStrike” approach to reliability work?