The “CrowdStrike” approach to reliability work

There’s a lot we simply don’t know about how reliability work was prioritized inside of CrowdStrike, but I’m going to propose a little thought experiment about the incident where I make some assumptions.

First, let’s assume that the CrowdStrike incident was the first time they had an incident that was triggered by a Rapid Response Content update, which is a config update. We’ll assume that previous sensor issues that led to Linux crashes were related to a sensor release, which is a code update.

Next, let’s assume that CrowdStrike focuses their reliability work on addressing the identified root cause of previous incidents.

Finally, let’s assume that none of the mitigations documented in their RCA were identified as action items that addressed the root cause of any incidents they experienced before this big one.

If these three assumptions are true, then it explains why these mitigations weren’t done previously. Because they didn’t address the root causes of previous incidents, and they focus their post-incident work on identifying the root cause of previous incidents. Now, I have no idea if any of these assumptions are actually true, but they sound plausible enough for this thought experiment to hold.

This thought experiment demonstrates the danger of focusing post-incident work on addressing the root causes of previous incidents: it acts to obscure other risks in the system that don’t happen to fit into the root cause analysis. After all, issues around validation of the channel files or staging of deploys were not really the root cause of any of the incidents before this one. The risks that are still in your system don’t care about what you have labeled “the real root cause” of the previous incident, and there’s no reason to believe that whatever gets this label is the thing that is most likely to bite you in the future.

I propose (cheekily) to refer to the prioritize-identifying-and-addressing-the-root-cause–of-previous-incidents thinking as the “CrowdStrike” approach to reliability work.

I put “CrowdStrike” in quotes because, in a sense, this really isn’t about them at all: I have no idea if the assumptions in this thought experiment are true. But my motivation for using this phrase is more about using CrowdStrike as a symbol that’s become salient to our industry then about the particular details of that company.

Are you on the lookout for the many different signals of risk in your system, or are you taking the “CrowdStrike” approach to reliability work?

Expect it most when you expect it least

Homer Simpson saying "It's probably the person you least suspect." — Homer Simpson: philosopher

Yesterday, CrowdStrike released a Preliminary Post-Incident Review of the major outage that happened last week. I’m going to wait until the full post-incident review is released before I do any significant commentary, and I expect we’ll have to wait at least a month for that. But I did want to highlight one section of the doc from the section titled What Happened on July 19, 2024, emphasis mine

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

And now, let’s reach way back into the distant past of three weeks ago, when the The Canadian Radio-television and Telecommunications Commission (CRTC) posted an executive summary of a major outage, which I blogged about at the time. Here’s the part I want to call attention to, once again, emphasis mine.

Rogers had initially assessed the risk of this seven-phased process as “High.” However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.

In both cases, the engineers had built up confidence over time that the types of production changes they were making were low risk.

When we’re doing something new with a technology, we tend to be much more careful with it, it’s riskier, we’re shaking things out. But, over time, after there haven’t been any issues, we start to gain more trust in the tech, confident that it’s a reliable technology. Our internal perception of the risk adjusts based on our experience, and we come to believe that the risks of these sorts of changes are low. Any organization who concentrates their reliability efforts on action items in the wake of an incident, rather than focusing on the normal work that doesn’t result in incidents, is implicit making this type of claim. The squeaky incident gets the reliability grease. And, indeed, it’s rational to allocate your reliability effort based on your perception of risk. Any change can break us, but we can’t treat every change the same. How could it be otherwise?

The challenge for us is that large incidents are not always preceded by smaller ones, which means that there may be risks in your system that haven’t manifested as minor outages. I think these types of risks are the most dangerous ones of all, precisely because they’re much harder for the organization to see. How are you going to prioritize doing the availability work for a problem that hasn’t bitten you yet, when your smaller incidents demonstrate that you have been bitten by so many other problems?

This means that someone has to hunt for weak signals of risk and advocate for doing the kind of reliability work where there isn’t a pattern of incidents you can point to as justification. The big ones often don’t look like the small ones, and sometimes the only signal you get in advance is a still, small sound.

A still, small sound

One of the benefits of having attended a religious elementary school (despite being a fairly secular one as religious schools go) is exposure to the text of the Bible. There are two verses from the first book of Kings that have always stuck with me. They are from I Kings 19:11–13, where God is speaking to the prophet Elijah. I’ve taken the text below from the English translation that was done by A.J. Rosenberg:

And He said: Go out and stand in the mountain before the Lord, behold!

The Lord passes, and a great and strong wind splitting mountains and shattering boulders before the Lord, but the Lord was not in the wind.

And after the wind an earthquake-not in the earthquake was the Lord.

After the earthquake fire, not in the fire was the Lord,

and after the fire a still small sound.

I was thinking of these passages when listening to an episode of Todd Conklin’s Pre-accident Investigation podcast, with the episode title Safety Moment – Embracing the Yellow: Rethinking Safety Indicators.

It’s in the aftermath of the big incidents where most of the attention is focused when it comes to investing in reliability: the great and strong winds, the earthquakes, the fires. But I think it’s the subtler signals, what Conklin refers to as the yellow (as opposed to the glaring signals of the red or the absence of signal of the green) that are the ones that communicate to us about the risk of the major outage that has not happened yet.

It’s these still small sounds we need to able to hear in order to see the risks that could eventually lead to catastrophe. But to hear these sounds, we need to listen for them.

Book review: How Life Works

In the 1980s, the anthropologist Lucy Suchman studied how office workers interacted with sophisticated photocopiers. What she found was that people’s actions were not determined by predefined plans. Instead, people decided what act to take based on the details of the particular situation they found themselves in. They used predefined plans as resources for helping them choose which action to take, rather than as a set of instructions to follow.

I couldn’t help thinking of Suchman when reading How Life Works. In it, the British science writer Philip Ball presents a new view of the role of DNA, genes, and the cell in the field of biology. Just as Suchman argued that people use plans as resources rather than explicit instructions, Ball discusses how the cell uses DNA as resources. Our genetic code is a toolbox, not a blueprint.

Imagine you’re on a software team that owns a service, and an academic researcher who is interested in software but doesn’t really know anything about it comes to and asks, “What function does redis play in your service? What would happen if it got knocked out?”. This is a reasonable question, and you explain the role that redis plays in improving performance through caching. And then he asks another question: “What function does your IDE’s debugger play in your service?” He notices the confused look on your face and tries to clarify the question by asking, “Imagine another team had to build the same service, but they didn’t have the IDE debugger? How would the behavior of the service be different? Which functions would be impaired” And you try to explain that you don’t actually know how it would be different. That the debugger, unlike redis, is a tool, which is sometimes used to help diagnose problems. But there are multiple ways to debug (for example, using log statements). It might not make any difference at all if that new team doesn’t happen to use the debugger. There’s no direct mapping of the debugger’s presence to the service’s functionality: the debugger doesn’t play a functional role in the service the way that redis does. In fact, the next team that builds a service might not end up needing to use the debugger at all, so removing it might have no observable effect on the next service.

The old view sees these DNA segments like redis, having an explicit functional role, and the new view sees them more like a debugger, as tools to support the cell in performing functions. As Ball puts it, “The old view of genes as distinct segments of DNA strung along the chromosomes like beads, interspersed with junk, and each controlling some aspect of phenotype, was basically a kind of genetic phrenology.” The research has shown that the story is more complex than that, and that there is no simple mapping between DNA segments in our chromosomes and observed traits, or phenotypes. Instead, these DNA segments are yet another input in a complex web of dynamic, interacting components. Instead of focusing on these DNA strands of our genome, Ball directs our attention on the cell as a more useful unit of analysis. A genome, he points out, is not capable of constructing a cell. Rather, a cell is always the context that must exist for the genome to be able to do anything.

The problem space that evolution works in is very different from the one that human engineers deal with, and, consequently, the solution space can appear quite alien to us. The watch has long been a metaphor for biological organisms (for example, Dawkins’s book “The Blind Watchmaker”), but biological systems are not like watches with their well-machined gears. The micro-world of the cell contains machinery at the scale of molecules, which is a very noisy place. Because biological systems must be energy efficient, they function close to the limits of thermal noise. That requires a very different types of machines than the ones we interact with at our human scales. Biology can’t use specialized parts with high tolerances, but must instead make do with more generic parts that can be used to solve many different kinds of problems. And because the cells can use the same parts of the genome to solve different problems, asking questions like “what does that protein do” becomes much harder to answer: the function of a protein depends on the context in which the cell uses it, and the cell can use it in multiple contexts. Proteins are not like keys designed to fit specifically into unique locks, but bind promiscuously to different sites.

This book takes a very systems-thinking approach, as opposed to a mechanistic one, and consequently I find it very appealing. This is a complex, messy world of signaling networks, where behavior emerges from the interaction of genome and environment. There are many connections here to the field of resilience engineering, which has long viewed biology as a model (for example, see Richard Cook’s talk on the resilience of bone). In this model, the genome acts as a set of resources which the cell can leverage to adapt to different challenges. The genome is an example, possible the paradigmatic example, of adaptive capacity. Or, as the biologists Michael Levin and Rafael Yuste put it, whom Ball quotes: “Evolution, it seems, doesn’t come up with answers so much as generate flexible problem-solving agents that can rise to new challenges and figure things out on their own.”

Quick takes on Rogers Network outage executive summary

The Canadian Radio-television and Telecommunications Commission (CRTC) has posted an executive summary of a report on a major telecom outage that happened in 2022 to Rogers Communications, which is one of the major Canadian telecom companies.

The full report doesn’t seem to be available yet, and I’m not sure if it ever will be publicly released (edit: the full report was released on November 2024). I recommend you read the executive summary, but here are some quick impressions of mine.

Note that I’m not a network engineer (I’ve only managed a single rack of servers in my time), so I don’t have any domain expertise here.

Migration!

When you hear “large-scale outage”, a good bet is that involved a migration. The language of the report describes it as an upgrade, but I suspect this qualifies as a migration.

In the weeks leading to the day of the outage on 8 July 2022, Rogers was executing on a seven-phase process to upgrade its IP core network. The outage occurred during the sixth phase of this upgrade process.

I don’t know anything about what’s involved a telecom upgrading its IP core network, but I do have a lot of general opinions about migrations, and I’m willing to bet they apply here as well.

I think of migrations as high-impact, bespoke changes that the system was not originally designed to accommodate.

They’re high-impact because things can go quite badly if something goes wrong. If you’ve worked at a larger company, you’ve probably experienced migrations that seem to take forever, and this is one of the reasons why: there’s a lot of downside risk in doing migration work (and often not much immediate upside benefit for the people who have to do the work, but that’s a story for another day).

Migrations are bespoke in the sense that each migration is a one-off. This makes migrations even more dangerous because:

The organization doesn’t have any operational muscles around doing any particular migration, because each one is new.
Because each migration is unique, it’s not worth the effort to build tooling to support doing the migration. And even if you build tools, those tools will always be new, which means they haven’t been hardened through production use.

There’s a reason why you hear about continuous integration and continuous delivery but not continuous migration, even though every org past a certain age will have multiple migrations in flight.

Finally, migrations are changes that the system was not originally designed to accommodate. In my entire career, during the design of a new system, I have never heard anyone ask, “How are we going to migrate off of this new system at the end of its life?” We just don’t design for migrating off of things. I don’t even know if it’s possible to do so.

Saturation!

Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration. When the core network routers crashed, user traffic could no longer be routed to the appropriate destination. Consequently, services such as mobile, home phone, Internet, business wireline connectivity, and 9-1-1 calling ceased functioning.

Saturation is a term from resilience engineering which refers to a system receiving being pushed to the limit of the amount of load that it can handle. It’s remarkable how many outages in distributed systems are related to some part of the system being overloaded, or hitting a rate limit, or exceeding some other limit. (For example, see Slack’s Jan 2021 outage). This incident is another textbook example of a brittle system, which falls over when it becomes saturated.

Perception of risk

I mentioned earlier that migrations are risky, and everyone knows migrations are risky. Roger engineers knew that as well:

Rogers had initially assessed the risk of this seven-phased process as “High.”

Ironically, the fact that the migration had gone smoothly up until that point led them to revise their risk assessment downwards.

However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.

I wrote about this phenomenon in a previous post, Any change can break us, but we can’t treat every change the same. The engineers gained confidence as they progressed through the migration, and things went well. Which is perfectly natural. In fact, this is one of the strengths of the continuous delivery approach: you build enough confidence that you don’t have to babysit every single deploy anymore.

But the problem is that we can never perfectly assess the risk in the system. And no matter how much confidence we build up, that one change that we believe is safe can end up taking down the whole system.

I should note that the report is pretty blame-y when it comes to this part:

Downgrading the risk assessment to “Low” for changing the Access Control List filter in a routing policy contravenes industry norms, which require high scrutiny for such configuration changes, including laboratory testing before deploying in the production network.

I wish I had more context here. How did it make sense to them at the time? What sorts of constraints or pressures were they under? Hopefully the full report reveals more details.

Cleanup

Rogers staff deleted the policy filter that prevented IP route flooding in an effort to clean up the configuration files of the distribution routers.

Cleanup work has many of the same risks as migration work: it’s high-impact and bespoke. Say “cleanup script” to an SRE and watch the reaction on their face.

But not cleaning up is also a risk! The solution can’t be “never do cleanup” in the same way it can’t be “never do migrations”. Rather, we need to recognize that this work always involve risk trade-offs. There’s no safe path here.

Failure mode makes incident response harder

At the time of the July 2022 outage, Rogers had a management network that relied on the Rogers IP core network. When the IP core network failed during the outage, remote Rogers employees were unable to access the management network. …

Rogers staff relied on the company’s own mobile and Internet services for connectivity to communicate among themselves. When both the wireless and wireline networks failed, Rogers staff, especially critical incident management staff, were not able to communicate effectively during the early hours of the outage.

When an outage affects not just your customers but also your engineers doing incident response, life gets a whole lot harder.

This brings to mind the Facebook outage from 2021:

[A]s our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

Component substitution fallacy

The authors point out that the system was not designed to handle this sort of overload:

Absence of router overload protection. The July 2022 outage exposed the absence of overload protection on the core network routers. The network failure could have been prevented had the core network routers been configured with an overload limit that specifies the maximum acceptable number of IP routing data the router can support. However, the Rogers core network routers were not configured with such overload protection mechanisms. Hence, when the policy filter was removed from the distribution router, an excessive amount of routing data flooded the core routers, which led them to crash.

This is a great example of the component substitution fallacy, which fails to acknowledge explicit trade-offs that are made within organizations about which parts of the system to work on. Note that the Rogers engineers will certainly build in router overload protection now, but it means that’s engineering effort that won’t be spent building protections against other failure modes that haven’t happened yet.

Acknowledging trade-offs

To the authors’ credit, they explicitly acknowledge the tradeoffs involved in the overall design of the system.

The Rogers network is a national Tier 1 network and is architecturally designed for reliability; it is typical of what would be expected of such a Tier 1 service provider network. The July 2022 outage was not the result of a design flaw in the Rogers core network architecture. However, with both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services. Such a network architecture is common to many service providers and is an example of the trend of converged wireline and wireless telecom networks. It is a design choice by service providers, including Rogers, that seeks to balance cost with performance.

I really hope the CRTC eventually releases the full report, I’m looking forward to reading it.

The error term isn’t Pareto distributed

You’re probably familiar with the 80-20 rule: when 80% of the X stems from only 20% of the Y. For example, 80% of your revenue comes from only 20% of your customer, or 80% of the logs that you’re storing are generated from only 20% of the services. Talk to anybody who is looking to reduce cloud costs in their organization, and chances are they’re attacking the problem by looking for the 20% of services that are generating 80% of the costs, rather than trying to reduce cloud usage uniformly across all services.

Not all phenomena follow the 80-20 rule, but it’s common enough in the systems we encounter that it’s a good rule of thumb. The technical term for it is the Pareto principle, and distributions that exhibit this 80-20 phenomena are an example of Pareto distributions, also known as power law distributions.

A common implicit assumption is that availability problems are Pareto-distributed. If you look at incidents and keep track of their causes, you should be able to identify a small number of causes that lead to the majority of the incidents. Because of this, we should attribute a cause to an incident, and then look to see which causes most often contribute to incident in order to identify interventions that will have the largest impact: those 20% of improvements that should yield 80% improvements. If you believe in the RCA (root cause analysis) model of incidents, that’s a reasonable assumption to make: identify the root cause of each incident, track these across incidents, and then invest in projects that attack the most expensive root causes.

If we can identify the problematic red dots, we can achieve significant improvements

But if you’re a frequent reader of this blog, you know there’s an alternative model of how incidents come to be. I’m fond of referring to the alternative as the LFI (learning from incidents) model. However, in the safety science research community this alternative model is more commonly associated with terms such as the New View, the New Look, or Safety-II.

The contrast of the LFI model with the RCA model is captured well in Richard Cook’s famous monograph, How Complex Systems Fail:

Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident.

In this model, incidents don’t happen because of a single cause: rather, it’s through the interaction of multiple contributors.

If incidents stem from problematic interactions rather than problematic components, then focusing on components will lead us astray

Under an incident model where incidents are a result of interactions, we wouldn’t expect there to be a Pareto distribution of causes. This means that if we look at a distribution of incidents by contributor (or cause, or component), we’re unlikely to see any one of these stand out as being the source of a large number of incidents. Instead, by looking at the components instead of the interactions, we’re unlikely to see much of any pattern at all.

Turning back to How Complex Systems Fail again:

Complex systems are heavily and successfully defended against failure.

The high consequences of failure lead over time to the construction of multiple layers of defense against failure. These defenses include obvious technical components (e.g.
backup systems, ‘safety’ features of equipment) and human components (e.g. training, knowledge) but also a variety of organizational, institutional, and regulatory defenses (e.g. policies and procedures, certification, work rules, team training). The effect of these measures is to provide a series of shields that normally divert operations away from accidents.

All of these explicit and implicit components of the system work together to keep things up and running. You can think of this as a process that most of the time generates safety (or availability), but sometimes doesn’t. You can think of these failure cases as a sort of error or residual term, they’re the leftover, the weird cases at the edges of our system. I think treating incidents as an error term is a useful metaphor because we don’t fall into the trap of thinking about error terms as looking like Pareto distributions.

This doesn’t mean that there aren’t patterns of failure in our incidents: there absolutely are. But it means that the patterns we need to look for aren’t going to visible if we don’t ask the right questions. It’s the difference between asking “which services were involved?” and “what were the goal conflicts that the engineers were facing?“

What if everybody did everything right?

In the wake of an incident, we want to answer the questions “What happened?” and, afterwards, “What should we do differently going forward?” Invariably, this leads to people trying to answer the question “what went wrong?”, or, even more specifically, the two questions:

What did we do wrong here?
What didn’t we do that we should have?

There’s an implicit assumption behind these questions that because there was a bad outcome, that there must have been a bad action (or an absence of a good action) that led to that outcome. It’s such a natural conclusion to reach that I’ve only ever seen it questioned by people who have been exposed to concepts from resilience engineering.

In some sense, this belief in bad outcomes from bad actions is like Aristole’s claim that heavier objects fall faster than lighter ones. Intuitively, it seems obvious, but our intuitions lead us astray. But in another sense, it’s quite different, because it’s not something we can test by running an experiment. Instead, the idea that systems fail because somebody did something wrong (or didn’t do something right) is more like a lens or a frame, it’s a perspective, a way of making sense of the incident. It’s like how the fields of economics, psychology, and sociology act as different lenses for making sense of the world: a sociological explanation of a phenomenon (say, the First World War) will be different from an economic explanation, and we will get different insights from the different lenses.

An alternative lens for making sense of an incident is to ask the question “how did this incident happen, assuming that everybody did everything right?” In other words, assume that everybody whose actions contributed to the incident made the best possible decision based on the information they had, and the constraints and incentives that were imposed upon them.

Looking at the incident from this perspective will yield will very different kinds of insights, because it will generate different types of questions, such as:

What information did people know in the moment?
What were the constraints that people were operating under?

Now, I personally believe that the second perspective is strictly superior to the first, but I acknowledge that this is a judgment based on personal experience. However, even if you think the first perspective also has merit, if you truly want to maximize the amount of insight you get from a post-incident analysis, then I encourage you to try to the second perspective as well. Make the claim “Let’s assume everybody did everything right. How could this incident still have happened?” I guarantee, you’ll learn something new about your system that you didn’t know before.

Any change can break us, but we can’t treat every change the same

Here are some excerpts from an incident story told by John Allspaw about his time at Etsy (circa 2012), titled Learning Effectively From Incidents: The Messy Details.

In this story, the site goes down:

September 2012 afternoon, this is a tweet from the Etsy status account saying that there’s an issue on the site… People said, oh, the site’s down. People started noticing that the site is down.

NEW: Site Issues http://t.co/Xkc0zLi2
— Etsy Status (@etsystatus) September 5, 2012

Possibly the referenced issue?

This is a tough outage: the web servers are down so hard that they aren’t even reachable:

And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. And people said, well, we can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights out for hard reboots. And people even said, well, because we’re talking about hundreds of web servers. Could it be faster, we could even just power cycle these. This is a big deal here. So whatever it wasn’t in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

One of the contributors? A CSS change to remove support for old browsers!

And one of the tasks was with the performance team and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So, let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff. …
And in this case, we had this template-based template used as far as we knew everything, and this little header-ie.css, was the actual workaround. And so the idea was, let’s remove all the references to this CSS file in this base template and we’ll remove the CSS file.

How does a CSS change contribute to a major outage?

The request would come in for something that wasn’t there, 404 would happen all the time. The server would say, well, I don’t have that. So I’m going to give you a 404 page and so then I got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers hung, nothing could be done.

I love this story because a CSS change feels innocuous. CSS just controls presentation, right? How could that impact availability? From the story (emphasis mine)

And this had been tested and reviewed by multiple people. It’s not all that big of a deal of a change, which is why it was a task that was sort of slated for the next person who comes through boot camp in the performance team.

The reason a CSS change can cascade into an outage is that in a complex system there are all of these couplings that we don’t even know are there until we get stung by them.

One lesson you might take away from this story is “you should treat every proposed change like it could bring down the entire system”. But I think that’s the wrong lesson. The reason I think so is because of another constraint we all face: finite resources. Perhaps in a world where we always had an unlimited amount of time to make any changes, we could take this approach. But we don’t live in that world. We only have a fixed number of hours in a week, which means we need to budget our time. And so we make judgment calls on how much time we’re going to spend on manually validating a change based on how risky we perceive that change to be. When I review someone else’s pull request, for example, the amount of effort I spend on it is going to vary based on the nature of the change. For example, I’m going to look more closely at changes to database schemas than I am to changes in log messages.

But that means that we’re ultimately going to miss some of these CSS-change-breaks-the-site kinds of changes. It’s fundamentally inevitable that this is going to happen: it’s simply in the nature of complex systems. You can try to add process to force people to scrutinize every change with the same level of effort, but unless you remove schedule pressure, that’s not going to have the desired effect. People are going to make efficiency-thoroughness tradeoffs because they are held accountable for hitting their OKRs, and they can’t achieve those OKR if they put in the same amount of effort to evaluate every single production change.

Given that we can’t avoid such failures, the best we can do is to be ready to respond to them.

“Human error” means they don’t understand how the system worked

One of the services that the Amazon cloud provides is called S3, which is a data storage service. Imagine a hypothetical scenario where S3 had a major outage, and Amazon’s explanation of the outage was “a hard drive failed”.

Engineers wouldn’t believe this explanation. It’s not that they would doubt that a hard drive failed; we know that hard drives fail all of the time. In fact, it’s precisely because hard drives are prone to failure, and S3 stays up, that they wouldn’t accept this as an explanation. S3 has been architected to function correctly even in the face of individual hard drives failing. While a failed hard drive could certainly be a contributor to an outage, it can’t be the whole story. Otherwise, S3 would constantly be going down. To say “S3 went down because a hard drive failed” is to admit “I don’t know how S3 normally works when it experiences hard drive failures”.

We accept “human error” as the explanation for failures of reliable systems. Now, I’m a bit of an extremist when it comes to the idea of human error, I believe it simply doesn’t exist. But, let’s put that aside for now, and assume that human error is a real thing, and people make mistakes. The thing is, humans are constantly making mistakes. Every day, in every organization, there are many people that are making many mistakes. The people who work on systems that stay up most of the time are not some sort of hyper-vigilant super-humans that make fewer mistakes than the rest of us. Rather, these people are embedded within systems that have evolved over time to be resistant to these sorts of individual mistakes.

As the late Dr. Richard Cook (no fan of the concept of “human error” himself) put it in How Complex Systems Fail: “Complex systems are heavily and successfully defended against failure”. As a consequence of this, “Catastrophe requires multiple failures – single point failures are not enough.”

Reliable systems are error-tolerant. There are mechanisms within such systems to guard against the kinds of mistakes that people make on a regular basis. Ironically, these mechanisms are not necessarily designed into the system: they can evolve organically and invisibly. But they are there, and they are the reason that these systems stay up day after day.

What this means is that when someone attributes a failure to “human error”, it means that they do not see these defenses in the system, and so they don’t actually have an understanding of how all of these defenses failed in this scenario. When you hear “human error” as an explanation for why a system failed, you should think “this person doesn’t know how the system stays up.” Because without knowing how the system stays up, it is impossible to understand the cases where it comes down.

(I believe Cook himself said something to the effect of “human error is the point where they stopped asking questions”).

Accidents manage you

Here’s a a line I liked from episode 461 of Todd Conklin’s PreAccident Investigation Podcast. At around the 8:25 mark, Conklin says:

….accidents, in fact, aren’t preventable. Accidents manage you, so what you really manage is the capacity for the organization to fail safely.

The phrasing “accidents manage you” is great, because it drives home the fact that an incident is not something that we can control. When an incident happens, the system has, quite literally, gone out of control.

While there’s no action we can take that will prevent all incidents, there are things we can do in advance to limit the harm that result from these future incidents. We can build what Conklin calls capacity. This capacity to absorb risk is the thing that we have control over. But it doesn’t come for free: it requires an investment of time and resources.