Conveying confusion without confusing the reader

Confusion is a hallmark of a complex incident. In the moment, we know something is wrong, but we struggle to make sense of the different signals that we’re seeing. We don’t understand the underlying failure mode.

After the incident is over and the engineers have had a chance to dig into what happened, these confusing signals make sense in retrospect. We find out about the bug or inadvertent config change or unexpected data corruption that led to the symptoms we saw during the incident.

When writing up the narrative, the incident investigator must choose whether to inform the reader in advance about the details of the failure mode, or to withhold this info until the point in time in the narrative when the engineers involved understood what was happening.

I prefer the first approach: giving the reader information about the failure mode details in the narrative before the actors involved in the incident have that information. This enables the reader to make sense of the strange, anomalous signals in a way that the engineers in the moment were not able to.

I do this because, as a reader, I don’t enjoy the feeling of being confused: I’m not looking for a mystery when I read a writeup. If I’m reading about a series of confusing signals that engineers are looking at (e.g., traffic spikes, RPC errors), and I can’t make sense of them either, I tend to get bored. It’s just a mess of confusion.

On the other hand, if I know why these signals are happening, but the characters in the story don’t know, then that is more effective in creating tension in my mind. I want to read on to resolve the tension, to figure out how the engineers ended up diagnosing the problem.

When informing the reader about the failure mode in advance, the challenge is to avoid infecting the reader with hindsight bias. If the reader thinks, “the problem was obviously X. How could they not see it?”, then I’ve failed in the writeup. What I try to do is put the reader into the head of the people involved as much as possible: to try to convey the confusion they were experiencing in the moment, and the source of that confusion.

By enabling the reader to identify with the people involved, you can communicate to the reader how confusing the situation was to the people involved, without directly inflicting that same confusion upon them.

The Gamma Knife model of incidents

Safety researchers love using metaphors as a framework to describe how accidents happen, which they call accident models.

One of the earliest models, dating back to 1931, is Herbert W. Heinrich’s domino model of accident causation:

About sixty years later, in 1990, James Reason proposed the Swiss cheese model of accident causation:

By Davidmack – Own work, CC BY-SA 3.0,

About seven years later, in 1997, Jens Rasmussen proposed the dynamic safety model. This model doesn’t have as evocative a name as “domino” or “Swiss cheese”. I like to call it the “boundary” model, because everyone talks about it in terms of drifting towards a safety boundary:

This diagram originally appears in Rasmussen’s paper Risk management in a dynamic society: a modelling problem. I re-created the diagram from that paper.

I haven’t encountered a good metaphor that captures the role of multiple contributing factors in incidents. I’m going to propose one and call it the Gamma Knife model of incidents.

Gamma Knife is a system that surgeons use for treating brain tumors by focusing multiple beams of gamma radiation on a small volume inside the brain.

Multiple beams of gamma radiation converge on the target. From the Radiosurgery Wikipedia page.

Each individual beam is of low enough intensity that it doesn’t affect brain tissue. It is only when multiple beams intersect at one point that the combined intensity of the radiation has an impact.

Every day inside of your system, there are things that are happening (or not happening(!)) that could potentially enable an incident. You can think of each of these as a low-level beam of gamma radiation going off in a random direction. Somebody pushes a change to production, zap! Somebody makes a configuration change with a typo, zap! Somebody goes on vacation, zap! There’s an on-call shift change, zap! A particular service hasn’t been deployed in weeks, zap!

Most of these zaps are harmless: they have no observable impact on the health of the overall system. Sometimes, though, many of these zaps will happen to go off at the same time and all point to the same location. When that happens, boom, you have an incident on your hands.

Alas, there’s no way to get rid of all of those little beams of radiation that go off. You can eliminate some of them, but in the process, you’ll invariably create new ones. There are some you can’t avoid, and there are many that you don’t even see, unless you know how to look for them. One of the reasons I am interested in otherwise harmless operational surprises is that they can reveal the existence of previously unknown beams.

Tuning to the future

In short, the resilience of a system corresponds to its adaptive capacity tuned to the future. [emphasis added]

Branlat, Matthieu & Woods, David. (2010). How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective. AAAI Fall Symposium – Technical Report

In simple terms, an incident is a bad thing that has happened that was unexpected. This is just the sort of thing that makes people feel uneasy. Instinctively, we want to be able to say “We now understand what has happened, and we are taking the appropriate steps to make sure that this never happens again.”

But here’s the thing. Taking steps to prevent the last incident from recurring doesn’t do anything to help you deal with the next incident, because your steps will have ensured that the next one is going to be completely different. There is, however, one thing that your next incident will have in common with the last one: both of them are surprises.

We can’t predict the future, but we can get better at anticipating surprise, and dealing with surprise when it happens. Getting better at dealing with surprise is what resilience engineering is all about.

The first step is accepting that surprise is inevitable. That’s hard to do. We want to believe that we are in control of our systems, that we’ve plugged all of the holes. Sure, we may have had a problem before, but we fixed that. If we can just take the time to build it right, it’ll work properly.

Accepting that future operational surprises are inevitable isn’t natural for engineers. It’s not the way we think. We design systems to solve problems, and one of the problems is staying up. We aren’t fatalists.

However, once we do accept that operational surprise is inevitable, we can shift our thinking of the system from the computer-based system to the broader socio-technical system that includes both the people and the computers. The solution space here looks very different, because we aren’t used to thinking about designing systems where people are part of the system, especially when we engineers are part of the system we’re building!

But if we want the ability to handle things the future is going to throw at us, then we need to get better at dealing with surprise. Computers are lousy at this: they can’t adapt to situations they weren’t designed to handle. But people can.

In this frame, accepting that operational surprises are inevitable isn’t fatalism. Building adaptive capacity to deal with future surprises is how we tune to the future.

Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

John Graham-Cumming, Cloudflare’s CTO, wrote a detailed writeup of a Cloudflare incident that happened on 2019-07-02. Here’s a categorization similar to the one I did for the Stripe outage.

Note that Graham-Cumming has a “What went wrong” section in his writeup where he explicitly enumerates 11 different contributing factors; I’ve sliced things a little differently here: I’ve taken some of those verbatim, reworded some of them, and left out some others.

All quotes from the original writeup are in italics.

Contributing factors

Remember not to think of these as “causes” or “mistakes”. They are merely all of the things that had to be true for the incident to manifest, or for it to be as severe as it was.

Regular expression led to catastrophic backtracking

A regular expression used in a firewall engine rule resulted in catastrophic backtracking:


Simulated rules run on same nodes as enforced rules

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

Failure mode prevented access to internal services

But getting to the global WAF [web application firewall] kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel … And we couldn’t get to other internal services like Jira or the build system.

Security feature disables credentials for infrequent use for an operator interface

[O]nce we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently

Bypass mechanisms not frequently used

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

WAF changes are deployed globally

The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats … Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

The fact that WAF changes can only be done globally exacerbated the incident by increasing the size of the blast radius.

WAF implemented in Lua, which uses PCRE

Cloudflare makes use of Lua extensively in production … The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression.

The regular expression engine being used didn’t have complexity guarantees.

Based on the writeup, it sounds like they used the PCRE regular expression library because it was the regex library available to the Lua code that implements the WAF.

Protection accidentally removed by a performance improvement refactor

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

Cloudflare dashboard and API are fronted by the WAF

Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

Mitigators

Paging alert quickly identified a problem

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Engineers recognized high severity based on alert pattern

This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

Existence of a kill switch

At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

Risks

Declarative program performance is hard to reason about

Regular expressions are examples of declarative programs (SQL is another good example of a declarative programming language). Declarative programs are elegant because you can specify what the computation should do without needing to specify how the computation can be done.

The downside is that it’s impossible to look at a declarative program and understand the performance implications, because there isn’t enough information in a declarative program to let you know how it will be executed! You have to be familiar with how the interpreter/compiler works to understand the performance implications of a declarative program. Most programmers probably don’t know how regex libraries are implemented.
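To make this concrete, here is a minimal sketch in Python. The pattern `(a+)+$` is a textbook pathological case that I picked for illustration; it is not the rule from the Cloudflare writeup. The function `backtracking_attempts` is a simplified model of how a backtracking matcher explores that pattern, not PCRE’s actual algorithm.

```python
import re

# Illustrative pathological pattern (an assumption, not Cloudflare's rule).
PATTERN = re.compile(r"(a+)+$")


def backtracking_attempts(n: int) -> int:
    """Count the group boundaries a naive backtracking matcher tries for
    (a+)+$ against 'a' * n + 'b' before it concludes there is no match."""
    attempts = 0

    def try_from(pos: int) -> None:
        nonlocal attempts
        # The inner a+ may consume any non-empty run of the remaining a's.
        # (A greedy engine tries the longest run first; the order does not
        # change how many alternatives get explored.)
        for end in range(pos + 1, n + 1):
            attempts += 1
            if end < n:
                try_from(end)  # the outer + repeats the group
            # When end == n the matcher sees 'b', '$' fails, and it backtracks.

    try_from(0)
    return attempts


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, backtracking_attempts(n))
    # Python's own backtracking engine reaches the same verdict: no match.
    print(PATTERN.search("a" * 12 + "b"))
```

In this model the count is 2**n - 1, so every extra character roughly doubles the work. Nothing in the text of the pattern tells you that; you only see it once you know how the engine executes it.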

Simulating in production environment

For rule-based systems, it’s enormously valuable for an engineer to be able to simulate what effect the rules will have before they’re put into effect, as it is generally impossible to reason about their impacts without doing simulation.

The more realistic the simulation is, the more confidence we have that the results of the simulation will correspond to the actual results when the rules are enabled in production.

However, there is always a risk of doing the simulation in the production environment, because the simulation is a type of change, and all types of change carry some risk.
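Here is a minimal sketch of why simulate mode is still a change to production. The `Rule`, `Mode`, and `evaluate` names are hypothetical and say nothing about how Cloudflare’s WAF is actually structured; the point is only that the rule’s pattern has to execute in both modes.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SIMULATE = "simulate"
    ENFORCE = "enforce"


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    mode: Mode


@dataclass
class Verdict:
    matched: bool
    blocked: bool


def evaluate(rule: Rule, request_body: str) -> Verdict:
    # The pattern runs in both modes, so its CPU cost is paid either way.
    matched = rule.pattern.search(request_body) is not None
    # Only the enforcement decision depends on the mode.
    blocked = matched and rule.mode is Mode.ENFORCE
    return Verdict(matched=matched, blocked=blocked)


if __name__ == "__main__":
    rule = Rule("xss-test", re.compile(r"<script"), Mode.SIMULATE)
    print(evaluate(rule, "<script>alert(1)</script>"))
```

A simulated rule never blocks a request, but an expensive pattern costs exactly as much CPU as it would if it were enforced.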

Severe outages happen infrequently

[S]ome members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently.

To get to [internal services] we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

The irony is that when we only encounter severe outages infrequently, we don’t have the opportunity to exercise the muscles we need to use when these outages do happen.

Large blast radius

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

In the future, it sounds like non-emergency rule changes will be staged at Cloudflare. But the functionality will still exist for global changes, because it needs to be there for emergency rule changes. They can reduce the number of changes that need to get pushed globally, but they can’t drive it down to zero. This is an inevitable risk tradeoff.
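The tradeoff can be sketched as a rollout policy with two paths. The stage names and the `emergency` flag below are my own assumptions for illustration; the writeup doesn’t describe what Cloudflare’s staged process for rules will look like.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stage names; not taken from the Cloudflare writeup.
STAGED_PATH = ["canary", "single_region", "global"]
EMERGENCY_PATH = ["global"]


@dataclass
class RuleChange:
    rule_id: str
    emergency: bool


def rollout_plan(change: RuleChange) -> List[str]:
    """Return the ordered stages a rule change passes through."""
    return list(EMERGENCY_PATH if change.emergency else STAGED_PATH)


def first_stage(change: RuleChange) -> str:
    """The first stage bounds the blast radius of a bad change."""
    return rollout_plan(change)[0]


if __name__ == "__main__":
    print(rollout_plan(RuleChange("xss-tweak", emergency=False)))
    print(rollout_plan(RuleChange("zero-day-block", emergency=True)))
```

The emergency path is still there, and a bad rule pushed through it would have the same global blast radius as before. Staging only shrinks the set of changes that take that path.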

Questions

Why hasn’t this happened before?

You’re not generally supposed to ask “why” questions, but I can’t resist this one. Why did it take so long for this failure mode to manifest? Hadn’t any of the engineers at Cloudflare previously written a rule that used a regex with pathological backtracking behavior? Or was it that refactor that removed their protection from excessive CPU load in the case of regex backtracking?

What was the motivation for the refactor?

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

What was the reason they were trying to make the WAF use less CPU? Were they trying to reduce the cost by running on fewer nodes? Were they just trying to run cooler to reduce the risk of running out of CPU? Was there some other rationale?

What’s the history of WAF rule deployment implementation?

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout. [emphasis added]

The WAF rule system is designed to support quickly pushing out rules globally to protect against new attacks. However, not all of the rules require quick global deployment. Yet, this was the only mechanism that the WAF system supported, even though code changes support staged rollout.

The writeup simply mentions this as a contributing factor, but I’m curious as to how the system came to be that way. For example, was it originally designed with only quick rule deployment in mind? Were staged code deploys introduced into Cloudflare only after the WAF system was built?

Other interesting notes

The WAF rule update was normal work

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. 

Based on the writeup, it sounds like this was a routine change. It’s important to keep in mind that incidents often occur as a result of normal work.

Multiple sources of evidence to diagnose CPU issue

The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble.

It was interesting to read how they triangulated on high CPU usage using multiple data sources.

Normative language in the writeup

Emphasis added in bold.

We know how much this hurt our customers. We’re ashamed it happened.

The rollback plan required running the complete WAF build twice taking too long.

The first alert for the global traffic drop took too long to fire.

We didn’t update our status page quickly enough.

Normative language is one of the three analytical traps in accident investigation. If this was an internal writeup, I would avoid the language criticizing the rollback plan, the alert configuration, and the status page decision, and instead I’d ask questions about how these came to be, such as:

Was this the first time the rollback plan was carried out? (If so, that may explain the reason why it wasn’t known how long it would take).

Is the global traffic drop alert configuration typical of the other alerts, or different? If it’s different (i.e., other alerts fire faster?) what led to it being different? If it’s similar to other alert configurations, that would explain why it was configured to be “too long”.

Work to reduce CPU usage contributed to excessive CPU usage

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

These sorts of unintended consequences are endemic when making changes within complex systems. It’s an important reminder that the interventions we implement to prevent yesterday’s incidents from recurring may contribute to completely new failure modes tomorrow.

Contributors, mitigators & risks: Stripe 2019-07-10 outage

Stripe’s CTO, David Singleton, did a detailed narrative writeup of the incident they had on 2019-07-10. I love narrative descriptions of incidents, and there’s a ton of great detail here.

As an exercise, using the writeup, I collected some aspects of the incident into the following sections:

Contributing factors: what were all of the conditions that had to be present for the outage to happen, or for it to be as severe as it was? It’s important not to think of these as causes, or mistakes, or even bad things.

Mitigators: What kept the incident from being worse than it was?

Risks: What are the more general risks that this incident reveals?

In an ideal world, I’d talk to the people involved directly to get more details, but we work with what we have. I don’t summarize the incident here, so I recommend reading the Stripe writeup first.

Text in italics is copied verbatim from the writeup.

Contributing factors

Minor database version upgrade

Three months ago, we upgraded our databases to a new minor version. … [T]he new version … introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes.

One shard had multiple stalled nodes

Two nodes became stalled for yet-to-be-determined reasons. … a subtle fault in the database’s failover system … only manifested in the presence of multiple stalled nodes. On the day of the events, one shard was in the specific state that triggered this fault, and the shard was unable to elect a new primary.

Stalled nodes reported as healthy

 These [stalled] nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The database nodes support health checks, but these health checks did not detect the problem. We aren’t provided with any more details about the health check failure mode.

Database writes time out when shard has no elected primary

Without a primary, the shard was unable to accept writes. Applications that write to the shard began to time out. 

Problem manifested on a critical shard

Stripe splits data by kind into different database clusters and by quantity into different shards. Because of widespread use of this shard across applications, including the API, the unavailability of this shard … cascaded into a severe API degradation.

Based on this description, it sounds like this incident would have been much less severe if the problem had manifested on a shard other than this one.

We don’t know how it was that this particular shard was the one where the nodes stalled. It might just be bad luck. Sometimes, that’s the only difference between an incident and a surprise.

Timeouts lead to compute resource starvation

Applications that write to the shard began to time out. Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API

Novel, complex failure mode

  • [2019-07-10 16:36 UTC] Our team was alerted and we began incident response.
  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.

Because this was a complex failure mode that we had not previously experienced, we needed to diagnose the underlying cause and determine the steps to remediate.

The language in the writeup suggests that the complexity and novelty of the failure mode made it more difficult for them to diagnose the problem. However, the timeline suggests that it took them about 14 minutes to figure out that the database was in a bad state. That sounds pretty good to me(!).

Remediation required restarting database cluster

  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.
  • [2019-07-10 17:00 UTC] We restarted all nodes in the database cluster, resulting in a successful election.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.

Our team identified forcing the election of a new primary as the fastest remediation available, but this required restarting the database cluster. 

The “but” in the sentence above suggests that restarting the database cluster was not an ideal remediation strategy, but there isn’t a rationale given for why it isn’t. It’s not clear from the timeline how long it took to reboot all of the database nodes: it looks like it could be 2 minutes, which sounds pretty quick to me.

Rolling back database version as remediation strategy

  • [2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

After mitigating user impact, we investigated the root cause and identified a likely code path in a new version of the database’s election protocol. We decided to revert to the previous known stable version for all shards of the impacted cluster.  We deployed this change within four minutes, and until 21:14 UTC the cluster was healthy.

In the moment, rolling back the database version was clearly the rational action to take. Unfortunately for Stripe … well, see the next contributing factor.

(Also, 4 minutes sounds pretty quick for reverting that database version!)

Recent configuration change to affected shards

 [T]he second period of degradation had a different cause: our revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. This interaction resulted in CPU starvation on all affected shards.

We don’t have additional information about this configuration change: presumably it happened after the new version of the database had been deployed.

Second, novel failure mode had same symptoms as the first failure mode

We initially assumed that the same issue had reoccurred on multiple shards, as the symptoms appeared the same as the earlier event. We therefore followed the same mitigation playbook that succeeded earlier.


Mitigators
Unfortunately, there’s not much detail in the writeup to identify the mitigating factors that were in play here, which is a shame, because it sounds like Stripe was able to employ a lot of expertise in order to effectively diagnose and remediate the problems they encountered. And there’s just as much that we can learn from what went right.

Monitoring quickly detected a database problem

Automated monitoring detected the failed election within a minute.

Quick engagement

We began incident response within two minutes.


Risks
Gray failure (sensor problem)

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The stalled database nodes passed their health checks. This is a classic example of a gray failure, where there’s some internal failure but it isn’t detected by the internal failure detector.

Gray failures are pernicious because it’s very difficult (perhaps impossible?) to design a system that can handle failures that it cannot detect. These can also be hard to diagnose, because some of the sensors we are depending on to tell us about the state of the world are not giving us the complete story. We have to depend on integrating multiple sources of data, none of which are ever completely reliable.
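As a rough illustration of integrating more than one signal, here is a sketch of a check that treats a node as suspect when it passes its active probe but has stopped reporting a metric. The field names and the 60-second threshold are hypothetical; the writeup doesn’t describe Stripe’s health checks in any detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeSignals:
    active_check_ok: bool  # did the node answer the active health probe?
    seconds_since_lag_metric: Optional[float]  # None if it never reported


def assess(signals: NodeSignals, max_metric_age_s: float = 60.0) -> str:
    """Combine two imperfect signals into 'healthy', 'suspect' or 'unhealthy'."""
    if not signals.active_check_ok:
        return "unhealthy"
    age = signals.seconds_since_lag_metric
    if age is None or age > max_metric_age_s:
        # The probe passes but the node has gone quiet: the signals disagree.
        return "suspect"
    return "healthy"


if __name__ == "__main__":
    print(assess(NodeSignals(active_check_ok=True, seconds_since_lag_metric=5.0)))
    print(assess(NodeSignals(active_check_ok=True, seconds_since_lag_metric=900.0)))
```

This doesn’t make gray failures go away. The staleness check is just another sensor with its own blind spots, and somebody still has to decide what to do with a “suspect” node.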

Service with many dependents goes latent

Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API and cascaded into a severe API degradation.

It’s very difficult to reason about how a distributed system behaves when one of the services goes latent (that’s one of the value propositions of chaos engineering approaches, like ChAP). When a service has many dependents, a latency increase can ripple across the system with dire consequences.
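A back-of-the-envelope way to see how timeouts starve compute is Little’s law: the average number of busy workers equals the request rate multiplied by the average time each request holds a worker. All of the numbers below are made up for illustration; none of them come from the Stripe writeup.

```python
def mean_latency(normal_s: float, slow_s: float, slow_fraction: float) -> float:
    """Average time per request when a fraction of requests hit the slow path."""
    return (1.0 - slow_fraction) * normal_s + slow_fraction * slow_s


def workers_in_use(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return requests_per_second * seconds_per_request


if __name__ == "__main__":
    POOL_SIZE = 200   # hypothetical worker pool
    RPS = 1000.0      # hypothetical request rate
    NORMAL_S = 0.05   # hypothetical healthy latency
    TIMEOUT_S = 5.0   # hypothetical write timeout

    healthy = workers_in_use(RPS, NORMAL_S)
    # Suppose 10% of requests write to the unavailable shard and wait for the timeout.
    degraded = workers_in_use(RPS, mean_latency(NORMAL_S, TIMEOUT_S, 0.10))
    print(f"healthy: {healthy:.0f} busy workers of {POOL_SIZE}")
    print(f"degraded: {degraded:.0f} busy workers of {POOL_SIZE}")
```

With these invented numbers, a slow path that only one request in ten touches pushes demand from about 50 workers to about 545, well past the pool of 200. Requests that never touch the bad shard end up queueing behind the ones that do.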

Interaction vulnerabilities that involve rare events

As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers. However, the new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. 

Phased rollouts are a great way to build confidence when rolling out new changes. However, in some cases the condition that will trigger a failure mode doesn’t occur often enough to be caught during a phased rollout process.

In this case, the triggering event was when multiple nodes were stalled. That was an uncommon enough event that it didn’t happen during the phased deployment.
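A small calculation shows why a rare trigger can easily slip through. Assume, purely for illustration, that the triggering condition shows up independently on any given day with some small probability. Both the probability and the independence assumption are mine, not Stripe’s.

```python
def prob_never_triggered(p_per_day: float, days: int) -> float:
    """Probability that an independent daily event with probability p_per_day
    does not occur at all during the given number of days."""
    return (1.0 - p_per_day) ** days


if __name__ == "__main__":
    # Hypothetical: a 0.2% daily chance of multiple stalled nodes on a shard,
    # observed over roughly three months.
    print(f"{prob_never_triggered(0.002, 90):.2f}")
```

Under those assumptions there is better than an 80% chance of going three months without ever seeing the trigger, so a clean rollout history says little about failure modes that depend on it.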

Remediations can introduce new failure modes

Incident reviews generally produce action items that are intended to ensure that the same problem doesn’t recur. The risk with these remediations is that they introduce entirely new problems. In this particular case, the database rollback, a remediation action item, introduced a new failure mode.

We should remediate known problems! But we should also always be mindful that focusing too much on “let’s make sure this failure mode can never happen again” can crowd out questions like “how might these proposed remediations lead to new failure modes?”

(I don’t fault Stripe for their actions in this case: I’m quite certain I would have taken the same action as they did in rolling back the database version to the last known good one).

Stripe outlined the following remediation actions going forward.

We are also introducing several changes to prevent failures of individual shards from cascading across large fractions of API traffic. This includes additional circuit-breaking on failed operations to particular clusters, including the one implicated in these events. We will also pursue additional fault isolation techniques to contain the impact of a single failed shard and limit resource consumption by clients attempting repeated retries of failed requests.

It’s not hard to imagine that these new circuit breaker, fault isolation, and resource consumption limiting strategies may introduce new and even more complex failure modes.
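For readers who haven’t worked with one, here is a minimal sketch of a circuit breaker. This is a generic textbook version with hypothetical names and thresholds, not Stripe’s implementation.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt the operation."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of tying up a worker on a likely timeout.
                raise CircuitOpenError("cluster marked unavailable")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Even this toy version adds state, thresholds, and timing behavior that interact with everything around it. A threshold that is too low, for example, will cut off a healthy cluster after a brief blip.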

Same symptoms, different problem

  • [2019-07-10 16:35 UTC] The first period of degradation started when the primary node for the database cluster failed.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
  • [2019-07-10 21:14 UTC] We observed high CPU usage in the database cluster. The Stripe API started returning errors for users, marking the start of a second period of severe degradation.

Anyone who has done operations work before will tell you that if you get paged the same day with the same symptoms, you are going to assume it is a reoccurrence of the issue you just remediated. And, usually, it is. But sometimes it isn’t, and that’s what happened in this case.

Rollback leads to unexpected interaction

You can never really roll back a distributed system to a previous state. A rollback, like any other kind of change, can have unexpected consequences due to interactions with other parts of the system that have since changed. It’s easy to forget this, especially since rollbacks are usually effective as a remediation strategy!

Unanswered questions

Here are some questions I had that aren’t addressed in the writeup.

What was the rationale for the original migration?

The writeup doesn’t describe the rationale for upgrading the databases to a new minor version in the first place. Was it to fix an ongoing issue? To leverage a new feature? Good hygiene in keeping versions up to date?

How did they identify that nodes were stalled?

Did the identification of stalled nodes happen in-the-moment, or was this part of post-incident investigation? How did they diagnose that nodes were stalled?

How did they diagnose the failure mode?

How did they figure out that the failure mode was an interaction between stalled nodes and the new version of the database?

Were there risks associated with restarting the entire database cluster?

In hindsight, restarting the database cluster was the right thing to do. How did things look in the moment?

What database are they using?

The writeup doesn’t say which database they were using, and which versions they were running and upgraded to.

What was the problem with the database node health checks?

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

What led to the database node health checks reporting healthy for stalled nodes?

How did the two database nodes stall?

Two nodes became stalled for yet-to-be-determined reasons.

Stripe didn’t know the answer to this question at the time of the writeup.

What was the nature of the configuration change?

[O]ur revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. 

Stripe doesn’t provide any details on the nature of this configuration change. What was changed? What was the rationale for the change? When did it happen?

How did the configuration change interact with the database version rollback?

This interaction resulted in CPU starvation on all affected shards.

The writeup doesn’t provide any details on the nature of the CPU starvation other than what is written above.

How did they diagnose the configuration change as a contributing factor?

Once we observed the CPU starvation, we were able to investigate and identify the root cause.

How did they investigate this? Where did they look? How did they trace it back to the config change?

Final thought

The database upgrade was normal work

Three months ago, we upgraded our databases to a new minor version. As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers.

It doesn’t sound like there was anything atypical about the way they rolled out the new database version. Incidents often happen as a result of normal work! Even more often, though, incidents don’t happen as a result of normal work.

For the rollback, my impression from the writeup is that it was rolled out more quickly than normal database version changes, as a remediation for a known problem for a critical database shard.

What Deming got wrong

One of my Father’s Day presents this year was The Essential Deming, an anthology of Deming’s shorter writings. I thoroughly enjoyed Deming’s Out of the Crisis and was looking to read more from him.

Reading this book, I was surprised to discover that Deming was opposed to workers training workers, which he considered a faulty practice. The most effective way to become an expert is through the apprenticeship model, where a novice works alongside an expert and directly observes how the expert does their work. That Deming would reject this model, and would believe that an outsider could more effectively train a worker than someone who actually does the day-to-day work is, frankly, bizarre.

Deming also asserted that the most effective teachers at a university were the professors who were the best researchers. This also seemed to me to be an odd claim, and one I’m extremely skeptical of.

Deming had a deep understanding of systems thinking, and the importance of holistic, expert judgment (he used the term leadership) over chasing metrics, and that shines through in this book. However, while Deming seemed to recognize the value of expertise, he did not seem to have a good understanding of how people acquire it.

Why incidents can’t be monocausal

When an incident happens, the temptation is strong to identify a single cause. It’s as if the system is a chain, and we’re looking for the weak link that was responsible for the chain breaking. But, in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system, how many opportunities there are for problems (bugs, typo in config, bad data, …). If any one of these was enough to take down the system, then it would be like a house of cards, falling down all of the time.

What happens in successful organizations is that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.

Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “what was the root cause for the success of this feature?”

So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.

For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.

Postmodern engineering

When I was younger, I wanted to be a physicist. I ended up majoring in computer engineering, because I also wanted gainful employment, but my heart was always in physics, and computer engineering seemed like a good compromise between my love of physics and early interest in computers.

I didn’t think too deeply about the philosophy of science back then, but my beliefs were in line with the school of positivism. I believed there was a single underlying reality, the nature of this reality was potentially knowable, and science was an effective tool for understanding that reality. I was vaguely aware of the postmodernist movement, but mostly by reading about the Sokal hoax, where the physicist Alan Sokal had demonstrated that postmodernism was nonsense.

Around the same time, I also read To Engineer is Human: the Role of Failure in Successful Design by the civil engineering researcher Henry Petroski. The book is a case study on how civil engineering advanced through understanding structural failures. Success, on the other hand, teaches the engineer nothing.

Many years later, I find myself operationally a postmodernist (although constructivist might be a more accurate term). When I study how incidents happen, I no longer believe that there is a single, underlying reality of what really happened that we can access. Instead, I believe that the best we can do is construct narratives based on the perspectives of the different people that were involved in the incident. These narratives will inevitably be partial, and some of them may conflict. And there are things that we will never really know or understand. In addition, contra Petroski, I also believe that we can learn from studying successes as well as from studying failure.

I suspect that most engineers are steeped in the positivist tradition of thinking as well. This change in perspective is a big one: I’m not even sure how my own thinking evolved over time, and so I don’t know how to encourage this shift in others. But I do believe that if we want to learn as much as we can from incidents, we need to work on changing how our fellow engineers think about what is knowable. And that’s a tall order.

Root cause: line in Shakespearean play

News recently broke about the crash of Ethiopian Airlines Flight 302. This post is about a different plane crash, Eastern Airlines Flight 375, in 1960. Flight 375 crashed on takeoff from Logan airport in Boston when it flew into a flock of birds. More specifically, in the words of Michael Kalafatas, it “slammed into a flock of ten thousand starlings”.

The starling isn’t native to North America. An American drug manufacturer named Eugene Schieffelin made multiple attempts to bring over different species of bird to the U.S. Many of his efforts failed, but he was successful at bringing starlings over from Europe, releasing sixty European starlings in 1890 and another forty in 1891. Nate Dimeo recounts the story of the release of the sixty starlings in New York’s Central Park in episode 138 of The Memory Palace podcast.

Schieffelin’s interest included starlings because he wanted to bring over all of the birds mentioned in Shakespeare plays. The starling is mentioned only once in Shakespeare’s works: in Henry IV, Part I, in a line uttered by Sir Henry Percy:

Nay, I will; that’s flat: 
He said he would not ransom Mortimer; 
Forbad my tongue to speak of Mortimer;
But I will find him when he lies asleep, 
And in his ear I’ll holla ‘Mortimer!’ 
I’ll have a starling shall be taught to speak 
Nothing but ‘Mortimer,’ and give it him
To keep his anger still in motion.

The story is a good example of the problems of using causal language to talk about incidents. I doubt an accident investigation report would list “line in 16th-century play” as a cause. And, yet, if Shakespeare had not included that line in the play, or had substituted a different bird for a starling, the accident would not have happened.

Of course, this type of counterfactual reasoning isn’t useful at all, but that’s exactly the point. Whenever we start with an incident, we can always go further back in time and play “for want of a nail”: the place where we stop is determined by factors such as time constraints of the investigation and available information. Neither of those factors is a property of the incident itself.

William Shakespeare didn’t cause Flight 375 to crash, because “causes” don’t exist in the world. Instead, we construct causes when we look backwards from incidents. We do this because of our need to make sense of the world. But the world is a messy, tangled web of interactions. Those causes aren’t real. It’s only by moving beyond the notion of causes that we can learn more about how those incidents came to be.

The danger of “insufficient virtue”

Nate Dimeo hosts a great storytelling podcast called The Memory Palace, where each episode is a short historical vignette. Episode 316: Ten Fingers, Ten Toes is about how people have tried to answer the question: “why are the bodies of some babies drastically different from the bodies of all others?”

The stories in this podcast usually aren’t personal, but this episode is an exception. Dimeo recounts how his great-aunt, Anna, was born without fingers on her left hand. Anna’s mother (Dimeo’s great-grandmother) blamed herself: when pregnant, she had been startled by a salesman knocking on the back door, and had bitten her knuckles. She had attributed the birth defect to her knuckle-biting.

We humans seem to be wired to attribute negative outcomes to behaving insufficiently virtuously. This is particularly apparent in the writing style of many management books. Here are some quotes from a book I’m currently reading.

For years, for example, American manufacturers thought they had to choose between low cost and high quality… They didn’t realize that they could have both goals, if they were willing to wait for one while they focused on the other.

Whenever a company fails, people always point to specific events to explain the “causes” of the failure: product problems, inept managers, loss of key people, unexpectedly aggressive competition, or business downturns. Yet, the deeper systemic causes for unsustained growth go unrecognized.

Why wasn’t that balancing process noticed? First, WonderTech’s financially oriented top management did not pay much attention to their delivery service. They mainly tracked sales, profits, return on investment, and market share. So long as these were healthy, delivery times were the least of their concerns.

Such litanies of “negative visions” are sadly commonplace, even among very successful people. They are the byproduct of a lifetime of fitting in, of coping, of problem solving. As a teenager in one of our programs once said, “We shouldn’t call them ‘grown ups’ we should call them ‘given ups.’”

Peter Senge, The Fifth Discipline

In this book (The Fifth Discipline), Senge associates the principles he is advocating for (e.g., systems thinking, personal mastery, shared vision) with virtue, and the absence of these principles with vice. The book is filled with morality tales of the poor fates of companies due to insufficiently virtuous executives, to the point where I feel like I’m reading Goofus and Gallant comics.

This type of moralized thinking, where poor outcomes are caused by insufficiently virtuous behavior, is a cancer on our ability to understand incidents. It’s seductive to blame an incident on someone being greedy (an executive) or sloppy (an operator) or incompetent (a software engineer). Just think back to your reactions to incidents like the Equifax Data Breach or the California wildfires.

The temptation to attribute responsibility when bad things happen is overwhelming. You can always find greed, sloppiness, and incompetence if that’s what you’re looking for. We need to fight that urge. When trying to understand how an incident happened, we need to assume that all of the people involved were acting reasonably given the information they had at the time. It means the difference between explaining incidents away, and learning from them.

(Oh, and you’ll probably want to check out the Field Guide to Understanding ‘Human Error’ by Sidney Dekker).