Models, models every where, so let’s have a think

If you’re a regular reader of this blog, you’ll have noticed that I tend to write about two topics in particular:

  1. Resilience engineering
  2. Formal methods

I haven’t found many people who share both of these interests.

At one level, this isn’t surprising. Formal methods people tend to have an analytic outlook, and resilience engineering people tend to have a synthetic outlook. You can see the clear distinction between these two perspectives in the transcript of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology. Lamport is clearly on the side of logic, so much so that he ridicules the very idea of taking a biological perspective on software systems. By contrast, resilience engineering types actively look to biology for inspiration on understanding resilience in complex adaptive systems. A great example of this is the late Richard Cook’s talk on The Resilience of Bone.

And yet, the two fields both have something in common: they both recognize the value of creating explicit models of aspects of systems that are not typically modeled.

You use formal methods to build a model of some aspect of your software system, in order to help you reason about its behavior. A formal model of a software system is a partial one, typically only a very small part of the system. That’s because it takes effort to build and validate these models: the larger the model, the more effort it takes. We typically focus our models on a part of the system that humans aren’t particularly good at reasoning about unaided, such as concurrent or distributed algorithms.

The act of creating an explicit model and observing its behavior with a model checker gives you a new perspective on the system being modeled, because the explicit modeling forces you to think about aspects that you likely wouldn’t have considered. You won’t say “I never imagined X could happen” when building this type of formal model, because it forces you to explicitly think about what would happen in situations that you can gloss over when writing a program in a traditional programming language. While the scope of a formal model is small, you have to exhaustively specify the thing within the scope you’ve defined: there’s no place to hide.

Resilience engineering is also concerned with explicit models, in two different ways. In one way, resilience engineering stresses the inherent limits of models for reasoning about complex systems (cf. itsonlyamodel.com). Every model is incomplete in potentially dangerous ways, and every incident can be seen through the lens of model error: some model that we had about the behavior of the system turned out to be incorrect in a dangerous way.

But beyond the limits of models, what I find fascinating about resilience engineering is the emphasis on explicitly modeling aspects of the system that are frequently ignored by traditional analytic perspectives. Two kinds of models that come up frequently in resilience engineering are mental models and models of work.

A resilience engineering perspective on an incident will look to make explicit aspects of the practitioners’ mental models, both in the events that led up to that incident, and in the response to the incident. When we ask “How did the decision make sense at the time?”, we’re trying to build a deeper understanding of someone else’s state of mind. We’re explicitly trying to build a descriptive model of how people made decisions, based on what information they had access to, their beliefs about the world, and the constraints that they were under. This is a meta sort of model, a model of a mental model, because we’re trying to reason about how somebody else reasoned about events that occurred in the past.

A resilience engineering perspective on incidents will also try to build an explicit model of how work happens in an organization. You’ll often hear the shorthand phrase work-as-imagined vs work-as-done to get at this modeling, where it’s the work-as-done that is the model that we’re after. The resilience engineering perspective asserts that the documented process of how work is supposed to happen is not an accurate model of how work actually happens, and that the deviation between the two is generally successful, which is why it persists. From resilience engineering types, you’ll hear questions in incident reviews that try to elicit some more details about how the work really happens.

Like in formal methods, resilience engineering models only get at a small part of the overall system. There’s no way we can build complete models of people’s mental models, or generate complete descriptions of how they do their work. But that’s ok. Because, like the models in formal methods, the goal is not completeness, but insight. Whether we’re building a formal model of a software system, or participating in a post-incident review meeting, we’re trying to get the maximum amount of insight for the modeling effort that we put in.

Paxos made visual in FizzBee

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. — Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm.

In fact, [Paxos] is among the simplest and most obvious of distributed algorithms. — Leslie Lamport, Paxos Made Simple.

I was interested in exploring FizzBee more, specifically to play around with its functionality for modeling distributed systems. In my previous post about FizzBee, I modeled a multithreaded system where coordination happened via shared variables. But FizzBee has explicit support for modeling message-passing in distributed systems, and I wanted to give that a go.

I also wanted to use this as an opportunity to learn more about a distributed algorithm that I had never modeled before, so I decided to use it to model Leslie Lamport’s Paxos algorithm for solving the distributed consensus problem. Examples of Paxos implementations in the wild include Amazon’s DynamoDB, Google’s Spanner, Microsoft Azure’s Cosmos DB, and Cassandra. But it has a reputation of being difficult to understand.

You can see my FizzBee model of Paxos at https://github.com/lorin/paxos-fizzbee/blob/main/paxos-register.fizz.

What problem does Paxos solve?

Paxos solves what is known as the consensus problem. Here’s how Lamport describes the requirements for consensus.

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.

I’ve always found the term chosen here to be confusing. In my mind, it invokes some agent in the system doing the choosing, which implies that there must be a process that is aware of which value is the chosen consensus value once the choice has been made. But that isn’t actually the case. In fact, it’s possible that a value has been chosen without any one process in the system knowing what the consensus value is.

One way to verify that you really understand a concept is to try to explain it in different words. So I’m going to recast the problem to implementing a particular abstract data type: a single-assignment register.

Single assignment register

A register is an abstract data type that can hold a single value. It supports two operations: read and write. You can think of a register like a variable in a programming language.

A single assignment register is a register that can only be written to once. Once a client writes to the register, all future writes will fail: only reads will succeed. The register starts out with a special uninitialized value, the sort of thing we’d represent as NULL in C or None in Python.

If the register has been written to, then a read will return the written value.

Only one write can succeed against a single assignment register. In this example, it is the “B” write that succeeds.

Some things to note about the specification for our single assignment register:

  • The specification doesn’t say anything about which write should succeed; we only care that at most one write succeeds.
  • The write operations don’t return a value, so the writers don’t receive information about whether the write succeeded. The only way to know if a write succeeded is to perform a read.
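
To make these rules concrete, here’s a minimal sketch of a single-assignment register in plain Python (not FizzBee). It’s only an illustration of the semantics described above: the class name is mine, and I’m using None as the uninitialized value, as suggested earlier.

class SingleAssignmentRegister:
    """A register that accepts at most one successful write."""

    def __init__(self):
        self.value = None  # the special "uninitialized" value

    def write(self, v):
        # Writes return nothing, so a writer can't tell whether it succeeded;
        # the only way to find out is to do a read.
        if self.value is None:
            self.value = v
        # Any later write silently fails: the register keeps its first value.

    def read(self):
        # Returns None if nothing has been written yet, else the written value.
        return self.value

r = SingleAssignmentRegister()
r.write("x")
r.write("y")              # this write fails: the register is already assigned
assert r.read() == "x"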

Instead of thinking of Paxos as a consensus algorithm, you can think of it as implementing a single assignment register. The chosen value is the value where the write succeeds.

I used Lamport’s Paxos Made Simple paper as my guide for modeling the Paxos algorithm. Here’s the mapping between terminology used in that paper and the alternate terminology that I’m using here.

Paxos Made Simple paper | Single assignment register (this blog post)
choosing a value        | quorum write
proposers               | writers
acceptors               | storage nodes
learners                | readers
accepted proposal       | local write
proposal number         | logical clock

As a side note: if you ever wanted to practice doing a refinement mapping with TLA+, you could take one of the existing TLA+ Paxos models and see if you can define a refinement mapping to a single assignment register.

Making our register fault-tolerant with quorum write

One of Paxos’s requirements is that it is fault tolerant. That means a solution that implements a single assignment register using a single node isn’t good enough, because that node might fail. We need multiple nodes to implement our register:

Our single assignment register must be implemented using multiple nodes. The red square depicts a failed node.

If you’ve ever used a distributed database like DynamoDB or Cassandra, then you’re likely familiar with how they use a quorum strategy, where a single write or read may result in queries against multiple database nodes.

You can think of Paxos as implementing a distributed database that consists of one single assignment register, where it implements quorum writes.

Here’s how these quorum writes work (there’s a sketch after this list):

  1. The writer selects a quorum of nodes to attempt to write to: this is a set of nodes that must contain at least a majority of the cluster. For example, if the entire cluster contains five nodes, then a quorum must contain at least three.
  2. The writer attempts to write to every node in the quorum it has selected.
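
Here’s the basic quorum arithmetic as a small Python sketch (not FizzBee, and deliberately naive: it does nothing yet to enforce single assignment, which is what the rest of the algorithm is for). The function names and node names are made up for illustration.

import random

def is_majority(count, cluster_size):
    # A quorum must contain more than half of the cluster:
    # for a five-node cluster, that's at least three nodes.
    return count > cluster_size // 2

def naive_quorum_write(nodes, value):
    # Step 1: pick any subset of nodes that contains at least a majority.
    quorum_size = random.randint(len(nodes) // 2 + 1, len(nodes))
    quorum = random.sample(sorted(nodes), quorum_size)
    # Step 2: attempt to write the value to every node in the chosen quorum.
    for name in quorum:
        nodes[name] = value
    return quorum

# Five storage nodes, none of which has been written to yet.
nodes = {"alpha": None, "beta": None, "gamma": None, "delta": None, "epsilon": None}
quorum = naive_quorum_write(nodes, "x")
assert is_majority(len(quorum), len(nodes))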

In Lamport’s original paper that introduced Paxos, The Part-Time Parliament, he showed a worked out example of a Paxos execution. Here’s that figure, with some annotations that I’ve added to describe it in terms of a single assignment quorum write register.

In this example, there are five nodes in the cluster, designated by Greek letters {Α,Β,Γ,Δ,Ε}.

The number (#) column acts as a logical clock; we’ll get to that later.

The decree column shows the value that a client attempts to write. In this example, there are two different values that clients attempt to write: {α,β}.

The quorum and voters columns indicate which nodes are in the quorum that the writer selected. A square around a node indicates that the write succeeded against that node. In this example, a quorum must contain at least three nodes, though it can have more than three: the quorum in row 5 contains four nodes.

Under this interpretation, in the first row, the write operation with the argument α succeeded on node Δ: there was a local write to node Δ, but there was not yet a quorum write, as it only succeeded on one node.

While the overall algorithm implements a single assignment register, the individual nodes themselves do not behave as single assignment registers: the value written to a node can potentially change during the execution of the Paxos algorithm. In the example above, in row 27, the value β is successfully written to node Δ, which is different from the value α written to that node in row 2.

Safety condition: can’t change a majority

The write to our single assignment register occurs when there’s a quorum write: when a majority of the nodes have the same value written to them. To enforce single assignment, we cannot allow a majority of nodes to see a different written value over time.

Here’s how I expressed that safety condition in FizzBee, where written_values is a history variable that keeps track of which values were successfully written to a majority of nodes.

# Only a single value is written
always assertion SingleValueWritten:
    return len(written_values)<=1

Here’s an example scenario that would violate that invariant:

In this scenario, there are three nodes {a,b,c} and two writers. The first writer writes the value x to nodes a and b. As a consequence, x is the value written to the majority of nodes. The second writer writes the value y to nodes b and c, and so y becomes the value written to the majority of nodes. This means that the set of values written is: {x, y}. Because our single assignment register only permits one value to be registered, the algorithm must ensure that a scenario like this does not occur.
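
Here’s that scenario as a short Python walkthrough (again, plain Python rather than FizzBee; majority_value is my own helper, and written_values plays the same role as the history variable in the invariant above):

def majority_value(nodes):
    # Return the value held by a majority of the nodes, if there is one.
    values = list(nodes.values())
    for v in set(values) - {None}:
        if values.count(v) > len(nodes) // 2:
            return v
    return None

nodes = {"a": None, "b": None, "c": None}
written_values = set()

nodes["a"] = nodes["b"] = "x"              # writer 1 writes x to nodes a and b
written_values.add(majority_value(nodes))  # x is now the majority value

nodes["b"] = nodes["c"] = "y"              # writer 2 writes y to nodes b and c
written_values.add(majority_value(nodes))  # now y is the majority value

# The SingleValueWritten invariant requires len(written_values) <= 1,
# but here two different values have each held a majority.
assert written_values == {"x", "y"}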

Paxos uses two strategies to prevent writes that could change the majority:

  1. Read-before-write to prevent clobbering a known write
  2. Unique, logical timestamps to prevent concurrent writes

Read before write

In Paxos, a writer will first do a read against all of the nodes in its quorum. If any node already contains a write, the writer will use the existing written value.

In the first phase, writer 2 reads a value x from node b. In phase 2, it writes x instead of y to avoid changing the majority.
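
Here’s a rough Python sketch of the read-before-write rule (this ignores timestamps entirely; the real algorithm adopts the value with the latest timestamp, which is what the logical clocks in the next section are for):

def read_before_write(nodes, quorum, my_value):
    # Phase 1: read every node in the quorum.
    existing = [nodes[name] for name in quorum if nodes[name] is not None]
    # If any node already holds a write, adopt that value instead of our own,
    # so we can't clobber a value that may already have reached a majority.
    value = existing[0] if existing else my_value
    # Phase 2: write the (possibly adopted) value to every node in the quorum.
    for name in quorum:
        nodes[name] = value
    return value

nodes = {"a": "x", "b": "x", "c": None}  # writer 1 already wrote x to a and b
assert read_before_write(nodes, ["b", "c"], "y") == "x"  # writer 2 adopts x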

Preventing concurrent writes

The read-before-write approach works if writer 2 tries to do a write after writer 1 has completed its write. But if the writes overlap, then this will not prevent one writer from clobbering the other writer’s quorum write:

Writer 2 clobbers writer 1’s write on node b because writer 1’s write had not yet happened when writer 2 did its read.

Paxos solves this by using a logical clock scheme to ensure that only one concurrent writer can succeed. Note that Lamport doesn’t refer to it as a logical clock, but I found it useful to think of it this way.

Each writer has a local clock, and each writer’s clock is set to a different value. When a writer makes read or write calls, it passes its clock’s time as an additional argument.

Each storage node also keeps a logical clock. The storage node’s clock is updated by read calls: if the timestamp of a read call is later than the storage node’s local clock, then the node will advance its clock to match the read timestamp. The node will reject writes with timestamps that are earlier than its clock.

Node b rejects writer 1’s write

In the example above, node b rejects writer 1’s write because the write has a timestamp of 1, and node b has a logical clock value of 2. As a consequence, a quorum write only occurs when writer 2 completes its write.

Readers

The writes are the interesting part of Paxos, which is where I focused. In my FizzBee model, I chose the simplest way to implement readers: a pub-sub approach where each node publishes out each successful write to all of the readers.

A simple reader implementation is to broadcast each local write to all of the readers.

The readers then keep a tally of the writes that have occurred on each node, and when they identify a majority, they record it.

Modeling with FizzBee

For my FizzBee model, I defined three roles:

  1. Writer
  2. StorageNode
  3. Reader

Writer

There are two phases to the writes. I modeled each phase as an action. Each writer uses its own identifier, __id__, as the value to be written. This is the sort of thing you’d do when using Paxos to do leader election.

role Writer:
    action Init:
        self.v = self.__id__
        self.latest_write_seen = -1
        self.quorum = genericset()

    action Phase1:
        unsent = genericset(storage_nodes)
        while is_majority(len(unsent)):
            node = any unsent
            response = node.read_and_advance_clock(self.clock)
            (clock_advanced, previous_write) = response
            unsent.discard(node)

            require clock_advanced
            atomic:
                self.quorum.add(node)
                if previous_write and previous_write.ts > self.latest_write_seen:
                    self.latest_write_seen = previous_write.ts
                    self.v = previous_write.v

    action Phase2:
        require is_majority(len(self.quorum))
        for node in self.quorum:
            node.write(self.clock, self.v)

One thing that isn’t obvious is that there’s a variable named clock that gets automatically injected into the role when the instance is created in the top-level Init action:

action Init:
    writers = []
    ...
    for i in range(NUM_WRITERS):
        writers.append(Writer(clock=i))

This is how I ensured that each writer had a unique timestamp associated with it.

StorageNode

The storage node needs to support two RPC calls, one for each of the write phases:

  1. read_and_advance_clock
  2. write

It also has a helper function named notify_readers, which does the reader broadcast.

role StorageNode:
    action Init:
        self.local_writes = genericset()
        self.clock = -1

    func read_and_advance_clock(clock):
        if clock > self.clock:
            self.clock = clock

        latest_write = None

        if self.local_writes:
            latest_write = max(self.local_writes, key=lambda w: w.ts)
        return (self.clock == clock, latest_write)


    atomic func write(ts, v):
        # request's timestamp must not be earlier than our clock
        require ts >= self.clock

        w = record(ts=ts, v=v)
        self.local_writes.add(w)
        self.record_history_variables(w)

        self.notify_readers(w)

    func notify_readers(write):
        for r in readers:
            r.publish(self.__id__, write)

There’s a helper function I didn’t show here called record_history_variables, which I defined to record some data I needed for checking invariants, but isn’t important for the algorithm itself.

Reader

Here’s my FizzBee model for a reader. Note how it supports one RPC call, named publish.

role Reader:
    action Init:
        self.value = None
        self.tallies = genericmap()
        self.seen = genericset()

    # receive a publish event from a storage node
    atomic func publish(node_id, write):
        # Process a publish event only once per (node_id, write) tuple
        require (node_id, write) not in self.seen
        self.seen.add((node_id, write))

        self.tallies.setdefault(write, 0)
        self.tallies[write] += 1
        if is_majority(self.tallies[write]):
            self.value = write.v

Generating interesting visualizations

I wanted to generate a trace where a quorum write succeeded but not all nodes wrote the same value.

I defined an invariant like this:

always assertion NoTwoNodesHaveDifferentWrittenValues:
    # we only care about cases where consensus was reached
    if len(written_values)==0:
        return True
    s = set([max(node.local_writes, key=lambda w: w.ts).v for node in storage_nodes if node.local_writes])
    return len(s)<=1

Once FizzBee found a counterexample, I used it to generate the following visualizations:

Sequence diagram generated by FizzBee
State of the model generated by FizzBee

General observations

I found that FizzBee was a good match for modeling Paxos. FizzBee’s roles mapped nicely onto the roles described in Paxos Made Simple, and the phases mapped nicely onto FizzBee’s actions. FizzBee’s first-class support for RPC made the communication easy to implement.

I also appreciated the visualizations that FizzBee generated. I found both the sequence diagrams and the model state visualizations useful as I was debugging my model.

Finally, I learned a lot more about how Paxos works by going through the exercise of modeling it, as well as writing this blog post to explain it. When it comes to developing a better understanding of an algorithm, there’s no substitute for the act of building a formal model of it and then explaining your model to someone else.

Locks, leases, fencing tokens, FizzBee!

FizzBee is a new formal specification language, originally announced back in May of last year. FizzBee’s author, Jayaprabhakar (JP) Kadarkarai, reached out to me recently and asked me what I thought of it, so I decided to give it a go.

To play with FizzBee, I decided to model some algorithms that solve the mutual exclusion problem, more commonly known as locking. Mutual exclusion algorithms are a classic use case for formal modeling, but here’s some additional background motivation: a few years back, there was an online dust-up between Martin Kleppmann (author of the excellent book Designing Data-Intensive Applications, commonly referred to as DDIA) and Salvatore Sanfilippo (creator of Redis, and better known by his online handle antirez). They were arguing about the correctness of an algorithm called Redlock that claims to achieve fault-tolerant distributed locking. Here are some relevant links:

As a FizzBee exercise, I wanted to see how difficult it was to model the problem that Kleppmann had identified in Redlock.

Keep in mind here that I’m just a newcomer to the language writing some very simple models as a learning exercise.

Critical sections

Here’s my first FizzBee model. It models the execution of two processes, with an invariant that states that at most one process can be in the critical section at a time. Note that this model doesn’t actually enforce mutual exclusion, so I was just looking to see that the assertion was violated.

# Invariant to check
always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])
NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # critical section
        self.in_cs = True
        pass

        # after critical section
        self.in_cs = False
        pass

action Init:
    processes = []
    for i in range(NUM_PROCESSES):
        processes.append(Process())

The “pass” statements are no-ops; I just use them as stand-ins for “code that would execute before/during/after the critical section”.

FizzBee is built on Starlark, which is a subset of Python, which is why the model looks so Pythonic. Writing a FizzBee model felt like writing a PlusCal model, without the need for specifying labels explicitly, and also with a much more familiar syntax.

The lack of labels was both a blessing and a curse. In PlusCal, the control state is something you can explicitly reference in your model. This is useful when you want to specify a property about the critical section as an invariant. Because FizzBee doesn’t have labels, I had to create a separate variable called “in_cs” to be able to model when a process was in its critical section. In general, though, I find PlusCal’s label syntax annoying, and I’m happy that FizzBee doesn’t require it.

FizzBee has an online playground: you can copy the model above and paste it directly into the playground and click “Run”, and it will tell you that the invariant failed.

FAILED: Model checker failed. Invariant:  MutualExclusion

The “Error Formatted” view shows how the two processes both landed on line 17, hence violating mutual exclusion:

Locks

Next up, I modeled locking in FizzBee. In general, I like to model a lock as a set, where taking the lock means adding the id of the process to the set, because if I need to, I can see:

  • who holds the lock by the elements of the set
  • if two processes somehow manage to take the same lock (multiple elements in the set)

Here’s my FizzBee model:

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # acquire lock
        atomic:
            require not lock
            lock.add(self.__id__)

        #
        # critical section
        #
        self.in_cs = True
        pass
        self.in_cs = False

        # release lock
        lock.clear()

        # after critical section
        pass

action Init:
    processes = []
    lock = set()
    in_cs = set()
    for i in range(NUM_PROCESSES):
        processes.append(Process())

By default, each statement in FizzBee is treated atomically, and you can specify an atomic block to treat multiple statements atomically.

If you run this in the playground, you’ll see that the invariant holds, but there’s a different problem: deadlock.

DEADLOCK detected
FAILED: Model checker failed

FizzBee’s model checker does two things by default:

  1. Checks for deadlock
  2. Assumes that a thread can crash after any arbitrary statement

In the “Error Formatted” view, you can see what happened. The first process took the lock and then crashed. This leads to deadlock, because the lock never gets released.

Leases

If we want to build a fault-tolerant locking solution, we need to handle the scenario where a process fails while it owns the lock. The Redlock algorithm uses the concept of a lease, which is a lock that expires after a period of time.

To model leases, we now need to model time. To keep things simple, my model assumes a global clock that all processes have access to.

NUM_PROCESSES = 2
LEASE_LENGTH = 10


always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)

        # check that we still have the lock
        if lock.owner == self.__id__:
            # critical section
            self.in_cs = True
            pass
            self.in_cs = False

            # release the lock
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

Now the lock has an expiration time, so we don’t have the deadlock problem anymore. But the invariant is no longer always true.

FizzBee also has a neat view called the “Explorer” where you can step through and see how the state variables change over time. Here’s a screenshot, which shows the problem:

The problem is that one process can think it holds the lock, but the lock has actually expired, which means another process can take the lock, and they can both end up in the critical section.

Fencing tokens

Kleppmann noted this problem with Redlock: it is vulnerable to scenarios where a process’s execution pauses for some period of time (e.g., due to garbage collection). Kleppmann proposed using fencing tokens to prevent a process from accessing a shared resource with an expired lock.

Here’s how I modeled fencing tokens:

NUM_PROCESSES = 2
LEASE_LENGTH = 10

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

atomic action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)
            self.token = next_token
            next_token += 1

        # can only enter the critical section
        # if we have the highest token seen so far
        atomic:
            if self.token > last_token_seen:
                last_token_seen = self.token

                # critical section
                self.in_cs = True
                pass

        # after critical section
        self.in_cs = False

        # release the lock
        atomic:
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0

    next_token = 1
    last_token_seen = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

However, if you run this through the model checker, you’ll discover that the invariant is also violated!

It turns out that fencing tokens don’t protect against the scenario where two processes both believe they hold the lock, and the lower token reaches the shared resource before the higher token:

A scenario where fencing tokens don’t ensure mutual exclusion
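
Here’s that interleaving rendered in plain Python (the names are mine, not from the FizzBee model above). The shared resource accepts any request whose fencing token is higher than the highest token it has seen so far, which is the check the fencing-token scheme relies on:

last_token_seen = 0
in_resource = set()

def access_resource(process, token):
    # The resource rejects a request only if it has already seen a higher token.
    global last_token_seen
    if token > last_token_seen:
        last_token_seen = token
        in_resource.add(process)
        return True
    return False

# Process 1 acquires the lease with token 1, then pauses until its lease expires.
# Process 2 then acquires the lease with token 2.
# Process 1's delayed request reaches the shared resource *before* process 2's:
assert access_resource("p1", 1)     # accepted: 1 > 0
assert access_resource("p2", 2)     # accepted: 2 > 1, and both are now inside
assert in_resource == {"p1", "p2"}  # mutual exclusion is violated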

I reached out to Martin Kleppmann to ask about this, and he agreed that fencing tokens would not protect against this scenario.

Impressions

I found FizzBee surprisingly easy to get started with, although I only really scratched the surface here. In my case, having experience with PlusCal helped a lot, as I already knew how to write my specifications in a similar style. You can write your specs in TLA+ style, as a collection of atomic actions rather than as one big non-atomic action, but the PlusCal-style felt more natural for these particular problems I was modeling.

The Pythonic syntax will be much more familiar to programmers than PlusCal and TLA+, which should help with adoption. In some cases, though, I found myself missing the conciseness of the set notation that languages like TLA+ and Alloy support. I ended up leveraging Python’s list comprehensions, which have a set-builder-notation feel to them.

Newcomers to formal specification will still have to learn how to think in terms of TLA+ style models: while FizzBee looks like Python, conceptually it is like TLA+, a notation for specifying a set of state-machine behaviors, which is very different from a Python program. I don’t know what it will be like for learners.

I was a little bit confused by FizzBee’s default behavior of a thread being able to crash at any arbitrary point, but that’s configurable, and I was able to use it to good effect to show deadlock in the lock model above.

Finally, while I read Kleppmann’s article years ago, I never noticed the issue with fencing tokens until I actually tried to model it explicitly. This is a good reminder of the value of formally specifying an algorithm. I fooled myself into thinking I understood it, but I actually hadn’t. It wasn’t until I went through the exercise of modeling it that I discovered something about its behavior that I hadn’t realized before.

Resilience: some key ingredients

Brian Marick posted on Mastodon the other day about resilience in the context of governmental efficiency. Reading that inspired me to write about some more general observations about resilience.

Now, people use the term resilience in different ways. I’m using resilience here in the following sense: how well a system is able to cope when it is pushed beyond its limits. Or, to borrow a term from safety researcher David Woods, when the system is pushed outside of its competence envelope. The technical term for this sense of the word resilience is graceful extensibility, which also comes from Woods. This term is a marriage of two other terms: graceful degradation, and software extensibility.

The term graceful degradation refers to the behavior of a system which, when it experiences partial failures, can still provide some functionality, even though it’s at a reduced fidelity. For example, for a web app, this might mean that some particular features are unavailable, or that some percentage of users are not able to access the site. Contrast this with a system that just returns 500 errors for everyone whenever something goes wrong.

We talk about extensible software systems as ones that have been designed to make it easy to add new features in the future that were not originally anticipated. A simple example of software extensibility is the ability for old code to call new code, with dynamic binding being one way to accomplish this.
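
As a toy illustration of old code calling new code via dynamic binding, here’s a minimal Python sketch (the handler registry is my own example, not from any particular system): the dispatcher below is “old” code written with no knowledge of the handler that gets registered later.

# "Old" code: a dispatcher shipped before any handlers existed.
handlers = {}

def handle(event_type, payload):
    # Dynamic dispatch: look up whatever handler is registered at call time.
    handler = handlers.get(event_type)
    if handler is None:
        raise ValueError("no handler for " + event_type)
    return handler(payload)

# "New" code: added later, yet the old dispatcher can call it unchanged.
def handle_refund(payload):
    return "refunding order %s" % payload["order_id"]

handlers["refund"] = handle_refund

print(handle("refund", {"order_id": 42}))  # old code calling new code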

Now, putting those two concepts together, if a system encounters some sort of shock that it can’t handle, and the system has the ability to extend itself so that it can now handle the shock, and it can make these changes to itself quickly enough that it minimizes the harms resulting from the shock, then we say the system exhibits graceful extensibility. And if it can keep extending itself each time it encounters a novel shock, then we say that the system exhibits sustained adaptability.

The rest of this post is about the preconditions for resilience. I’m going to talk about resilience in the context of dealing with incidents. Note that all of the topics described below come from the resilience engineering literature, although I may not always use the same terminology.

Resources

As Brian Marick observed in his toot:

As we discovered with Covid, efficiency is inversely correlated with resilience.

Here’s a question you can ask anyone who works in the compute infrastructure space: “How hot do you run your servers?” Or, even more meaningfully, “How much headroom do your servers have?”

Running your servers “hotter” means running at a higher CPU utilization. This means that you pack more load on fewer servers, which is more efficient. The problem is that the load is variable, which means that the hotter you run the servers, the more likely your server will get overloaded if there is a spike in utilization. An overloaded server can lead to an incident, and incidents are expensive! Running your servers at maximum utilization is running with zero headroom. We deliberately run our servers with some headroom to be able to handle variation in load.

We also see the idea of spare resources in what we call failover scenarios, where there’s a failure in one resource so we switch to using a different resource, such as failing over a database from primary to secondary, or even failing out of a geographical region.

The idea of spare resources is more general than hardware. It applies to people as well. The equivalent of headroom for humans is what Tom DeMarco refers to as slack. The more loaded humans are, the less well positioned they are to handle spikes in their workload. Stuff falls through the cracks when you’ve got too much load, and some of that stuff contributes to incidents. We can even keep people in reserve for dealing with shocks, such as when an organization staffs a dedicated incident management team.

A common term that the safety people use for spare resources is capacity. I really like the way Todd Conklin put it on his Pre-Accident Investigation Podcast: “You don’t manage risk. You manage the capacity to absorb risk.” Another way he put it is “Accidents manage you, so what you really manage is the capacity for the organization to fail safely.”

Flexibility

Here’s a rough and ready definition of an incident: the system has gotten itself into a bad state, and it’s not going to return to a good state unless somebody does something about it.

Now, by this definition, for the system to become healthy again something about how the system works has to change. This means we need to change the way we do things. The easier it is to make changes to the system, the easier it will be to resolve the incident.

We can think of two different senses of changing the work of the system: the human side and the software side.

Humans in a system are constrained by a set of rules that exist to reduce risk. We don’t let people YOLO code from their laptops into production, because of the risks that doing so would expose us to. But incidents create scenarios where the risks associated with breaking these rules are lower than the risks associated with prolonging the incident. As a consequence, people in the system need the flexibility to be able to break the standard rules of work during an incident. One way to do this is to grant incident responders autonomy: let them make judgments about when they are able to break the rules that govern normal work, in scenarios where breaking the rule is less risky than following it.

Things look different on the software side, where all of the rules are mechanically enforced. For flexibility in software, we need to build into the software functionality in advance that will let us change the way the system behaves. My friend Aaron Blohowiak uses the term Jefferies tubes from Star Trek to describe features that support making operational changes to a system. These were service crawlways that made it easier for engineers to do work on the ship.

A simple example of this type of operational flexibility is putting in feature flags that can be toggled dynamically in order to change system behavior. At the other extreme is the ability to bring up a REPL on a production system in order to make changes. I’ve seen this multiple times in my career, including watching someone use the rails console command of a Ruby on Rails app to resolve an issue.

The technical term in resilience engineering for systems that possess this type of flexibility is adaptive capacity: the system has built up the ability to be able to dynamically reconfigure itself, to adapt, in order to meet novel challenges. This is where the name Adaptive Capacity Labs comes from.

Expertise

In general, organizations push against flexibility because it brings risk. In the case where I saw someone bring up a Ruby on Rails console, I was simultaneously impressed and terrified: that’s so dangerous!

Because flexibility carries risk, we need to rely on judgment as to whether the risk of leveraging the flexibility outweighs the risk of not using the flexibility to mitigate the incident. Granting people the autonomy to make those judgment calls isn’t enough: the people making the calls need to be able to make good judgment calls. And for that, you need expertise.

The people making these calls are having to make decisions balancing competing risks while under uncertainty and time pressure. In addition, how fluent they are with the tools is a key factor. I would never trust a novice with access to a REPL in production. But an expert? By definition, they know what they’re doing.

Diversity

Incidents in complex systems involve interactions between multiple parts of the system, and there’s no one person in your organization who understands the whole thing. To know what to do during an incident, you need to bring in different people who understand different parts of the system to help figure out what is happening. You need diversity in your responders: people with different perspectives on the problem at hand.

You also want diversity in diagnostic and mitigation strategies. Some people might think about recent changes, others might think about traffic pattern changes, others might dive into the codebase looking for clues, and yet others might look to see if there’s another, seemingly related problem going on right now. In addition, it’s often not obvious what the best course of action is to mitigate an incident. Responders often pursue multiple courses of action in parallel, hoping that at least one of them will bring the system back to health. A diversity of perspectives can help generate more potential interventions, reducing the time to resolve.

Coordination

Having a group of experts with a diverse set of perspectives by itself isn’t enough to deal with an incident. For a system to be resilient, the people within the system need to be able to coordinate, to work together effectively.

If you’ve ever dealt with a complex incident, you know how challenging coordination can be. Things get even hairier in our distributed world. Whether you’re physically located with all of the responders, you’re on a Zoom call (a bridge, as we still say), you’re messaging over Slack, or some hybrid combination of all three, each type of communication channel has its benefits and drawbacks.

There are prescriptive approaches to improving coordination during incidents, such as the Incident Command System (ICS). However, Laura Maguire’s research has shown that, in practice, incident responders intentionally deviate from ICS to better manage coordination costs. This is yet another example of flexibility and expertise being employed to deal with an incident.


The next time you observe an incident, or you reflect on an incident where you were one of the responders, think about the extent to which these ingredients were present or absent. Were you able to leverage spare resources, or did you suffer from not being able to? Were there operational changes that people wanted to make during the incident, and were they actually able to make them? Were the responders experienced with the sub-systems they were dealing with, and how did that shape their responses? Did different people come up with different hypotheses and strategies? Was it clear to you what the different responders were doing during the incident? These issues are easy to miss if you’re not looking for them. But, once you internalize them, you’ll never be able to unsee them.

You’re missing your near misses

FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025

The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling it is to people: they worry very specifically about that incident recurring, and want to prevent that from happening again.

Here’s the problem: most of your incidents aren’t going to be repeat incidents. Nobody wants an incident to recur, and so there’s a natural built-in mechanism for engineering teams to put in the effort to do preventative work. The real challenge is preventing and quickly mitigating novel future incidents, which make up the overwhelming majority of your incidents.

And that brings us to near misses, those operational surprises that have no actual impact, but could have been a major incident if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, omens.

Because most of our incidents are novel, and because near misses are a source of insight about novel future incidents, if we are serious about wanting to improve reliability, we should be treating our near misses as first-class entities, the way we do with incidents. Yet, I’d wager that there are no tech companies out there today that would put the same level of effort into a near miss as they would into a real incident. I’d love to hear about a tech company that holds near miss reviews, but I haven’t heard of any yet.

There are real challenges to treating near misses as first-class. We can generally afford to spend a lot of post-incident effort on each high-severity incident, because there generally aren’t that many of them. I’m quite confident that your org encounters many more near misses than it does high-severity incidents, and nobody has the cycles to put in the same level of effort for every near-miss as they do for every high severity incident. This means that we need to use judgment. We can’t use severity of impact to guide us here, because these near misses are, by definition, zero severity. We need to identify which near misses are worth examining further, and which ones to let go. It’s going to be a judgment call about how much we think we could potentially learn from looking further.

The other challenge is just surfacing these near misses. Because they are zero impact, it’s likely that only a handful of people in the organization are aware when a near miss happens. Treating near misses as first-class events requires a cultural shift in an organization, where the people who are aware of them highlight the near miss as a potential source of insight for improving reliability. People have to see the value in sharing when these happen; it has to be rewarded or it won’t happen.

These near misses are happening in your organization right now. Some of them will eventually blossom into full-blown high-severity incidents. If you’re not looking for them, you won’t see them.

The danger of overreaction

The California-based blogger Kevin Drum has a good post up today with the title Why don’t we do more prescribed burning? An explainer. There’s a lot of great detail in the post, but the bit that really jumped out at me was the history of the enormous forest fires that burned in Yellowstone National Park in 1988.

Norris Geyser Basin in Yellowstone National Park, August 20, 1988
By Jeff Henry – National Park Service archives, Public Domain

In 1988 the US Park Service allowed several lightning fires to burn in Yellowstone, eventually causing a conflagration that consumed over a million acres. Public fury was intense. In a post-mortem after the fire:

The team reaffirmed the fundamental importance of fire’s natural role but recommended that fire management plans be strengthened…. Until new fire management plans were prepared, the Secretaries suspended all prescribed natural fire programs in parks and wilderness areas.

This, in turn, made me think about the U.S. government’s effort to vaccinate the population against a potential swine flu epidemic in 1976, under the Gerald Ford administration.

Gerald Ford receiving swine flu vaccine
By David Hume Kennerly – Gerald R. Ford Presidential Library: B1874-07A, Public Domain

The vaccination effort did not go well, as recounted by the historian George Dehner in the journal article WHO Knows Best? National and International Responses to Pandemic Threats and the “Lessons” of 1976:

The Swine Flu Program was marred by a series of logistical problems ranging from the production of the wrong vaccine strain to a confrontation over liability protection to a temporal connection of the vaccine and a cluster of deaths among an elderly population in Pittsburgh. The most damning charge against the vaccination program was that the shots were correlated with an increase in the number of patients diagnosed with an obscure neurological disease known as Guillain–Barré syndrome. The program was halted when the statistical increase was detected, but ultimately the New York Times labeled the program a “fiasco” because the feared pandemic never appeared.

Fortunately, swine flu didn’t become an epidemic, but it’s easy to imagine an alternative history where the epidemic materialized. In that scenario, the U.S. population would have suffered because the vaccination program was stopped. I don’t know how this experience shaped the minds of policymakers at the U.S. Centers for Disease Control (CDC), but I can certainly imagine the memories of the swine flu “fiasco” influencing the calculus of how early to start pushing for a vaccine. After all, look what happened the last time we tried to head off a potential pandemic.

When a high-severity incident happens, its associated risk becomes salient: the incident looms large in our minds, and the fact that it just happened leads us to believe that the risk of a similar incident is very high. Suddenly, folks who normally extol the virtues of being data-driven are all too comfortable extrapolating from a single data point. But this tendency to fixate on a particular risk is dangerous, for the following two reasons:

  1. We continually face a multitude of risks, not just a single one.
  2. Risks trade off of each other.

We don’t deal with an individual risk but with a vast and ever-growing menu of risks. At best, when we focus on only one risk, we pay the opportunity cost of neglecting the others. Attention is a precious resource, and focusing our attention on one particular risk means, necessarily, that we will neglect other risks.

But it’s even worse than that. In our effort to drive down a risk that just manifested as an incident, we can end up increasing the risk of a future incident. Fire suppression is a clear example of how an action taken to reduce risk can increase risk.

As Richard Cook noted, all practitioner actions are gambles. We don’t get to choose between “more safe” and “less safe”. The decisions we make always carry risk because of the uncertainties: we just can’t predict the future well enough to understand how our actions will reshape the risks. Remember that the next time people rush to address the risks exposed by the last major incident. Because the fact that an incident just happened does not improve your ability to predict the future, no matter how severe that incident was. All of those other risks are still out there, waiting to manifest as different incidents altogether. Your actions might even end up making those future incidents worse.

Whither dashboard design?

The sorry state of dashboards

It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But my naive understanding is that the primary way an operator interacts with Honeycomb is by querying it. And it sounds like a very nifty tool for doing that: I’ve certainly felt the absence of being able to do high-cardinality queries when trying to narrow down where a problem is, and I would love to have access to a tool like that.

But we humans didn’t evolve to query our environment, we evolved to navigate it, and we have a very sophisticated visual system to help us navigate a complex world. Honeycomb does leverage the visual system by generating visualizations, but you submit the query first, and then you get the visualization.

In principle, a well-designed dashboard would engage our visual system immediately: look first, get a clue about where to look next, and then take the next diagnostic step, whether that’s explicitly querying, or navigating to some other visualization. The problem, which Charity illustrates in her tweet, is that we consistently design our dashboards poorly. Given how much information is potentially available to us, we aren’t good at designing dashboards that work well with our human brains to help us navigate all of that information.

Dashboard research of yore

Now, back in the 80s and 90s, for many physical systems that were supervised by operators (think: industrial control systems, power plants, etc.), dashboards were all they had. And there was some interesting cognitive systems engineering research back then about how to design dashboards that took into account what we knew about the human perceptual and cognitive systems.

For example, there was a proposed approach for designing user interfaces for operators called ecological interface design, by Kim Vicente and Jens Rasmussen. Vicente and Rasmussen were both engineering researchers who worked in human factors (Vicente’s background was in industrial and mechanical engineering, Rasmussen’s in electronic engineering). They co-wrote an excellent paper titled Ecological Interface Design: Theoretical Foundations. Ecological Interface Design builds on Rasmussen’s previous work on the abstraction hierarchy, which he developed based on studying how technicians debugged electronic circuits. It also builds on his skills, rules, and knowledge (SRK) framework.

More tactically, David Woods published a set of concepts for better leveraging the visual system, under the name visual momentum. These concepts include supporting check-reads (at-a-glance information), longshots, perceptual landmarks, and display overlaps. For more details, see the papers Visual Momentum: A Concept to Improve the Cognitive Coupling of Person and Computer and How Not to Have to Navigate Through Too Many Displays.

What’s the state of dashboard design today?

I’m not aware of anyone in our industry working on the “how do we design better dashboards?” question today. As far as I can tell, discussions around observability these days center more around platform-y questions, like:

  • What kinds of observability data should we collect?
  • How should we store it?
  • What types of queries should we support?

For example, here’s Charity Majors, on “Observability 2.0: How do you debug?“, on the third bullet (emphasis mine):

You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

I’d like to see our industry improve the check your instrumentation part of that to make it easier to identify if something looks off, providing cues about where to look next. To be explicit:

  1. I always want the ability to query my system in the way that Honeycomb supports, with high-cardinality drill-down and correlations.
  2. I always want to start off with a dashboard, not a query interface.

In other words, I always want to start off with a dashboard, and use that as a jumping-off point to do queries.

And, maybe there are folks out there in observability-land working on how to improve dashboard design. But, if so, I’m not aware of that work. Just looking at the schedule from Monitorama 2024, the word “dashboard” does not appear even once.

And that makes me sad. Because, while not everyone has access to tooling like Honeycomb, everyone has access to dashboards. And the state-of-the-dashboard doesn’t seem like it’s going to get any better anytime soon.

The Canva outage: another tale of saturation and resilience

Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post.

What happened at Canva

Trigger: deploying a new version of a page

The trigger for this incident was Canva deploying a new version of their editor page. It’s notable that there was nothing wrong with this new version. The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.

A stale traffic rule

Canva uses Cloudflare as their CDN. Being a CDN, Cloudflare has datacenters all over the world, which are interconnected by a private backbone. Now, I’m not a networking person, but my basic understanding of private backbones is that CDNs lease fibre-optic lines from telecom companies and use these leased lines to ensure that they have dedicated network connectivity and bandwidth between their sites.

Unfortunately for Canva, there was a previously unknown issue on Cloudflare’s side: Cloudflare wasn’t using their dedicated fibre-optic lines to route traffic between their Northern Virginia and Singapore datacenters. That traffic was instead, unintentionally, going over the public internet.

[A] stale rule in Cloudflare’s traffic management system [that] was sending user IPv6 traffic over public transit between Ashburn and Singapore instead of its default route over the private backbone.

Traffic between Northern Virginia (IAD) and Singapore (SIN) was incorrectly routed over the public network

The routes that this traffic took suffered from considerable packet loss. For Canva users in Asia, this meant that they experienced massive increases in latency when their web browsers attempted to fetch the javascript static assets from the CDN.

A stale rule like this is the kind of issue that the safety researcher James Reason calls a latent pathogen. It’s a problem that remains unnoticed until it emerges as a contributor to an incident.

High latency synchronizes the callers

Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000+ user requests for the JavaScript file waited on the same cache stream. This created a backlog of requests from users in Southeast Asia.

The first client attempts to fetch the new JavaScript files from the CDN, but the files aren’t there yet, so the CDN must fetch them from the origin. Because of the added latency, this takes a long time.

During this time, other clients connect and attempt to fetch the JavaScript from the CDN. But the CDN has not yet been populated with the files from the origin; that transfer is still in progress.

As Cloudflare notes in this blog post, when subsequent clients request a file that is in the process of being populated in the cache, they would normally have to wait until the file has been fully cached before they can download it. However, Cloudflare has implemented functionality called Concurrent Streaming Acceleration, which permits multiple clients to simultaneously download a file that is still being fetched from the origin server.

The resulting behavior is that the CDN now behaves effectively as a barrier, with all of the clients slowly but simultaneously downloading the assets. With a traditional barrier, the waiting processes can proceed once all processes have entered the barrier. This isn’t quite the same: here, the waiting clients can all proceed once the CDN completes downloading the asset from the origin.
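To make that barrier-like behavior concrete, here’s a minimal asyncio sketch of request coalescing. This is my own illustration, not Cloudflare’s implementation, and the numbers are made up: the first request for a missing asset kicks off a single origin fetch, every later request for the same asset awaits that fetch, and all waiting clients are released at the same moment.

```python
import asyncio

# A toy sketch of CDN-style request coalescing (not Cloudflare's implementation).
# The first request for an uncached asset triggers one origin fetch; every later
# request awaits the same fetch, so all waiting clients complete simultaneously.

_inflight: dict[str, asyncio.Task] = {}

async def fetch_from_origin(path: str) -> bytes:
    await asyncio.sleep(5)          # in the Canva incident this took ~20 minutes
    return b"editor javascript bundle"

async def cdn_get(path: str) -> bytes:
    if path not in _inflight:       # only the first request hits the origin
        _inflight[path] = asyncio.create_task(fetch_from_origin(path))
    return await _inflight[path]    # everyone else waits on the same in-flight fetch

async def main() -> None:
    # 1,000 clients standing in for the 270,000+ requests in the incident.
    results = await asyncio.gather(*(cdn_get("/editor.js") for _ in range(1000)))
    print(len(results), "clients released at the same instant")

asyncio.run(main())
```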

The transfer completes, the herd thunders

At 9:07 AM UTC, the asset fetch completed, and all 270,000+ pending requests were completed simultaneously.

20 minutes after Canva deployed the new JavaScript assets to the origin server, the clients completed fetching them. The next action the clients took was to call Canva’s API service.

With the JavaScript file now accessible, client devices resumed loading the editor, including the previously blocked object panel. The object panel loaded simultaneously across all waiting devices, resulting in a thundering herd of 1.5 million requests per second to the API Gateway — 3x the typical peak load.

There’s one more issue that made this situation even worse: a known performance issue in the API gateway that was slated to be fixed.

A problematic call pattern to a library reduces service throughput

The API Gateways use an event loop model, where code running on event loop threads must not perform any blocking operations. 

Two common threading models for request-response services are thread-per-request and async. For services that are I/O-bound (i.e., most of the time servicing each request is spent waiting for I/O operations to complete, typically networking operations), the async model has the potential to achieve better throughput. That’s because the concurrency of the thread-per-request model is limited by the number of operating-system threads. The async model services multiple requests per thread, and so it doesn’t suffer from the thread bottleneck. Canva’s API gateway implements the async model using the popular Netty library.

One of the drawbacks of the async model is the risk associated with the active thread getting blocked, because this can result in a significant performance penalty. The async model multiplexes multiple requests across an individual thread, and none of those requests can make progress when that thread is blocked. Programmers writing code in a service that uses the async model need to take care to minimize the number of blocking calls.
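Here’s a small sketch of that hazard in Python’s asyncio rather than Canva’s actual Netty/Java code; the handler names and timings are mine, but the effect is the same. A blocking call on the event loop thread stalls every request multiplexed onto that loop, while offloading the blocking work keeps the other requests flowing.

```python
import asyncio
import time

# A toy asyncio analogue of the hazard described above (Canva's gateway uses
# Netty, not Python; the numbers here are made up). A blocking call on the event
# loop thread delays every request on that loop, not just the one that made it.

def blocking_telemetry_call() -> None:
    time.sleep(0.5)   # stands in for acquiring a contended lock in a telemetry library

async def slow_handler_bad() -> None:
    blocking_telemetry_call()                          # BAD: blocks the event loop

async def slow_handler_good() -> None:
    await asyncio.to_thread(blocking_telemetry_call)   # offload the blocking work

async def cheap_handler() -> float:
    start = time.monotonic()
    await asyncio.sleep(0.01)          # a request that should take ~10ms
    return time.monotonic() - start

async def measure(slow_handler) -> None:
    cheap = [cheap_handler() for _ in range(4)]
    slow = [slow_handler() for _ in range(4)]
    results = await asyncio.gather(*cheap, *slow)
    print("cheap request latencies:", [f"{lat:.2f}s" for lat in results[:4]])

async def main() -> None:
    await measure(slow_handler_bad)    # cheap requests stall behind the blocking calls
    await measure(slow_handler_good)   # cheap requests stay close to 10ms

asyncio.run(main())
```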

Prior to this incident, we’d made changes to our telemetry library code, inadvertently introducing a performance regression. The change caused certain metrics to be re-registered each time a new value was recorded. This re-registration occurred under a lock within a third-party library.

In Canva’s case, the API gateway logic was making calls to a third-party telemetry library. They were calling the library in such a way that it acquired a lock on every call, which is a blocking operation. This reduced the effective throughput that the API gateway could handle.
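We don’t get to see Canva’s actual telemetry code, but the general shape of the regression is easy to sketch. Everything below, names included, is hypothetical: re-registering a metric on every recorded value drags each request through a registry-wide lock, instead of registering once up front and doing a cheap per-value update.

```python
import threading

# Hypothetical sketch of the shape of the regression (not Canva's actual code):
# re-registering a metric on every recorded value takes a registry-wide lock each
# time, instead of registering once and doing a cheap per-value update afterwards.

class MetricRegistry:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._metrics: dict[str, float] = {}

    def register(self, name: str) -> None:
        with self._lock:                    # registry-wide lock: a blocking call
            self._metrics.setdefault(name, 0.0)

    def set(self, name: str, value: float) -> None:
        self._metrics[name] = value         # cheap update once registered

registry = MetricRegistry()

# The regression: every recorded value pays for the lock.
def record_value_regressed(name: str, value: float) -> None:
    registry.register(name)                 # re-registers (and locks) on every call
    registry.set(name, value)

# The intended pattern: register once, then record freely.
registry.register("request_latency_ms")
def record_value_intended(value: float) -> None:
    registry.set("request_latency_ms", value)
```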

Calls to the library led to excessive thread locking

Although the issue had already been identified and a fix had entered our release process the day of the incident, we’d underestimated the impact of the bug and didn’t expedite deploying the fix. This meant it wasn’t deployed before the incident occurred.

Ironically, they were aware of this problematic call pattern, and they were planning on deploying a fix the day of the incident(!).

As an aside, it’s worth noting the role that telemetry behavior played in the recent OpenAI incident, and the locking behavior of a tracing library in a complex performance issue that Netflix experienced. Observability giveth reliability, and observability taketh reliability away.

Canva is now in a situation where the API gateway is receiving much more traffic than it was provisioned to handle, and is also suffering from a performance regression that reduces its ability to handle traffic even further.

Now let’s look at how the system behaved under these conditions.

The load balancer turns into an overload balancer

Because the API Gateway tasks were failing to handle the requests in a timely manner, the load balancers started opening new connections to the already overloaded tasks, further increasing memory pressure.

A load balancer sits in front of a service and distributes the incoming requests across the units of compute. Canva runs atop ECS, so the individual units are called tasks, and the group is called a cluster (you can think of these as being equivalent to pods and replicasets in Kubernetes-land).

The load balancer will only send requests to a task that is healthy. If a task is unhealthy, then it stops being considered as a candidate target destination for the load balancer. This yields good results if the overall cluster is provisioned to handle the load: the traffic gets redirected away from the unhealthy tasks and onto the healthy ones.

Load balancer only sends traffic to the healthy tasks

But now consider the scenario where all of the tasks are operating close to capacity. As tasks go unhealthy, the load balancer will redistribute the load to the remaining “healthy” tasks, which increases the likelihood that those tasks get pushed into an unhealthy state.

Redirecting traffic to the almost-overloaded healthy nodes will push them over

This is a classic example of a positive feedback loop: the more tasks go unhealthy, the more traffic the remaining healthy tasks receive, and the more likely those tasks are to go unhealthy as well.
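Here’s a toy simulation of that feedback loop. The per-task capacity is made up, and this isn’t Canva’s actual configuration; the point is just that a fixed offered load gets spread across whichever tasks are still healthy, so each task that tips over increases the per-task load on the survivors.

```python
# A toy simulation of the feedback loop (made-up per-task capacity, not Canva's
# setup): a fixed offered load is spread across whichever tasks are still
# healthy, so each task that goes unhealthy raises the load on the survivors.

OFFERED_LOAD = 1_500_000      # requests/sec arriving at the load balancer
CAPACITY_PER_TASK = 62_500    # requests/sec a task can absorb before going unhealthy
tasks_healthy = 20            # 20 * 62,500 = 1.25M rps of capacity: not enough for 1.5M

while tasks_healthy > 0:
    per_task = OFFERED_LOAD / tasks_healthy
    print(f"{tasks_healthy:2d} healthy tasks -> {per_task:,.0f} rps each")
    if per_task <= CAPACITY_PER_TASK:
        print("cluster stabilizes")
        break
    tasks_healthy -= 1        # one more task tips into an unhealthy state
else:
    print("no healthy tasks left: total collapse")
```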

Autoscaling can’t keep pace

So, now the system is saturated, and the load balancer is effectively making things worse. Instead of shedding load, it’s concentrating load on the tasks that aren’t overloaded yet.

Now, this is the cloud, and the cloud is elastic, and we have a wonderful automation system called the autoscaler that can help us in situations of overload by automatically provisioning new capacity.

Only, there’s a problem here, and that’s that the autoscaler simply can’t scale up fast enough. And the reason it can’t scale up fast enough is because of another automation system that’s intended to help in times of overload: Linux’s OOM killer.

The growth of off-heap memory caused the Linux Out Of Memory Killer to terminate all of the running containers in the first 2 minutes, causing a cascading failure across all API Gateway tasks. This outpaced our autoscaling capability, ultimately leading to all requests to canva.com failing.

Operating systems need access to free memory in order to function properly. When all of the memory is consumed by running processes, the operating system runs into trouble. To guard against this, Linux has a feature called the OOM killer which will automatically terminate a process when the operating system is running too low on memory. This frees up memory, enabling the OS to keep functioning.

So, you have the autoscaler which is adding new tasks, and the OOM killer which is quickly destroying existing tasks that have become overloaded.

It’s notable that Humphries uses the term outpaced. This sort of scenario is common in complex systems failures: the system gets into a state where it simply can’t keep up. This phenomenon is called decompensation. Here’s resilience engineering pioneer David Woods describing decompensation on John Willis’s Profound Podcast:

And lag is really saturation in time. That’s what we call decompensation, right? I can’t keep pace, right? Events are moving forward faster. Trouble is building and compounding faster than I, than the team, than the response system can decide on and deploy actions to affect. So I can’t keep pace. – David Woods
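Here’s a toy sketch of that dynamic, with made-up rates rather than Canva’s real numbers: the autoscaler adds tasks at a fixed rate, but overloaded tasks get OOM-killed faster than their replacements arrive, so the healthy count never recovers.

```python
# A toy sketch of being "outpaced" (made-up rates, not Canva's real numbers):
# the autoscaler adds tasks at a fixed rate, but overloaded tasks are OOM-killed
# faster than replacements arrive, so the healthy count never recovers.

healthy = 50
ADD_PER_MIN = 5        # tasks the autoscaler can bring up per minute
KILLED_PER_MIN = 12    # overloaded tasks the OOM killer takes down per minute

for minute in range(1, 11):
    healthy = max(0, healthy + ADD_PER_MIN - KILLED_PER_MIN)
    print(f"minute {minute:2d}: {healthy} healthy tasks")
    if healthy == 0:
        print("the response can't keep pace: decompensation")
        break
```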

Adapting the system to bring it back up

At this point, the API gateway cluster is completely overwhelmed. From the timeline:

9:07 AM UTC – Network issue resolved, but the backlog of queued requests result in a spike of 1.5 million requests per second to the API gateway.

9:08 AM UTC – API Gateway tasks begin failing due to memory exhaustion, leading to a full collapse.

When your system is suffering from overload, there are basically two strategies:

  1. increase the capacity
  2. reduce the load

Wisely, the Canva engineers pursued both strategies in parallel.

Max capacity, but it still isn’t enough

Montgomery Scott, my nominee for patron saint of resilience engineering

 We attempted to work around this issue by significantly increasing the desired task count manually. Unfortunately, it didn’t mitigate the issue of tasks being quickly terminated.

The engineers tried to increase capacity manually, but even with the manual scaling, the load was too much: the OOM killer was taking the tasks down too quickly for the system to get back to a healthy state.

Load shedding, human operator edition

The engineers had to improvise a load shedding solution in the moment. The approach they took was to block traffic at the CDN layer, using Cloudflare.

 At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.

It’s worth noting here that while Cloudflare contributed to this incident with the stale rule, the fact that Canva’s engineers could dynamically configure Cloudflare firewall rules meant that Cloudflare also contributed to the mitigation of this incident.

Ramping the traffic back up

Here they turned off all of their traffic to give their system a chance to go back to healthy. But a healthy system under zero load behaves differently from a healthy system under typical load. If you jump straight back from zero to typical, there’s a risk that you push the system back into an unhealthy state. (One common problem is that autoscaling will have scaled multiple services down while there’s no load.)

Once the number of healthy API Gateway tasks stabilized to a level we were comfortable with, we incrementally restored traffic to canva.com. Starting with Australian users under strict rate limits, we gradually increased the traffic flow to ensure stability before scaling further.

The Canva engineers had the good judgment to ramp up the traffic incrementally rather than turn it back on all at once. They started restoring at 9:45 AM UTC, and were back to taking full traffic at 10:04 AM.
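The general pattern here, sketched below with made-up numbers and placeholder functions, is to admit a growing fraction of traffic, pause to check health at each step, and back off if the system looks like it’s tipping over again. This is my own illustration, not Canva’s runbook or tooling.

```python
import time

# A sketch of the general "ramp traffic back up in stages" pattern, with made-up
# numbers and placeholder functions -- not Canva's actual runbook or tooling.

RAMP_STEPS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]  # fraction of traffic admitted

def set_admitted_fraction(fraction: float) -> None:
    # Placeholder: in practice this would update a CDN firewall or rate-limit rule.
    print(f"admitting {fraction:.0%} of traffic")

def cluster_is_healthy() -> bool:
    # Placeholder: in practice this would check task counts, error rates, latency.
    return True

for fraction in RAMP_STEPS:
    set_admitted_fraction(fraction)
    time.sleep(60)                            # let the system settle at this level
    if not cluster_is_healthy():
        set_admitted_fraction(fraction / 2)   # back off and hold while investigating
        break
```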

Some general observations

All functional requirements met

I always like to call out situations where, from a functional point of view, everything was actually working fine. In this case, even though there was a stale rule in the Cloudflare traffic management system, and there was a performance regression in the API gateway, everything was working correctly from a functional perspective: packets were still being routed between Singapore and Northern Virginia, and the API gateway was still returning the proper responses for individual requests before it got overloaded.

Rather, these two issues were both performance problems. Performance problems are much harder to spot, and the worst are the ones that you don’t notice until you’re under heavy load.

The irony is that, as an organization gets better at catching functional bugs before they hit production, more and more of the production incidents they face will be related to these more difficult-to-detect-early performance issues.

Automated systems made the problem worse

There were a number of automated systems in play whose behavior made this incident more difficult to deal with.

The Concurrent Streaming Acceleration functionality synchronized the requests from the clients. The OOM killer reduced the time it took for a task to be seen as unhealthy by the load balancer, and the load balancer in turn increased the rate at which tasks went unhealthy.

None of these systems were designed to handle this sort of situation, so they could not automatically change their behavior.

The human operators changed the way the system behaved

It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. They were able to leverage an existing resource, Cloudflare’s firewall functionality, to accomplish this. Based on the description of the action items, I suspect they had never used Cloudflare’s firewall to do this type of load shedding before. But it worked! They successfully adapted the system behavior.

We’re building a detailed internal runbook to make sure we can granularly reroute, block, and then progressively scale up traffic. We’ll use this runbook to quickly mitigate any similar incidents in the future.

This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.

As I’ve written about previously, Woods talks about the idea of a competence envelope. The competence envelope is sort of a conceptual space of the types of inputs that your system can handle. Incidents occur when your system is pushed to operate outside of its competence envelope, such as when it gets more load than it is provisioned to handle.

The competence envelope is a good way to think about the difference between robustness and resilience. You can think of robustness as describing the competence envelope itself: a more robust system may have a larger competence envelope, it is designed to handle a broader range of problems.

However, every system has a finite competence envelope. The difference between a resilient and a brittle system is how that system behaves when it is pushed just outside of its competence envelope.

Incidents happen when the system is pushed outside of its competence envelope

A resilient system can change the way it behaves when an incident pushes it outside of its competence envelope, extending the envelope so that it can handle the incident. That’s why we say it has adaptive capacity. On the other hand, a brittle system is one that cannot adapt effectively when it exceeds its competence envelope. A system can be very robust, but also brittle: it may be able to handle a very wide range of problems, but when it faces a scenario it wasn’t designed to handle, it can fall over hard.

The sort of adaptation that resilience demands requires human operators: our automation simply doesn’t have a sophisticated enough model of the world to be able to handle situations like the one that Canva found itself in.

In general, action items after an incident focus on expanding the competence envelope: making changes to the system so that it can handle the scenario that just happened. Improving adaptive capacity involves a different kind of work than improving system robustness.

We need to build in the ability to reconfigure our systems in advance, without knowing exactly what sorts of changes we’ll need to make. The Canva engineers had some powerful operational knobs at their disposal through the Cloudflare firewall configuration. This allowed them to make changes. The more powerful and generic these sorts of dynamic configuration features are, the more room for maneuver we have. Of course, dynamic configuration is also dangerous, and is itself a contributor to incidents. Too often we focus solely on the dangers of such functionality in creating incidents, without seeing its ability to help us reconfigure the system to mitigate incidents.

Finally, these sorts of operator interfaces are of no use if the responders aren’t familiar with them. Ultimately, the more your responders know about the system, the better position they’ll be in to implement these adaptations. Changing an unhealthy system is dangerous: no matter how bad things are, you can always accidentally make things worse. The more knowledge about the system you can bring to bear during an incident, the better position you’ll be in to adapt your system to extend that competence envelope.

Quick takes on the recent OpenAI public incident write-up

OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:

Saturation

With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters.

The term saturation describes the condition where a system has reached the limit of what it can handle. This is sometimes referred to as overload or resource exhaustion. In the OpenAI incident, it was the Kubernetes API servers that were saturated, because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed.

Saturation is an extremely common failure mode in incidents, and here OpenAI provides us with yet another example. You can also read some previous posts about public incident writeups involving saturation: Cloudflare, Rogers, and Slack.

All tests pass

The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue.

One reason why it’s difficult to prevent saturation-related incidents is that all of the software can be functionally correct, in the sense that it passes all of the functional tests; the failure mode only rears its ugly head once the system is exposed to conditions that occur only in the production environment. Even canarying with production traffic can’t prevent problems that only occur under full load.

Our main reliability concern prior to deployment was resource consumption of the new telemetry service. Before deployment, we evaluated resource utilization metrics in all clusters (CPU/memory) to ensure that the deployment wouldn’t disrupt running services. While resource requests were tuned on a per cluster basis, no precautions were taken to assess Kubernetes API server load. This rollout process monitored service health but lacked sufficient cluster health monitoring protocols.

It’s worth noting that the engineers did validate the change in resource utilization on the clusters where the new telemetry configuration was deployed. The problem was an interaction: it increased load on the API servers, which brings us to the next point.

Complex, unexpected interactions

This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways.

When we look at system failures, we often look for problems in individual components. But in complex systems, identifying the complex, unexpected interactions can yield better insights into how failures happen. You don’t just want to look at the boxes, you also want to look at the arrows.

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

So, we rolled out the new telemetry service, and, yada yada yada, our services couldn’t call each other anymore.

In this case, the surprising interaction was between a failure of the Kubernetes API and the resulting failure of services running on top of Kubernetes. Normally, if you have services running on top of Kubernetes and the Kubernetes API goes unhealthy, your services should keep running normally; you just can’t make changes to your current deployment (e.g., deploy new code, change the number of pods). However, in this case, a failure in the Kubernetes API (control plane) ultimately led to failures in the behavior of running services (data plane).

The coupling between the two? It was DNS.

DNS

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

Impact of a change is spread out over time

DNS caching added a delay between making the change and when services started failing.

One of the things that makes DNS-related incidents difficult to deal with is the nature of DNS caching.

When the effect of a change is spread out over time, this can make it more difficult to diagnose what the breaking change was. This is especially true when the critical service that stopped working (in this case, service discovery) was not the thing that was changed (telemetry service deployment).

DNS caching made the issue far less visible until the rollouts had begun fleet-wide.

In this case, the effect was spread out over time because of the nature of DNS caching. But often we intentionally spread out a change over time because we want to reduce the blast radius if the change we are rolling out turns out to be a breaking change. This works well if we detect the problem during the rollout. However, this can also make it harder to detect the problem, because the error signal is smaller (by design!). And if we only detect the problem after the rollout is complete, it can be harder to correlate the change with the effect, because the change was smeared out over time.
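Here’s a toy illustration of that smearing effect, with my own numbers rather than OpenAI’s: the breaking change lands at t=0, but each client only notices once its cached DNS record expires, so the failures trickle in over the whole TTL window.

```python
# A toy illustration of how caching smears out the effect of a change (my own
# numbers, not OpenAI's): the breaking change lands at t=0, but each client only
# sees it once its cached DNS record expires, so failures trickle in over the TTL.

TTL = 300                                        # seconds a cached record stays valid
last_lookup = [-t for t in range(0, 300, 10)]    # each client last resolved 0-290s ago

for now in range(0, 361, 60):
    failing = sum(1 for t in last_lookup if t + TTL <= now)
    print(f"t={now:3d}s: {failing:2d}/{len(last_lookup)} clients see the failure")
```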

Failure mode makes remediation more difficult

 In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Sometimes the failure mode that breaks systems that production depends upon also breaks systems that operators depend on to do their work. I think James Mickens said it best when he wrote:

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS

Facebook encountered similar problems when they experienced a major outage back in 2021:

And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this. 

This type of problem often requires that operators improvise a solution in the moment. The OpenAI engineers pursued multiple strategies to get the system healthy again.

We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:

  1. Scaling down cluster size: Reduced the aggregate Kubernetes API load.
  2. Blocking network access to Kubernetes admin APIs: Prevented new expensive requests, giving the API servers time to recover.
  3. Scaling up Kubernetes API servers: Increased available resources to handle pending requests, allowing us to apply the fix.

By pursuing all three in parallel, we eventually restored enough control to remove the offending service.

Their interventions were successful, but it’s easy to imagine scenarios where one of these interventions accidentally made things even worse. As Richard Cook noted: all practitioner actions are gambles. Incidents always involve uncertainty in the moment, and it’s easy to overlook this when we look back with perfect knowledge of how the events unfolded.

A change intended to improve reliability

As part of a push to improve reliability across the organization, we’ve been working to improve our cluster-wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability. This is another data point for my conjecture on why reliable systems fail.

Your lying virtual eyes

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup

In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, we interact through representations. In particular, we view representations of the internal state of the system, and we manipulate these representations in order to effect changes, to control the system. Cook used the term line of representation to describe the split between the world of the technical (software) system and the world of the people who work with the technical system. The people are above the line of representation, and the technical system is below the line.

Above the line of representation are the people, organizations, and processes that shape, direct, and restore the technical artifacts that lie below that line. People who work above the line routinely describe what is below the line using concrete, realistic language.

Yet, remarkably, nothing below the line can be seen or acted upon directly. The displays, keyboards, and mice that constitute the line of representation are the only tangible evidence that anything at all lies below the line. All understandings of what lies below the line are constructed in the sense proposed by Bruno Latour and Steve Woolgar. What we “know”—what we can know—about what lies below the line depends on inferences made from representations that appear on the screens and displays.

In short, we can never actually see or change the system directly; all of our interactions are mediated through software interfaces.

René Magritte would have appreciated Cook’s article

In this post, I want to talk about how this fact can manifest as incidents, and that our solutions rarely consider this problem. Let’s start off, as we so often do in the safety world, with the Three Mile Island accident.

Three Mile Island and the indicator light

I assume the reader has some familiarity with the partial meltdown that occurred at the Three Mile Island nuclear plant back in 1979. As it happens, there’s a great series of lectures by Cook on accidents. His first lecture is about how Three Mile Island changed the way safety specialists thought about the nature of accidents.

Here I want to focus on just one aspect of this incident: a particular indicator light in the Three Mile Island control room. During this incident, there was a type of pressure relief valve called a pilot-operated relief valve (PORV) that was stuck open. However, the indicator light for the state of this valve was off, which the operators interpreted (incorrectly, alas) as the valve being closed. Here I’ll quote the wikipedia article:

A light on a control panel, installed after the PORV had stuck open during startup testing, came on when the PORV opened. When that light—labeled Light on – RC-RV2 open—went out, the operators believed that the valve was closed. In fact, the light, when on, only indicated that the PORV pilot valve’s solenoid was powered, not the actual status of the PORV. While the main relief valve was stuck open, the operators believed the unlighted lamp meant the valve was shut. As a result, they did not correctly diagnose the problem for several hours.

What I found notable was the article’s comment about lack of operator training to handle this specific scenario, a common trope in incident analysis.

The operators had not been trained to understand the ambiguous nature of the PORV indicator and to look for alternative confirmation that the main relief valve was closed. A downstream temperature indicator, the sensor for which was located in the tail pipe between the pilot-operated relief valve and the pressurizer relief tank, could have hinted at a stuck valve had operators noticed its higher-than-normal reading. It was not, however, part of the “safety grade” suite of indicators designed to be used after an incident, and personnel had not been trained to use it. Its location behind the seven-foot-high instrument panel also meant that it was effectively out of sight.

Now, consider what happens if the agent acting on these sensors is an automated control system instead of a human operator.

Sensors, automation, and accidents: cases from aviation

In the aviation world, we have a combination of automation and human operators (pilots) who work together in real-time. The assumption is that if something goes wrong with the automation, the human can quickly take over and deal with the problem. But automation can make things too difficult for a human to compensate for, and automation can be particularly vulnerable to sensor problems, as we can see in the following accidents:

Bombardier Learjet 60 accident, 2008

On September 19, 2008, in Columbia, South Carolina, a Bombardier Learjet 60 overran the runway during a rejected takeoff. As a consequence, four people aboard the plane, including the captain and first officer, were killed. In this case, the sensor issues were due to damage to electronics in the wheel well area after underinflated tires on the landing gear exploded.

The pilots reversed thrust to slow down the plane. However, the tires on the plane were under-inflated, and they exploded. As a result of the tire explosion, sensors in the wheel well area of the plane were damaged.

The thrust reverse system relies on sensor data to determine whether reversing thrust is a safe operation. Because of the sensor damage, the system determined that it was not safe to reverse thrust, and instead increased forward thrust. From the NTSB report:

In this situation, the EECs would transition from the reverse thrust power schedule to the forward thrust power schedule during about a 2-second transition through idle power. During the entire sequence, the thrust reverser levers in the cockpit would remain in the reverse thrust idle position (as selected by the pilot) while the engines produced forward thrust. Because both the thrust reverser levers and the forward thrust levers share common RVDTs (one for the left engine and one for the right engine), the EECs, which receive TLA information from the RVDTs, would signal the engines to produce a level of forward thrust that generally corresponds with the level of reverse thrust commanded; that is, a pilot commanding full reverse thrust (for maximum deceleration of the airplane) would instead receive high levels of forward thrust (accelerating the airplane) according to the forward thrust power schedule

(My initial source for this was John Thomas’s slides.)

Air France 447, 2009

On June 1, 2009, Air France 447 crashed, killing all passengers and crew. The plane was an Airbus A330-200. In this accident, the sensor problem is believed to have been caused by ice crystals that accumulated inside the pitot tube sensors, creating a blockage which led to erroneous readings. Here’s a quote from an excellent Vanity Fair article on the crash:

Just after 11:10 P.M., as a result of the blockage, all three of the cockpit’s airspeed indications failed, dropping to impossibly low values. Also as a result of the blockage, the indications of altitude blipped down by an unimportant 360 feet. Neither pilot had time to notice these readings before the autopilot, reacting to the loss of valid airspeed data, disengaged from the control system and sounded the first of many alarms—an electronic “cavalry charge.” For similar reasons, the automatic throttles shifted modes, locking onto the current thrust, and the fly-by-wire control system, which needs airspeed data to function at full capacity, reconfigured itself from Normal Law into a reduced regime called Alternate Law, which eliminated stall protection and changed the nature of roll control so that in this one sense the A330 now handled like a conventional airplane. All of this was necessary, minimal, and a logical response by the machine.

This is what the safety researcher David Woods refers to as bumpy transfer of control, where the humans must suddenly and unexpectedly take over control of an automated system, which can lead to disastrous consequences.

Boeing 737 MAX 8 (2018, 2019)

On October 29, 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff, killing everyone on board. Five months later, on March 10, 2019, Ethiopian Airlines Flight 302 crashed six minutes after takeoff, also killing everyone on board. Both planes were Boeing 737 MAX 8s. In both cases, the sensor problem was related to the angle-of-attack (AOA) sensor.

Lion Air Flight 610 investigation report:

The replacement AOA sensor that was installed on the accident aircraft had been mis-calibrated during an earlier repair. This mis-calibration was not detected during the repair.

Ethiopian Airline Flight 302 investigation report:

Shortly after liftoff, the left Angle of Attack sensor recorded value became erroneous and the left stick shaker activated and remained active until near the end of the recording.

An automation subsystem in the 737 MAX called Maneuvering Characteristics Augmentation System (MCAS) automatically pushed the nose down in response to the AOA sensor data.

What should we take away from these?

Here I’ve given examples from aviation, but sensor-automation problems are not specific to that domain. Here are a few of my own takeaways.

We designers can’t assume sensor data will be correct

The kinds of safety automation subsystems we build in tech are pretty much always closed-loop control systems. When designing such systems in the tech world, how often have you heard someone ask, “what happens if there’s a problem with the sensor data that the system is reacting to?”

This goes back to the line of representation problem: no agent ever gets access to the true state of the system; it only gets access to some sort of representation. The irony here is that this doesn’t just apply to humans (above the line) making sense of signals, it also applies to technical system components (below the line!) making sense of signals from other technical components.
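As a sketch of what taking that question seriously might look like (entirely hypothetical, not drawn from any of the systems above): cross-check redundant readings, and when they disagree, have the automation stand down and hand the decision to a human rather than silently trusting a single signal.

```python
import statistics

# A hypothetical sketch, not drawn from any of the systems above: a closed-loop
# controller that treats sensor data as fallible. It cross-checks redundant
# readings and stands down (alerting a human) when they disagree, instead of
# silently trusting a single signal.

DISAGREEMENT_THRESHOLD = 5.0   # max spread we're willing to trust (made-up units)
ACTION_LIMIT = 100.0           # reading above which we'd normally take corrective action

def decide(readings: list[float]) -> str:
    spread = max(readings) - min(readings)
    if spread > DISAGREEMENT_THRESHOLD:
        # Don't act on data we can't trust; surface the conflict to the operator.
        return "alert operator: sensors disagree, automation standing down"
    if statistics.median(readings) > ACTION_LIMIT:
        return "take corrective action"
    return "no action needed"

print(decide([98.0, 99.0, 97.5]))    # consistent readings: trust the data
print(decide([98.0, 99.0, 140.0]))   # one sensor way off: hand off to the human
```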

Designing a system that is safe in the face of sensor problems is hard

Again, from the NTSB report of the Learjet 60 crash:

Learjet engineering personnel indicated that the uncommanded stowage of the thrust reversers in the event of any system loss or malfunction is part of a fail-safe design that ensures that a system anomaly cannot result in a thrust reverser deployment in flight, which could adversely affect the airplane’s controllability. The design is intended to reduce the pilot’s emergency procedures workload and prevent potential mistakes that could exacerbate an abnormal situation.

The thrust reverser system behavior was designed by aerospace engineers to increase safety, and ended up making things worse! Good luck imagining all of these sorts of scenarios when you design your systems to increase safety.

Even humans struggle in the face of sensor problems

People are better equipped to handle sensor problems than automation, because we don’t seem to be able to build automation that can handle all of the possible kinds of sensor problems that we might throw at it.

But even for humans, sensor problems are difficult. While we’ll eventually figure out what’s going on, we’ll still struggle in the face of conflicting signals, as anyone who has responded to an incident can tell you. And in high-tempo situations, where we need to respond quickly enough or something terrible will happen (like in the Air France 447 case), we simply might not be able to respond quickly enough.

Instead of focusing on building the perfect fail-safe system to prevent this next time, I wish we’d spend more time thinking about: “how can we help the human figure out what the heck is happening when the input signals don’t seem to make sense?”