The problem with counterfactuals

Incidents make us feel uncomfortable. They remind us that we don’t have control, that the system can behave in ways that we didn’t expect. When an incident happens, the world doesn’t make sense.

A natural reaction to an incident is an effort to identify how the incident could have been avoided. The term for this type of effort is counterfactual reasoning. It refers to thinking about how, if the people involved had taken different actions, events would have unfolded differently. Here are two examples of counterfactuals:

  • If the engineer who made the code change had written a test for feature X, then the bug would never have made its way into production.
  • If the team members had paid attention to the email alerts that had fired, they would have diagnosed the problem much sooner.

Counterfactual reasoning is comforting because it restores the feeling that the world makes sense. What felt like a surprise is, in fact, perfectly comprehensible. What’s more, it could even have been avoided, if only we had taken the right actions and paid attention to the right signals.

While counterfactual reasoning helps restore our feeling that the world makes sense, the problem with it is that it doesn’t help us get better at avoiding or dealing with future incidents. The reason it doesn’t help is that counterfactual reasoning gives us an excuse to avoid the messy problem of understanding how we missed those obvious-in-retrospect actions and signals in the first place.

It’s one thing to say “they should have written a test for feature X”. It’s another thing to understand the rationale behind the engineer not writing that test. For example:

  • Did they believe that this functionality was already tested in the existing test suite?
  • Were they not aware of the existence of the feature that failed?
  • Were they under time pressure to get the code pushed into production (possibly to mitigate an ongoing issue)?

Similarly, saying “they should have paid closer to attention to the email alerts” means you might miss the fact that the email alert in question isn’t actionable 90% of the time, and so the team has conditioned themselves to ignore it.

To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions. It replaces “what were the circumstances that led to person X taking action Y” with “person X should have done Z instead of Y”.

Counterfactual reasoning is only useful if you have a time machine and can go back to prevent the incident that just happened. For the rest of us who don’t have time machines, counterfactual reasoning helps us feel better, but it doesn’t make us better at engineering and operating our systems. Instead, it actively prevents us from getting better.

Don’t ask “why didn’t they do Y instead of X?” Instead, ask, “how was it that doing X made sense to them at the time?” You’ll learn a lot more about the world if you ask questions about what did happen instead of focusing on what didn’t.

Experts aren’t good at building shared understanding

If only HP knew what HP knows, we would be three times more productive.

Lew Platt, former CEO of Hewlett-Packard

One pattern that you see over and over again in operational surprises is that a person who was involved in the surprise was missing some critical bit of information. For example, there may be an implicit contract that becomes violated when someone makes a code change. Or there might be a certain batch job that runs every Tuesday at 4PM might trigger and puts some additional load on the database.

Almost always, this kind of information is present in the head of someone else within the organization. It just wasn’t in the head of the person who really needed it at that moment.

I think the problem of missing information is well understood enough that you see variants of it crop in different places. Here are some examples I’ve encountered:

It turns out that experts are very good at accumulating these critical bits of information and recalling them at the appropriate time. Experts are also very good at communicating efficiently with others who share a lot of that critical information in their heads.

However, what experts are not very good at is transmitting this information to others who don’t yet have it. Experts aren’t explicitly aware of the value of all of this information, and so they tend not to volunteer it without being asked. When a newcomer watches an expert in action, a common refrain is, “how did you know to do that?”

The fact that experts aren’t good at sharing the useful information that they know is one of the challenges that incident investigators face. One of the skills of an investigator is how to elicit these bits of knowledge through interviews.

I think that advancing shared understanding in an organization has the potential to be enormously valuable. One of the things that I hope to accomplish with sharing out writeups of operational surprises is to use them as a vehicle for doing so.

Even if there isn’t a single actionable outcome from a writeup, you never know when that critical bit of knowledge that has been implanted in the heads of the readers will come in handy.

Tuning to the future

In short, the resilience of a system corresponds to its adaptive capacity tuned to the future. [emphasis added]

Branlat, Matthieu & Woods, David. (2010). How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective. AAAI Fall Symposium – Technical Report

In simple terms, an incident is a bad thing that has happened that was unexpected. This is just the sort of thing that makes people feel uneasy. Instinctively, we want to be able to say “We now understand what has happened, and we are taking the appropriate steps to make sure that this never happens again.”

But here’s the thing. Taking steps to prevent the last incident from recurring doesn’t do anything to help you deal with the next incident, because your steps will have ensured that the next one is going to be completely different. There is, however, one thing that your next incident will have in common with the last one: both of them are surprises.

We can’t predict the future, but we can get better at anticipating surprise, and dealing with surprise when it happens. Getting better at dealing with surprise is what resilience engineering is all about.

The first step is accepting that surprise is inevitable. That’s hard to do. We want to believe that we are in control of our systems, that we’ve plugged all of the holes. Sure, we may have had a problem before, but we fixed that. If we can just take the time to build it right, it’ll work properly.

Accepting that future operational surprises are inevitable isn’t natural for engineers. It’s not the way we think. We design systems to solve problems, and one of the problems is staying up. We aren’t fatalists.

However, once we do accept that operational surprise is inevitable, we can shift our thinking of the system from the computer-based system to the broader socio-technical system that includes both the people and the computers. The solution space here looks very different, because we aren’t used to thinking about designing systems where people are part of the system, especially when we engineers are part of the system we’re building!

But if we want the ability to handle things the future is going to throw at us, then we need to get better at dealing with surprise. Computers are lousy at this, they can’t adapt to situations they weren’t designed to handle. But people can.

In this frame, accepting that operational surprises are inevitable isn’t fatalism. Building adaptive capacity to deal with future surprises is how we tune to the future.