You’re missing your near misses

FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025

The amount of attention an incident gets is proportional to its severity: the greater the impact to the organization, the more attention the post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling it is to people: they worry specifically about that incident recurring, and they want to prevent it from happening again.

Here’s the problem: most of your incidents aren’t going to be repeat incidents. Nobody wants an incident to recur, so there’s a natural, built-in mechanism for engineering teams to put in the effort to do preventative work. The real challenge is preventing and quickly mitigating novel future incidents, which make up the overwhelming majority of your incidents.

And that brings us to near misses, those operational surprises that have no actual impact, but that could have been major incidents if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, omens.

Because most of our incidents are novel, and because near misses are a source of insight about novel future incidents, if we are serious about improving reliability, we should be treating our near misses as first-class entities, the way we do with incidents. Yet I’d wager that there are no tech companies out there today that would put the same level of effort into a near miss as they would into a real incident. I’d love to hear about a tech company that holds near miss reviews, but I haven’t heard of any yet.

There are real challenges to treating near misses as first-class. We can generally afford to spend a lot of post-incident effort on each high-severity incident, because there generally aren’t that many of them. I’m quite confident that your org encounters many more near misses than it does high-severity incidents, and nobody has the cycles to put in the same level of effort for every near miss as they do for every high-severity incident. This means that we need to use judgment. We can’t use severity of impact to guide us here, because these near misses are, by definition, zero severity. We need to identify which near misses are worth examining further, and which ones to let go. It’s going to be a judgment call about how much we could potentially learn from looking further.

The other challenge is simply surfacing these near misses. Because they have zero impact, it’s likely that only a handful of people in the organization are aware when a near miss happens. Treating near misses as first-class events requires a cultural shift in an organization, where the people who are aware of them highlight the near miss as a potential source of insight for improving reliability. People have to see the value in sharing when these happen: it has to be rewarded, or it won’t happen.

These near misses are happening in your organization right now. Some of them will eventually blossom into full-blown high-severity incidents. If you’re not looking for them, you won’t see them.

5 thoughts on “You’re missing your near misses”

  1. I suppose this is one of the few times where you are advocating for counterfactuals?

    Because normally the problem with counterfactuals (per your other blog post) is:

    it doesn’t help us get better at avoiding or dealing with future incidents

    But in this case it does.

    I think your point on judgement is the real key though. If I was making a judgement call for whether we do a retro on every near-miss I would say “no” 100% of the time! My reasoning would be that we have enough work to do already, and most of the type of change we make is constructed to be incrementally deployed in some way.

    Or how about Crowdstrike? Every deploy was a near miss!

    This is in contrast to the aviation industry where almost any incident is unacceptable.

    Do you have any opinions on how this near-miss philosophy jibes with your counterfactual philosophy?

    1. I suppose this is one of the few times where you are advocating for counterfactuals?

      I admit that there is a counterfactual nature to near misses. The counterfactual element here is just a tool to get people’s attention to examine how the work is done in more detail. “Hey, we almost had an incident, let’s take a close look at how some aspect of the work gets done” is more effective at getting people’s attention than just randomly saying, “Hey, let’s take a close look at how some aspect of the work gets done.”

      I think the focus should still be on “is this potentially a future risk?”. The counterfactual element (“would it have been an incident if things were a little different”) isn’t as important as “does this reveal something that could become an incident in the future?”

      If I was making a judgement call for whether we do a retro on every near-miss I would say “no” 100% of the time! My reasoning would be that we have enough work to do already, and most of the type of change we make is constructed to be incrementally deployed in some way.

      Yep, same! This is one of the hardest aspects of this type of work: which threads are worth spending the effort to pull on?

      Or how about Crowdstrike? Every deploy was a near miss!

      I use the term “near miss” to only refer to things we recognize as near misses. In that sense, the previous Crowdstrike deploys weren’t near misses, assuming nobody recognized the risk earlier on.

      This is in contrast to the aviation industry where almost any incident is unacceptable.

      Aviation is an interesting example, because they receive a lot of weak signals and have to make judgment calls about which ones to investigate further. For example, there’s the Aviation Safety Reporting System (ASRS). A great first-hand account of the history of the ASRS’s creation is in the appendix of A Tale of Two Stories: Contrasting Views of Patient Safety. Notably, the medical field has failed to come up with an equivalent system (and the guy who came up with the ASRS was a doctor!).

      Another good source, about how individual airlines investigate these weak signals, is the book Close Calls: Managing Risk and Resilience in Airline Flight Safety by the researcher Carl Macrae.

  2. Pilot and doctor here: the difference between aviation’s obsessive safety culture and medicine’s head-in-the-sand denialism is the most striking contrast between my two careers (spanning a combined 45 years). The difference is the FAA-driven compliance culture on one side, and the capitalist health care system that only cares about bottom lines on the other. Ask yourself: with all of its scandals, would any smart American even consider putting Boeing in charge of safety?
