When a bad analysis is worse than none at all

One of the most famous physics experiments in modern history is the double-slit experiment, originally performed by the English physicist Thomas Young back in 1801. You probably learned about this experiment in a high school physics class. There was a long debate in physics about whether light was a particle or a wave, and Young’s experiment provided support for the wave theory. (Today, we recognize that light has a dual nature, with both particle-like and wave-like behaviors.)

To run the experiment, you need an opaque board that has two slits cut out of it, as well as a screen. You shine a light at the board and look to see what the pattern of light looks like on the screen behind it.

Here’s a diagram from Wikipedia, which shows the experiment being run with electrons rather than light, but is otherwise the same idea.

Original: NekoJaNekoJa; Vector: Johannes Kalliauer, CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons

If light were a particle, then you would expect each light particle to pass through either one slit or the other. The intensities you’d observe on the screen would look like the sum of the intensities from running the experiment with one slit covered, and then running it again with the other slit covered. It should basically look like the sum of two Gaussian distributions with different means.

However, that isn’t what you actually see on the screen. Instead, you get a pattern where some areas of the screen have no intensity at all: places where the light never strikes. Yet if you run the experiment with either slit covered, you will get light at those null locations. This is an interference effect: having two slits open leads the light to behave differently from the sum of the effects of each slit alone.
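One way to see the difference between the two predictions is a small numerical sketch. All the numbers here (slit positions, blob widths, and the rate at which the phase difference grows across the screen) are invented for illustration: adding intensities gives a smooth two-bump pattern, while adding amplitudes first and then squaring produces interference nulls.

```python
import numpy as np

# Positions along the screen, in arbitrary units; the two slits sit
# behind x = -1 and x = +1. All parameters are made up for illustration.
x = np.linspace(-5, 5, 1001)

# Single-slit intensity profiles: roughly Gaussian blobs behind each slit.
left = np.exp(-((x + 1.0) ** 2) / 2.0)
right = np.exp(-((x - 1.0) ** 2) / 2.0)

# Particle prediction: the intensities simply add.
particle = left + right

# Wave prediction: the *amplitudes* add, and intensity is the squared
# magnitude of the total amplitude. A phase difference that grows with
# screen position (a made-up rate of 20 radians per unit) produces fringes.
phase = 20.0 * x
wave = np.abs(np.sqrt(left) + np.sqrt(right) * np.exp(1j * phase)) ** 2

# Near the center of the screen, the particle pattern stays bright
# everywhere, while the wave pattern dips to near zero at the nulls.
center = np.abs(x) <= 1.0
print(particle[center].min(), wave[center].min())
```

The particle pattern never goes dark in the central region, but the wave pattern does, even though both are built from the same two single-slit profiles. That's the signature the screen shows and the slit labels can't.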

Note that we see the same behavior with electrons (hence the diagram above). Both electrons and light (photons) exhibit this sort of wavelike behavior. This behavior is observed even if you fire only one electron (or photon) at a time through the slits.

Now, imagine a physicist in the 1970s hires a technician to run this experiment with electrons. The physicist asks the tech to fire one electron at a time from an electron gun at the double-slit board, and to record the intensities of the electrons striking a phosphor screen, like on a cathode ray tube (kids, ask your parents about TVs in the old days). Imagine that the physicist doesn’t tell the technician anything about the theory being tested; the technician is just asked to record the measurements.

Let’s imagine this thought process from the technician:

It’s a lot of work to record the measurements from the phosphor screen, and all of this intensity data is pretty noisy anyways. Instead, why don’t I just identify the one location on the screen that was the brightest, use that location to estimate which slit the electron was most likely to have passed through, and then just record that slit? This will drastically reduce the effort required for each experiment. Plus, the resulting data will be a lot simpler to aggregate than the distribution of messy intensities from each experiment.

The data that the technician records then ends up looking like this:

Experiment    Slit
1             left
2             left
3             right
4             left
5             right
6             left

Now, the experimental data above will give you no insight into the wave nature of electrons, no matter how many experiments are run. This sort of experiment isn’t merely no better than nothing: it’s worse than nothing, because it obscures the nature of the phenomenon that you’re trying to study!
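To make the information loss concrete, here is a hypothetical simulation of the technician's protocol. Everything in it (the screen geometry, the two candidate intensity patterns, the noise level, and the "record the side with the brightest spot" rule) is invented for illustration. The point is that a world with interference and a world without it produce statistically indistinguishable slit-label data:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up screen with slits behind x = -1 and x = +1. "particle" is the
# no-interference pattern (intensities add); "wave" adds amplitudes with a
# position-dependent phase, producing interference fringes.
x = np.linspace(-5, 5, 1001)
left = np.exp(-((x + 1.0) ** 2) / 2.0)
right = np.exp(-((x - 1.0) ** 2) / 2.0)
particle = left + right
wave = np.abs(np.sqrt(left) + np.sqrt(right) * np.exp(20j * x)) ** 2

def technician_labels(pattern, n_experiments=2000, noise=0.5):
    """The lossy protocol: for each experiment, find the brightest spot on
    the noisy screen and record only which side of center it fell on."""
    labels = []
    for _ in range(n_experiments):
        noisy = pattern + noise * rng.standard_normal(pattern.size)
        labels.append("left" if x[np.argmax(noisy)] < 0 else "right")
    return labels

frac_particle = technician_labels(particle).count("left") / 2000
frac_wave = technician_labels(wave).count("left") / 2000

# Both models yield roughly 50/50 left/right labels: the recorded slit
# data cannot distinguish a world with interference from one without it.
print(frac_particle, frac_wave)
```

Both label distributions come out close to 50/50, so no amount of this data will ever reveal the fringes. The raw intensity measurements could have; the recording protocol threw them away.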

Now, here’s my claim: when people say “the root cause analysis process may not be perfect, but it’s better than nothing”, this is what I worry about. They are making an implicit assumption about a model of how incidents happen (that there is a root cause), and the information that they capture about the incidents is determined by that model.

A root cause analysis approach will never provide insight into how incidents arise through complex interactions, because it intentionally discards the data that could provide that insight. It’s like the technician who doesn’t record the intensity measurements at all, but instead uses them only to pick a slit, and records just the slit.

The alternative is to collect a much richer set of data from each incident. That more detailed data collection is going to be a lot more effort, and a lot messier. It’s going to involve recording details about people’s subjective observations and fuzzy memories, and it will depend on what types of questions are asked of the responders. It will also depend on what sorts of data you even have available to capture. And there will be many subjective decisions about what data to record and what to leave out.

But if your goal is to actually get insights from your incidents about how they’re happening, then that effortful, messy data collection will reveal insights that you won’t ever get from a root cause analysis. If, instead, you continue to rely on root cause analysis, you are going to be misled about how your system actually fails and how it really works. This is what I mean when I say that good models protect us from bad models, and why root cause analysis can actually be worse than nothing.

Don’t be like the technician, discarding the messy data because it’s cleaner to record which slit the electron went through. Because then you’ll miss that the electron is somehow going through both.
