There’s a management aphorism that goes “you can’t manage what you can’t measure”. It is … controversial. W. Edwards Deming, for example, famously derided it. But I think there are two ways to interpret this quote, and they have very different takeaways.
One way to read this is to treat the word measure as a synonym for quantify. When John Allspaw rails against aggregate metrics like mean time to resolve (MTTR), he is siding with Deming in criticizing the idea of relying solely on aggregate, quantitative metrics for gaining insight into your system.
But there’s another way to interpret this aphorism, and it depends on an alternate interpretation of the word measure. I think that observing any kind of signal is a type of measurement. For example, if you’re having a conversation with someone, and you notice something in their tone of voice or their facial expression, then you’re engaged in the process of measurement. It’s not quantitative, but it represents information you’ve collected that you didn’t have before.
By generalizing the concept of measurement, I would recast this aphorism as: what you aren’t aware of, you can’t take into account.
This may sound like a banal observation, but the subtext here is “… and there’s a lot you aren’t taking into account.” A lot of things that are happening in your organization, your system, are largely invisible. And those things, that work, is keeping things up and running.
The concept that there’s invisible work happening that’s creating your availability is at the heart of the learning from incidents in software movement. And it isn’t obvious, even though we all experience it directly.
This invisible work is valuable in the sense that it’s contributing to keeping your system healthy. But the fact that it’s invisible is dangerous because it can’t be taken into account when decisions are made that change the system. For example, I’ve seen technological changes that have made it more difficult for the incident management team to diagnose what’s happening in the system. The teams who introduced those changes were not aware of how the folks on the incident management team were doing diagnostic work.
In particular, one of the dangers of an action-item-oriented approach to incident reviews is that you may end up introducing a change to the system that disrupts this invisible work.
Take the time to learn about the work that’s happening that nobody else sees. Because if you don’t see it, you may end up breaking it.