Why you can’t buy an observability solution off the shelf

An effective observability solution is one that helps an operator quickly answer the question “Why is my system behaving this way?” This is a difficult problem to solve in the general case because it requires the successful combination of three very different things.

Most talk about observability is around vendor tooling. I’m using observability tooling here to refer to the software that you use for collecting, querying, and visualizing signals. There are a lot of observability vendors these days (for example, check out Gartner’s APM and Observability category). But the tooling is only one part of the story: at best, it can provide you with a platform for building an observability solution. The quality of that solution is going to depend on the quality of the system model and of the cognitive & perceptual models that are encoded in it.

A good observability solution will have embedded within it a hierarchical model of your system. This means it needs to provide information at every level: what the system is supposed to do (its functional purpose), the subsystems that make up the system, and all the way down to the low-level details of the components (specific lines of code, CPU utilization on the boxes, etc.).

As an example, imagine your org has built a custom CI/CD system that you want to observe. Your observability solution should give you feedback about whether the system is actually performing its intended functions: Is it building artifacts and publishing them to the artifact repository? Is it promoting them across environments? Is it running verifications against them?
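To make that concrete, here is a minimal sketch (my own illustration, not drawn from any particular tool or from the original post) of what signals expressed at the level of functional purpose might look like for such a CI/CD system. The check names and the placeholder logic are assumptions; in practice each check would query your metrics store or artifact repository.

```python
# Hypothetical functional-purpose checks for the custom CI/CD example.
# Each check answers "is the system doing what it's supposed to do?"
# rather than "is a particular component healthy?"
from dataclasses import dataclass
from typing import Callable


@dataclass
class FunctionalCheck:
    """A signal expressed at the level of the system's purpose."""
    purpose: str
    healthy: Callable[[], bool]


def artifacts_being_published() -> bool:
    # Placeholder: e.g., "artifacts pushed to the repository in the last hour > 0"
    return True


def promotions_completing() -> bool:
    # Placeholder: e.g., "promotions across environments finishing within SLO"
    return True


def verifications_completing() -> bool:
    # Placeholder: e.g., "verifications started == verifications completed"
    return True


FUNCTIONAL_CHECKS = [
    FunctionalCheck("build artifacts and publish them", artifacts_being_published),
    FunctionalCheck("promote artifacts across environments", promotions_completing),
    FunctionalCheck("run verifications against artifacts", verifications_completing),
]

for check in FUNCTIONAL_CHECKS:
    print(f"{check.purpose}: {'ok' if check.healthy() else 'DEGRADED'}")
```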

Now, imagine that there’s a problem with the verifications: the system is starting them but isn’t detecting that they’re complete. The observability solution should enable you to move from that function (verifications) to the subsystem that implements it. That subsystem is likely to include both application logic and external components (e.g., Temporal, MySQL). Your observability solution should have the relationships between the functional aspects and the relevant implementations encoded within it, to support moving up and down the abstraction hierarchy during diagnostic work. Building this model can’t be outsourced to a vendor: ultimately, only the engineers who are familiar with the implementation details of the system can create this sort of hierarchical model.
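One way to picture what “encoding the relationships” could mean is a small data structure that links each function to the subsystem and components that implement it, so that a degraded function points the operator at the next level down. This is a sketch under my own assumptions: the component names come from the example above, but the type names and dashboard URLs are invented.

```python
# Hypothetical encoding of the function -> subsystem -> component hierarchy
# for the verification example. A real solution might hold this in a service
# catalog or dashboard configuration; the links below are made up.
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    dashboards: list[str] = field(default_factory=list)  # where to look next


@dataclass
class Subsystem:
    name: str
    components: list[Component]


@dataclass
class Function:
    purpose: str  # what the system is supposed to do
    implemented_by: Subsystem


SYSTEM_MODEL = [
    Function(
        purpose="run verifications against artifacts",
        implemented_by=Subsystem(
            name="verification pipeline",
            components=[
                Component("verification worker (application logic)",
                          ["https://dashboards.example.com/verification-worker"]),
                Component("Temporal", ["https://dashboards.example.com/temporal"]),
                Component("MySQL", ["https://dashboards.example.com/mysql"]),
            ],
        ),
    ),
]


def drill_down(purpose: str) -> None:
    """Given a degraded function, list where an operator should look next."""
    for fn in SYSTEM_MODEL:
        if fn.purpose == purpose:
            print(f"function: {fn.purpose}")
            print(f"  subsystem: {fn.implemented_by.name}")
            for component in fn.implemented_by.components:
                print(f"    component: {component.name} -> {component.dashboards}")


drill_down("run verifications against artifacts")
```

The point isn’t this particular data structure; it’s that the links between levels live somewhere the tooling can use them, rather than only in engineers’ heads.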

Good observability tooling and a good system model alone aren’t enough. You need to understand how operators do this sort of diagnostic work in order to build an interface that supports it well. That means you need a cognitive model of the operator to understand how the work gets done. You also need a perceptual model of how humans effectively navigate through the world, so you can leverage it to build an interface that lets operators move fluidly across data sources from different systems. You need to build an observability solution with high visual momentum, so that operators don’t have to navigate through too many separate displays.

Every observability solution has an implicit system model and an implicit cognitive & perceptual model. However, even though work on the abstraction hierarchy and on visual momentum dates back to the 1980s, I’ve never seen that work, or the field of cognitive systems engineering in general, explicitly referenced. This work remains largely unknown in the tech world.

Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk:

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context is a description of the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

A small mistake does not a complex systems failure make

I’ve been trying to take a break from Twitter lately, but today I popped back in, only to be trolled by a colleague of mine:

Here’s a quote from the story:

The source of the problem was reportedly a single engineer who made a small mistake with a file transfer.

Here’s what I’d like you to ponder, dear reader. Think about all of the small mistakes that happen every single day, at every workplace on the face of the planet. If a small mistake were sufficient to take down a complex system, then our systems would be crashing all of the time. And yet, that clearly isn’t the case. For example, before this event, when was the last time the FAA suffered a catastrophic outage?

Now, it might be the case that no FAA employee had ever made a small mistake until now. Or, more likely, the FAA system works in such a way that small mistakes are not typically sufficient to take down the entire system.

To understand this failure mode, you need to understand how it is that the FAA system is able to stay up and running on a day-to-day basis, despite the system being populated by fallible human beings who are capable of making small mistakes. You need to understand how the system actually works in order to make sense of how a large-scale failure can happen.

Now, I’ve never worked in the aviation industry, and consequently I don’t have domain knowledge about the FAA system. But I can tell you one thing: a small mistake with a file transfer is a hopelessly incomplete explanation for how the FAA system actually failed.

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident has to process a large number of incoming signals. One example is the stream of Slack messages appearing in the incident channel as incident responders and others provide updates and ask questions. Another example would be additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to mediate through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication would not be officially sanctioned (“all incident-related communication should happen in the incident response channel!”), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.