Critical digital services have yet to experience their “Three-Mile Island” event. Is such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of those lessons to inform how work is done and risk is anticipated for the future?
I don’t think the software world will ever experience such an event.
The effect of TMI
The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.
Before TMI, the prevailing theory of accidents was that they were caused by issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and confusable physical controls (e.g., the switch that lowers the landing gear looks similar to the switch that lowers the flaps).
But TMI was different. It’s not that the operators were doing the wrong things; they did the right things based on their understanding of what was happening. But that understanding, which was built from the information their instruments were giving them, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).
TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.
Why there won’t be a software TMI
TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.
As operators, when the system we operate is working properly, we use a functional description of the system to reason about its behavior.
Here’s an example, taken from my work on a delivery system. If somebody asks me, “Hey, Lorin, how do I configure my deployment so that a canary runs before it deploys to production?”, then I would tell them, “In your deliver config, add a canary constraint to the list of constraints associated with your production environment, and the delivery system will launch a canary and ensure it passes before promoting new versions to production.”
This type of description is functional; it’s the sort of verbiage you’d see in a functional spec. On the other hand, if an alert fires because the environment check rate has dropped precipitously, the first question I’m going to ask is, “did something deploy a code change?” I’m not thinking about function anymore; I’m thinking at the lowest level of abstraction.
Rasmussen calls this model a means-ends hierarchy, where the “ends” are at the top (the function: what you want the system to do), and the “means” are at the bottom (how the system is physically realized). We describe the proper function of the system top-down, and when we successfully diagnose a problem with the system, we explain the problem bottom-up.
The abstraction hierarchy, explained with an example
To explain the levels of the hierarchy, I’ll use the example of a car.
The functional purpose of the car is to get you from one place to another. But to make things simpler, let’s zoom in on the accelerator. The functional purpose of the accelerator is to make the car go faster.
The abstract functions include transferring power from the car’s power source to the wheels, as well as transferring information from the accelerator to that system about how much power should be delivered. You can think of abstract functions as being functions required to achieve the functional purpose.
The generalized functions are the generic functional building blocks you use to implement the abstract functions. In the case of the car, you need a power source, a mechanism for transforming the stored energy into mechanical energy, and a mechanism for transferring the mechanical energy to the wheels.
The physical functions capture how the generalized function is physically implemented. In an electric vehicle, your mechanism for transforming stored energy to mechanical energy is an electric motor; in a traditional car, it’s an internal combustion engine.
The physical form captures the construction details of how the physical function is implemented. For example, if it’s an electric vehicle that uses an electric motor, the physical form includes details such as where the motor is located in the car, what its dimensions are, and what materials it is made out of.
Applying the abstraction hierarchy to software
Although Rasmussen had physical systems in mind when he designed the hierarchy (his focus was on process control, and he worked at a lab that focused on nuclear power plants), I think the model can map onto software systems as well.
I’ll use the deployment system that I work on, Managed Delivery, as an example.
The functional purpose is to promote software releases through deployment environments, as specified by the service owner (e.g., first deploy to test environment, then run smoke tests, then deploy to staging, wait for manual judgment, then run a canary, etc.)
Here are some examples of abstract functions in our system.
There is an “environment check” control loop that evaluates whether each pending version of code is eligible for promotion to the next environment by checking its constraints.
There is a subsystem that listens for “new build” events and stores them in our database.
There is a “resource check” control loop that evaluates whether the currently deployed version matches the most recent eligible version.
For generalized functions, here are some larger scale building blocks we use:
a queue to consume build events generated by the CI system
a relational database to track the state of known versions
a workflow management system for executing the control loops
For the physical functions that realize the generalized functions:
SQS as our queue
MySQL Aurora as our relational database
Temporal as our workflow management system
For physical form, I would map these to:
source code representation (files and directory structure)
deployment representation (e.g., organization into clusters, geographical regions)
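The five-level mapping above can be sketched as a simple data structure, which also illustrates Rasmussen’s point that diagnosis traverses the hierarchy bottom-up. This is an illustrative sketch only: the names and the lookup function below are mine, not part of the actual Managed Delivery system.

```python
# Sketch: the five levels of Rasmussen's abstraction hierarchy, applied to
# a delivery system. All names here are illustrative, not a real data model.
abstraction_hierarchy = {
    "functional_purpose": "promote releases through environments per the owner's spec",
    "abstract_functions": [
        "environment check control loop",
        "new build event listener",
        "resource check control loop",
    ],
    "generalized_functions": [
        "queue", "relational database", "workflow management system",
    ],
    # generalized function -> physical function that realizes it
    "physical_functions": {
        "queue": "SQS",
        "relational database": "MySQL Aurora",
        "workflow management system": "Temporal",
    },
    "physical_form": [
        "source tree layout",
        "cluster/region deployment topology",
    ],
}

def explain_bottom_up(physical_name):
    """Diagnosis works bottom-up: given a misbehaving physical component,
    find which generalized function it realizes, then reason upward."""
    for generalized, physical in abstraction_hierarchy["physical_functions"].items():
        if physical == physical_name:
            return generalized
    return None
```

So if SQS is having a bad day, `explain_bottom_up("SQS")` points you back up to the “queue” building block, and from there to the abstract functions that depend on it.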
Consider: you don’t care about how your database is implemented until you hit some sort of operational problem that involves the database, and then you really do have to care about how it’s implemented to diagnose the problem.
Why is this useful?
If Rasmussen’s model is correct, then we should build operator interfaces that take the abstraction hierarchy into account. This approach is known as ecological interface design (EID): the abstraction hierarchy is explicitly represented in the user interface, to enable operators to more easily navigate the hierarchy as they do their troubleshooting work.
I have yet to see an operator interface that does this well in my domain. One of the challenges is that you can’t rely solely on off-the-shelf observability tooling: you need a model of the functional purpose and the abstract functions, and you need to build those models explicitly into your interface. This means that what we really need are toolkits that organizations can use to build custom interfaces that capture those top levels well. In addition, we’re generally lousy at building interfaces that traverse different levels: at best we have links from one system to another. I think the “single pane of glass” marketing suggests that people have some basic understanding of the problem (moving between different systems is jarring), but they haven’t actually figured out how to effectively move between levels in the same system.
Once upon a time, whenever I was involved in responding to an incident, and a teammate ended up diagnosing the failure mode, I would kick myself afterwards. How come I couldn’t figure out what was wrong? Why hadn’t I thought to do what they had done?
However, after enough exposure to the cognitive systems engineering literature, something finally clicked in my mind. When a group of people respond to an incident, it’s never the responsibility of a single individual to remediate. It can’t be, because we each know our own corners of the system better than our teammates. Instead, it is the responsibility of the group of incident responders as a whole to resolve the incident.
The group of incident responders, that ad-hoc team that forms in the moment, is what’s referred to as a joint cognitive system. It’s the responsibility of the individual responders to coordinate effectively so that the cognitive system can solve the problem. Often that involves dynamically distributing the workload so that individuals can focus on specific tasks.
When an incident happens, one of the causes is invariably identified as human error: somebody along the way made a mistake, did something they shouldn’t have done. For example: that engineer shouldn’t have done that clearly risky deployment and then walked away without babysitting it. Labeling an action as human error is an unfortunately effective way of ending an investigation (root cause: human error).
Some folks try to improve on the current status quo by arguing that, since human error is inevitable (people make mistakes!), it should be the beginning of the investigation, rather than the end. I respect this approach, but I’m going to take a more extreme view here: we can gain insight into how incidents happen, even those that involve operator actions as contributing factors, without reference to human error at all.
Since we human beings are physical beings, you can think of us as machines. Specifically, we are machines that make decisions and take action based on those decisions. Now, imagine that every decision we make involves our brain trying to maximize a function: when provided with a set of options, it picks the one that has the largest value. Let’s call this function g, for goodness.
(The neuroscientist Karl Friston has actually proposed something similar as a theory: organisms make decisions to minimize model surprise, a construct that Friston calls free energy).
In this (admittedly simplistic) model of human behavior, all decision making is based on an evaluation of g. Each person’s g will vary based on their personal history and based on their current context: what they currently see and hear, as well as other factors such as time pressure and conflicting goals. “History” here is very broad, as g will vary based not only on what you’ve learned in the past, but also on physiological factors like how much sleep you had last night and what you ate for breakfast.
Under this paradigm, if one of the contributing factors in an incident was the user pushing “A” instead of “B”, we ask “how did the operator’s g function score a higher value for pushing A over B”? There’s no concept of “error” in this model. Instead, we can explore the individual’s context and history to get a better understanding of how their g function valued A over B. We accomplish this by talking to them.
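As a toy sketch of this paradigm, imagine implementing g literally: a scoring function over options, followed by an argmax. Everything here is invented for illustration (the options, the context fields, the weights); the point is only that “A over B” is a fact about the function and its inputs, not an error.

```python
# Toy model: decision-making as maximizing a "goodness" function g.
# All options, context fields, and weights are invented for illustration.
def g(option, context):
    # g depends on the person's history and current context: perceived
    # value of each option, perceived effort, time pressure, and so on.
    score = context["perceived_value"].get(option, 0.0)
    score -= context["time_pressure"] * context["perceived_effort"].get(option, 0.0)
    return score

def decide(options, context):
    # Pick the option with the largest g-value. There is no notion of
    # "error" here, only a context under which one option scored higher.
    return max(options, key=lambda option: g(option, context))

context = {
    "perceived_value": {"push A": 1.0, "push B": 0.8},
    "perceived_effort": {"push A": 0.1, "push B": 0.9},
    "time_pressure": 0.5,
}
decide(["push A", "push B"], context)  # -> "push A"
```

Asking “why did they push A?” becomes asking “what context and history made g score A higher?”, which is a question you can only answer by talking to the person.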
I think the model above is much more fruitful than the one where we identify errors or mistakes. In this model, we have to confront the context and the history that a person was exposed to, because those are the factors that determine how decisions get made.
The idea of human error is a hard thing to shake. But I think we’d be better off if we abandoned it entirely.
Some additional reading on the idea of human error:
There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.
Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.
In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.
Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms.
To take another example from the software domain: imagine that you’re doing quarterly planning, there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, and so you try to estimate information like:
the probability of hitting a problem if the work isn’t done
the cost to the organization if the problem is encountered
the amount of effort involved in doing the reliability work
But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.
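To see how little the rigorous approach buys you, try carrying the uncertainty through the calculation instead of hiding it in point estimates. All the numbers below are invented; the sketch just propagates (low, high) ranges through a simple expected-value formula.

```python
# Sketch: the "rigorous" prioritization calculation, with uncertainty made
# explicit as (low, high) ranges. All the numbers here are invented.
def expected_value_range(p_range, cost_range, effort_range):
    # Worst case: low probability times low cost avoided, minus high effort.
    # Best case: high probability times high cost avoided, minus low effort.
    low = p_range[0] * cost_range[0] - effort_range[1]
    high = p_range[1] * cost_range[1] - effort_range[0]
    return (low, high)

# A hypothetical reliability task: is it worth doing?
value = expected_value_range(
    p_range=(0.01, 0.5),             # probability of hitting the problem
    cost_range=(10_000, 2_000_000),  # cost to the org if it happens
    effort_range=(20_000, 80_000),   # cost of doing the reliability work
)
# The resulting range spans "clearly not worth it" to "obviously worth it":
# the calculation doesn't resolve the ambiguity, it relocates it.
```

When the honest output of the spreadsheet is a range from strongly negative to strongly positive, the quantification hasn’t made the decision for you; you’re still in the swamp.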
Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.
You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sana Albolino, Richard Cook, and Michael O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.
(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research shows that computer models, which eliminate the ambiguity, were less useful for this sort of work.)
As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.
Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.
I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.
As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I actually think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work spans fields such as spaceflight, aviation, maritime, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.
We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.
Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.
There’s a famous paper by Gary Klein, Paul Feltovich, and David Woods, called Common Ground and Coordination in Joint Activity. Written in 2004, this paper discusses the challenges a group of people face when trying to achieve a common goal. The authors introduce the concept of common ground, which must be established and maintained by all of the participants in order for them to reach the goal together.
I’ve blogged previously about the concept of common ground, and the associated idea of the basic compact. (You can also watch John Allspaw discuss the paper at Papers We Love). Common ground is typically discussed in the context of high-tempo activities. The most popular example in our field is an ad hoc team of engineers responding to an incident.
The book Designing Engineers was originally published in 1994, ten years before the Common Ground paper, and so Louis Bucciarelli never uses the phrase. And yet, the book calls forward to the ideas of common ground, and applies them to the lower-tempo work of engineering design. Engineering design, Bucciarelli claims, is a social process. While some design work is solitary, much of it takes place in social interactions, from formal meetings to informal hallway conversations.
But Bucciarelli does more than discover the ideas of common ground: he extends them. Klein et al. talk about the importance of an agreed-upon set of rules, and the need to establish interpredictability: for participants to communicate to each other what they’re going to do next. Bucciarelli talks about how engineering design work involves actually developing the rules, making concrete the constraints that were initially uncertain. Instead of interpredictability, Bucciarelli talks about how engineers argue for specific interpretations of requirements based on their own interests. Put simply, where Klein et al. talk about establishing, sustaining, and repairing common ground, Bucciarelli talks about constructing, interpreting, and negotiating the design.
Bucciarelli’s book is fascinating because he reveals how messy and uncertain engineering work is, and how concepts that we may think of as fixed and explicit are actually plastic and ambiguous.
For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them. Bucciarelli tells an anecdote about the design of an array of solar cells to mount on a roof. The building codes put limits on how much weight a roof can support, but the code only discusses distributed loads, and one of the proposed designs rests on four legs, which would be a concentrated load. An engineer and an architect negotiate over the design of the mounting: the engineer favors a solution that’s easier for the engineering company but more work for the architect, while the architect favors a solution that is more work and expense for the engineering company. The two must negotiate to reach an agreement on the design, and the relevant building code must be interpreted in this context.
Bucciarelli also observes that the performance requirements given to engineers are much less precise than you would expect, and so the engineers must construct more precise requirements as part of the design work. He gives the example of a company designing a cargo x-ray system for detecting contraband. The requirement is that it should be able to detect “ten pounds of explosive”. As the engineers prepare to test their prototype, a discussion ensues: what is an explosive? Is it a device with wires? A bag of plastic? The engineers must define what an explosive means, and that definition becomes a performance requirement.
Even technical terms that sound well-defined are ambiguous, and may be interpreted differently by different members of the engineering design team. The author witnesses a discussion of “module voltage” for a solar power generator. But the term can refer to open circuit voltage, maximum power voltage, operating voltage, or nominal voltage. It is only through social interactions that this ambiguity is resolved.
What Bucciarelli also notices in his study of engineers is that they do not themselves recognize the messy, social nature of design: they don’t see the work that they do establishing common ground as the design work. I mentioned this in a previous blog post. And that’s really a shame. Because if we don’t recognize these social interactions as design work, we won’t invest in making them better. To borrow a phrase from cognitive systems engineering, we should treat design work as work that’s done by a joint cognitive system.
A conscription device is something that can be used to help recruit other people to get involved in a task. Henderson observed that mechanical engineers collaborate using diagrams, and that these diagrams play such a strong role that the engineers find they can’t work effectively without them. From the paper:
If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.
A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which has different meanings for the designer (support structure) than it does for someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor):
Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop. If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.
Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.
A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.
This sort of Google doc does function like a conscription device and boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.
I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system will be like, and so I consequently got feedback from different types of stakeholders.
My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. We probably don’t think of a Google doc as a software engineering tool. But it can be an extremely powerful one.
The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
Fred Brooks, The Mythical Man-Month
We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.
And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.
Software is discrete in a way that the world isn’t
We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.
Software as a medium, and data storage in particular, encourages over-simplification of the world: we need to categorize our data, figure out which tables to store it in and what values those columns should have, and many things in the world just aren’t easy to model well that way.
As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know?” But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.
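To make the baked-in assumptions concrete, here are three plausible definitions of “up”, each reasonable for a different caller. The cluster shape and all the fields below are hypothetical, purely for illustration.

```python
# Sketch: three hard-coded definitions of "up" for a cluster. Each bakes in
# assumptions; none can answer "it depends: why do you want to know?"
# The cluster data shape here is hypothetical.
def up_for_serving(cluster):
    # "Up" = at least one healthy instance is taking traffic.
    return any(i["healthy"] for i in cluster["instances"])

def up_for_deployment(cluster):
    # "Up" = every instance is healthy and running the desired version.
    return all(i["healthy"] and i["version"] == cluster["desired_version"]
               for i in cluster["instances"])

def up_for_capacity(cluster):
    # "Up" = enough healthy instances to handle the expected load.
    healthy = sum(1 for i in cluster["instances"] if i["healthy"])
    return healthy >= cluster["min_instances"]

cluster = {
    "desired_version": "v42",
    "min_instances": 3,
    "instances": [
        {"healthy": True, "version": "v42"},
        {"healthy": True, "version": "v41"},
        {"healthy": False, "version": "v42"},
    ],
}
# The very same cluster is "up" by one definition and "down" by the others.
```

A human answering the phone can ask why you want to know; the software has to pick one of these functions in advance.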
Software systems are limited in how they integrate inputs
In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road, watching the cars go by.
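A sketch of the counting logic shows just how impoverished that input is. The pulse encoding below is my own invention, not Jackson’s notation: the software sees nothing but a stream of “sensor a fired” and “sensor b fired” events.

```python
# Sketch of the traffic-counting problem: the software only ever sees a
# stream of sensor pulses. The "a"/"b" encoding is invented for illustration.
def count_cars(pulses):
    count = 0
    saw_a = False
    for pulse in pulses:
        if pulse == "a":
            saw_a = True
        elif pulse == "b" and saw_a:
            # An a-then-b pair is *interpreted* as one car passing. A
            # bicycle, a truck with many axles, or a car backing up can
            # all confound this interpretation, and the software has no
            # other signals with which to tell them apart.
            count += 1
            saw_a = False
    return count

count_cars(["a", "b", "a", "b", "a"])  # -> 2 complete crossings
```

The human observer at the roadside integrates sight, sound, and context; the program gets two bits per moment and must commit, in advance, to one interpretation of them.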
As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.
A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”
Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.
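The contrast is visible in even the simplest alerting rule. The thresholds and fields below are invented; the point is that the inputs are fixed at design time, and anything we didn’t hard-code, like whether the operator is already engaged, simply doesn’t exist for the software.

```python
# Sketch: an alerting rule can only integrate the signals we hard-coded in
# advance. All fields and thresholds here are invented for illustration.
def should_page(signal):
    # Fixed inputs, fixed logic: no way to notice that the operator is
    # already discussing the problem in Slack, or that this blip is a
    # known artifact of a deploy that just finished.
    return (signal["error_rate"] > 0.05
            and signal["duration_seconds"] > 300)

should_page({"error_rate": 0.10, "duration_seconds": 600})  # pages
should_page({"error_rate": 0.10, "duration_seconds": 60})   # too brief: silent
```

Every additional signal we want the rule to consider has to be anticipated, plumbed in, and weighted ahead of time, which is exactly the flexibility humans get for free.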
Recognizing the limits of software
The lesson here is that there are limits to how well software systems can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate larger input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need humans in the loop to make up for these sorts of shortcomings.
The goal is not simply to build better software systems, but to build better joint cognitive systems that are made up of humans and software together.