On work processes and outcomes

Here’s a stylized model of work processes and outcomes. I’m going to call it “Model I”.

Model I: Work process and outcomes

If you do work the right way, that is, follow the proper processes, then good things will happen. And, when we don’t, bad things happen. I work in the software world, so by “bad outcome” a mean an incident, and by “doing the right thing”, the work processes typically refer to software validation activities, such as reviewing pull requests, writing unit tests, manually testing in a staging environment. But it also includes work like adding checks in the code for unexpected inputs, ensuring you have an alert defined to catch problems, having someone else watching over your shoulder when you’re making a risky operational change, not deploying your production changes on a Friday, and so on. Do this stuff, and bad things won’t happen. Don’t do this stuff, and bad things will.

If you push someone who believes in this model, you can get them to concede that sometimes nothing bad happens even though someone didn’t do everything can quite right, the amended model looks like this:

Inevitably, an incident happens. At that point, we focus the post-incident efforts on identifying what went wrong with the work. What was the thing that was done wrong? Sometimes, this is individuals who weren’t following the process (deployed on a Friday afternoon!). Other times, the outcome of the incident investigation is a change in our work processes, because the incident has revealed a gap between “doing the right thing” and “our standard work processes”, so we adjust our work processes to close the gap. For example, maybe we now add an additional level of review and approval for certain types of changes.


Here’s an alternative stylized model of work processes and outcomes. I’m going to call it “Model II”.

Model II: work processes and outcomes

Like our first model, this second model contains two categories of work processes. But the categories here are different. They are:

  1. What people are officially supposed to
  2. What people actually do

The first categorization is an idealized view of how the organization thinks that people should do their work. But people don’t actually do their work their way. The second category captures what the real work actually is.

This second model of work and outcomes has been embraced by a number of safety researchers. I deliberately called my models as Model I and Model II as a reference to Safety-I and Safety-II. Safety-II is a concept developed by the resilience engineering researcher Dr. Erik Hollnagel. The human factor experts Dr. Todd Conklin and Bob Edwards describe this alternate model using a black-line/blue-line diagram. Dr. Steven Shorrock refers to the first category as work-as-prescribed, and the second category as work-as-done. In our stylized model, all outcomes come from this second category of work, because it’s the only one that captures the actual work that leads to any of the outcomes. (In Shorrock’s more accurate model, the two categories of work overlap, but bear with me here).

This model makes some very different assumptions about the nature of how incidents happen! In particular, it leads to very different sorts of questions.

The first model is more popular because it’s more intuitive: when bad things happen, it’s because we did things the wrong way, and that’s when we look back in hindsight to identify what those wrong ways were. The second model requires us to think more about the more common case when incidents don’t happen. After all, we measure our availability in 9s, which means the overwhelming majority of the time, bad outcomes aren’t happening. Hence, Hollnagel encourages us to spend more time examining the common case of things going right.

Because our second model assumes that what people actually do usually leads to good outcomes, it will lead to different sorts of questions after an incident, such as:

  1. What does normal work look like?
  2. How is it that this normal work typically leads to successful outcomes?
  3. What was different in this case (the incident) compared to typical cases?

Note that this second model doesn’t imply that we should always just keep doing things the same way we always do. But it does imply that we should be humble in enforcing changes to the way work is done, because the way that work is done today actually leads to good outcomes most of the time. If you don’t understand how things normally work well, you won’t see how your intervention might make things worse. Just because your last incident was triggered by a Friday deploy doesn’t mean that banning Friday deploys will lead to better outcomes. You might actually end up making things worse.

Whither dashboard design?

The sorry state of dashboards

It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But my naive understanding is that the primary way an operator interacts with Honeycomb is by querying it. And it sounds like a very nifty tool for doing that, I’ve certainly felt the absence of being able do high-cardinality queries when trying to narrow down where a problem is, and I would love to have access to a tool like that.

But we humans didn’t evolve to query our environment, we evolved to navigate it, and we have a very sophisticated visual system to help us navigate a complex world. Honeycomb does leverage the visual system by generating visualizations, but you submit the query first, and then you get the visualization.

In principle, a well-designed dashboard would engage our visual system immediately: look first, get a clue about where to look next, and then take the next diagnostic step, whether that’s explicitly querying, or navigating to some other visualization. The problem, which Charity illustrates in her tweet, is that we consistently design our dashboards poorly. Given how much information is potentially available to us, we aren’t good at designing dashboards that work well with our human brains to help us navigate all of that information.

Dashboard research of yore

Now, back in the 80s and 90s, for many physical systems that were supervised by operators (think: industrial control systems, power plants, etc.), dashboards was all they had. And there was some interesting cognitive systems engineering research back then about how to design dashboards that took into account what we knew about the human perceptual and cognitive systems.

For example, there was a proposed approach for designing user interfaces for operators called ecological interface design, by Kim Vicente and Jens Rasmussen. Vicente and Rasmussen were both engineering researchers who worked in human factors (Vicente’s background was in industrial and mechanical engineering, Rasmussen’s in electronic engineering). They co-wrote an excellent paper titled Ecological Interface Design: Theoretical Foundations. Ecological Interface Design builds on Rasmussen’s previous work on the abstraction hierarchy, which he developed based on studying how technicians debugged electronic circuits. It also builds on his skills, rules, and knowledge (SRK) framework.

More tactically, David Woods published a set of concepts to better leverage the visual system called visual momentum. These concepts including supporting check-reads (at-a-glance information), longshots, perceptual landmarks, and display overlaps. For more details, see the papers Visual Momentum: A Concept to Improve the Cognitive Coupling of Person and Computer and How Not to Have to Navigate Through Too Many Displays.

What’s the state of dashboard design today?

I’m not aware of anyone in our industry working on the “how do we design better dashboards?” question today. As far as I can tell, discussions around observability these days center more around platform-y questions, like:

  • What kinds of observability data should we collect?
  • How should we store it?
  • What types of queries should we support?

For example, here’s Charity Majors, on “Observability 2.0: How do you debug?“, on the third bullet (emphasis mine):

You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

I’d like to see our industry improve the check your instrumentation part of that to make it easier to identify if something looks off, providing cues about where to look next. To be explicit:

  1. I always want the ability to query my system in the way that Honeycomb supports, with high-cardinality drill-down and correlations.
  2. I always want to start off with a dashboard, not a query interface

In other words, I always want to start off with a dashboard, and use that as a jumping-off point to do queries.

And, maybe there are folks out there in observability-land working on how to improve dashboard design. But, if so, I’m not aware of that work. Just looking at the schedule from Monitorama 2024, the word “dashboard” does not even appear at once.

And that makes me sad. Because, while not everyone has access to tooling like Honeycomb, everyone has access to dashboards. And the state-of-the-dashboard doesn’t seem like it’s going to get any better anytime soon.

Action item template

We’re thrilled that you want to contribute to improving the system in the wake of an incident! For each post-incident action that you are proposing, we would appreciate it if you would fill out the following template.

Please estimate the expected benefits associated with implementing the action item. For example, if this reduces risk, by how much? Please document your risk model. How will you validate this estimate?

Please estimate the costs associated with implementing the proposed action items. In particular:

  • What are the costs in engineering effort (person-days of work) to do the initial implementation?
  • What are the ongoing maintenance costs in terms of engineering effort?
  • What are the additional infrastructure costs?

In addition, please estimate the opportunity costs associated with this action item: if this action item is prioritized, what other important work will be deprioritized as a result? What were the expected benefits of the deprioritized work? How do these unrealized benefits translate into additional costs or risks?

Given that we know we can never implement things perfectly (otherwise the incident wouldn’t have happened, right?), what are the risks associated with a bug or other error when implementing the proposed action item?

Even if the action item is implemented flawlessly, the resulting change in behavior can lead to unforeseen interactions with other parts of the system. Please generate a list of potential harmful interactions that could arise when this action item is implemented. Please be sure to track these and refer back to them if a future incident occurs that involves this action item, to check how well we are able to reason about such interactions.

More generally: will the proposed action item increase or decrease the overall complexity of the system? If it will increase complexity, compare what the costs and/or risks are of the resulting increase in complexity, and compare these to the proposed benefits of the implemented action item.

Will the proposed action item increase or decrease the overall cognitive load on people? If it will increase cognitive load, please estimate the expected magnitude of this increase, and document a plan for evaluating the actual increase after the action item has been implemented.

Beyond cognitive load, is this action going to prevent or otherwise make more difficult any work that goes on today? How will you identify whether this is the case or not? Please document your plan for measuring the resultant increase in difficulty due to the action item.

More generally: will the implementation of this action item lead to people changing the way they do they work? What sort of workarounds or other adaptations may occur as a result, and what are the associated risks of these? What are all of the different types of work that go on in the organization that will be impacted by this change? How will you verify that your list is complete? Please document your plan for studying how the work has actually changed in response to this action item, and how you will contrast the findings with your expectations.

Once the action item has been completed, how are you going to track it unexpectedly contributes to an incident in the future? Please outline your plan for how we will maintain accountability for the impact of completed action items.

The perils of outcome-based analysis

Imagine you wanted to understand how to get better at playing the lottery. You strike upon a research approach: study previous lottery winners! You collect a list of winners, look them up, interview them about how they go about choosing their numbers, collate this data, identify patterns, and use these to define strategies for picking numbers.

The problem with this approach is that it doesn’t tell you anything about how effective these strategies actually are. To really know how well these strategies work, you’d have to look at the entire population of people who employed them. For example, say that you find that most lottery winners use their birthdays to generate winning numbers. It may turn out, that for every winning ticket that has the ticket holder’s birthday, there are 20 million losing tickets that also have the ticket holder’s birthday. To understand a strategy’s effectiveness, you can’t just look at the winning outcomes: you have to look at the losing outcomes as well. The technical term for this type of analytic error is selecting on the dependent variable.

Here’s another example of this error in reasoning: according to the NHTSA, 32% of all traffic crash fatalities in the United States involve drunk drivers. That means that 68% of all traffic crash fatalities involve sober drivers. If you only look at scenarios that involve crash fatalities, it looks like being sober is twice as dangerous as being drunk! It’s a case of only looking at the dependent variable: crash fatalities. If we were to look at all driving scenarios, we’d see that there are a lot more sober drivers than drunk drivers, and that any given sober driver is less likely to get into a crash fatality than a given drunk driver. Being sober is safer, even though sober drivers appear more often in fatal accidents than drunk drivers.

Now, imagine an organization that holds a weekly lottery. But it’s a bizarro-world type of lottery: if someone wins, then they receive a bad outcome instead of a good one. And the bad outcome doesn’t just impact the “winner” (although they are impacted the most), it has negative consequences for the entire organization. Nobody would willingly participate in such a lottery, but everyone in the organization is required to: you can’t opt out. Every week, you have to buy a ticket, and hope the numbers you picked don’t come up.

The organization wants to avoid these negative outcomes, and so they try to identify patterns in how previous lottery “winners” picked their numbers, so that they can reduce the likelihood of future lottery wins by warning people against using these dangerous number-picking strategies.

At this point, the comparison to how we treat incidents should be obvious. If we only examine people’s actions in the wake of an incident, and not when things go well, then we fall into the trap of selecting on the dependent variable.

The real-world case is even worse than the lottery case: lotteries really are random, but that way that people do their work isn’t; rather, it’s adaptive. People do work in specific ways because they have found that it’s an effective way to get stuff done given that the constraints that they are under. The only way to really understand why people work the way they do is to understand how those adaptations usually succeed. Unless you’re really looking for it, you aren’t going to be able to learn how people develop successful adaptations if you only ever examine the adaptations when they fail. Otherwise, you’re just doing the moral equivalent of asking what lottery winners have in common.

When there’s no gemba to go to

I’m finally trying to read through some Toyota-related books to get a better understanding of the lean movement. Not too long ago, I read Sheigo Shingo’s Non-Stock Production: The Shingo System of Continuous Improvement, and sitting on my bookshelf for a future read is James Womack, Daniel Jones, and Daniels Roos’s The Machine That Changed the World: The Story of Lean Production.

The Toyota-themed book I’m currently reading is Mike Rother’s Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results. Rother often uses the phrase “go and see”, as in “go to the shop floor and observe how the work is actually being done”. I’ve often heard lean advocates use a similar phrase: go the gemba, although Rother himself doesn’t use it in his book. There’s a good overview at the Lean Enterprise Institute’s web page for gemba:

Gemba (現場) is the Japanese term for “actual place,” often used for the shop floor or any place where value-creating work actually occurs. It is also spelled genba. Lean Thinkers use it to mean the place where value is created. Japanese companies often supplement gemba with the related term “genchi gembutsu” — essentially “go and see” — to stress the importance of empiricism.

The idea of focusing on understanding work-as-done is a good one. Unfortunately, in software development in particular, and knowledge work in general, the place that the work gets done is distributed: it happens wherever the employees are sitting in front of their computers. There’s no single place, no shop floor, no gemba that you can go to in order to go and see the work being done.

Now, you can observe the effects of the work, whether it’s artifacts generated (pull requests, docs), or communication (slack messages, emails). And you can talk to people about the work that they do. But, it’s not like going to the shop floor. There is no shop floor.

And it’s precisely because we can’t go to the gemba that incident analysis can bring so much value, because it allows you to essentially conduct a miniature research project to try to achieve the same goal. You get granted some time (a scarce resource!) to reconstruct what happened, by talking to people and looking at those work products generated over time. If we’re good at this, and we’re lucky, we can get a window into how the real work happens.

Why you can’t buy an observability solution off the shelf

An effective observability solution is one that helps an operator quickly answer the question “Why is my system behaving this way?” This is a difficult problem to solve in the general case because it requires the successful combination of three very different things.

Most talk about observability is around vendor tooling. I’m using observability tooling here to refer to the software that you use for collecting, querying, and visualizing signals. There are a lot of observability vendors these days (for example, check out Gartner’s APM and Observability category). But the tooling is only one part of the story: at best, it can provide you with a platform for building an observability solution. The quality of that solution is going to depend on how good your system model is and your cognitive & perceptual model is that it’s encoded in your solution.

A good observability solution will have embedded within it a hierarchical model of your system. This means it needs to provide information about what the system is supposed to do (functional purpose), the subsystems that make up the system, all of the way down to the low-level details of the components (specific lines of code, CPU utilization on the boxes, etc).

As an example, imagine your org has built a custom CI/CD system that you want to observe. Your observability solution should give you feedback about whether the system is actually performing its intended functions: Is it building artifacts and publishing them to the artifact repository? Is it promoting them across environments? Is it running verifications against them?

Now, imagine that there’s a problem with the verifications: it’s starting them but isn’t detecting that they’re complete. The observability solution should enable you to move from that function (verifications) to the subsystem that implements it. That subsystem is likely to include both application logic and external components (e.g., Temporal, MySQL). Your observability solution should have encoded within it the relationships between the functional aspects and the relevant implementations to support the work of moving up and down the abstraction hierarchy while doing the diagnostic work. Building this model can’t be outsourced to a vendor: ultimately, only the engineers who are familiar with the implementation details of the system can create this sort of hierarchical model.

Good observability tooling and a good system model alone aren’t enough. You need to understand how operators do this sort of diagnostic work in order to build an interface that supports this sort of work well. That means you need a cognitive model of the operator to understand how the work gets done. You also need a perceptual model of how it is that humans effectively navigate through the world so you can leverage this model to build an interface that will enable operators to move fluidly across data sources from different systems. You need to build an operational solution with high visual momentum to avoid an operator needing to navigate through too many displays.

Every observability solution has an implicit system model and an implicit cognitive & perceptual model. However, while work on the abstraction hierarchy dates back to the 1980s, and visual momentum goes back to the 1990s, I’ve never seen this work, or the field of cognitive systems engineering in general, explicitly referenced. This work remains largely unknown in the tech world.

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure

All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.

Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”

Goal conflicts

Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.

If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”

Workarounds

How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)

Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.

If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools that have to use?”

Automation surprises

Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).

If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”


Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.

There is no “Three Mile Island” event coming for software

In Critical Digital Services: An Under-Studied Safety-Critical Domain, John Allspaw asks:

Critical digital services has yet to experience its “Three-Mile Island” event. Is
such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of
those lessons to inform how work is done and risk is anticipated for the future?

I don’t think the software world will ever experience such an event.

The effect of TMI

The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.

Before TMI, the prevailing theories of accidents was that they were because of issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and mixing up physical controls (e.g., switch that lowers the landing gear looks similar to switch that lowers the flaps).

But TMI was different. It’s not that the operators were doing the wrong things, but rather that they did the right things based on their understanding of what was happening, but their understanding of what was happening, which was based on the information that they were getting from their instruments, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).

TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.

Why there won’t be a software TMI

TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

We’ll continue to have Therac-25s, Knight Capitals, Air France 447s, 737 Maxs, 911 outages, Rogers outages, and Tesla autopilot deaths. Some of them will cause enormous loss of human life, and will result in legislative responses. But no such accident will compel the software industry to, as Allspaw puts it, take human performance seriously.

Our only hope is that the software industry eventually learns the lessons that the safety science learned from the original TMI.

Up and down the abstraction hierarchy

As operators, when the system we operate is working properly, we use a functional description of the system to reason about its behavior.

Here’s an example, taken from my work on a delivery system. if somebody asks me, “Hey, Lorin, how do I configure my deployment so that a canary runs before it deploys to production?”, then I would tell them, “In your deliver config, add a canary constraint to the list of constraints associated with your production environment, and the delivery system will launch a canary and ensure it passes before promoting new versions to production.”

This type of description is functional; It’s the sort of verbiage you’d see in a functional spec. On the other hand, if an alert fires because the environment check rate has dropped precipitously, the first question I’m going to ask is, “did something deploy a code change?” I’m not thinking about function anymore, but I’m thinking of the lowest level of abstraction.

In the mid nineteen-seventies, the safety researcher Jens Rasmussen studied how technicians debugged electronic devices, and in the mid-eighties he proposed a cognitive model about how operators reason about a system when troubleshooting, in a paper titled the role of hierarchical knowledge representation in decisionmaking and system management. He called this model the abstraction hierarchy.

Rasmussen calls this model a means-ends hierarchy, where the “ends” are at the top (the function: what you want the system to do), and the “means” are at the bottom (how the system is physically realized). We describe the proper function of the system top-down, and when we successfully diagnose a problem with the system, we explain the problem bottom-up.

The abstraction hierarchy, explained with an example

Depicts the five levels of the abstraction hierarchy:

1. functional purpose
2. abstract functions
3. general functions
4. physical functions
5. physical form
The five levels of the abstraction hierarchy

To explain these, I’ll use the example of a car.

The functional purpose of the car is to get you from one place to another. But to make things simpler, let’s zoom in on the accelerator. The functional purpose of the accelerator is to make the car go faster.

The abstract functions include transferring power from the car’s power source to the wheels, as well as transferring information from the accelerator to that system about how much power should be delivered. You can think of abstract functions as being functions required to achieve the functional purpose.

The generalized functions are the generic functional building blocks you use to implement the abstract functions. In the case of the car, you need a power source, you need a mechanism for transforming the stored energy to mechanical energy, a mechanism for transferring the mechanical energy to the wheels.

The physical functions capture how the generalized function is physically implemented. In an electric vehicle, your mechanism for transforming stored energy to mechanical energy is an electric motor; in a traditional car, it’s an internal combustion engine.

The physical form captures the construction detail of how the physical function. For example, if it’s an electric vehicle that uses an electric motor, the physical form includes details such as where the motor is located in the car, what its dimensions are, and what materials it is made out of.

Applying the abstraction hierarchy to software

Although Rasmussen had physical systems in mind when he designed the hierarchy (his focus was on process control, and he worked at a lab that focused on nuclear power plants), I think the model can map onto software systems as well.

I’ll use the deployment system that I work on, Managed Delivery, as an example.

The functional purpose is to promote software releases through deployment environments, as specified by the service owner (e.g., first deploy to test environment, then run smoke tests, then deploy to staging, wait for manual judgment, then run a canary, etc.)

Here are some examples of abstract functions in our system.

  • There is an “environment check” control loop that evaluates whether each pending version of code is eligible for promotion to the next environment by checking its constraints.
  • There is a subsystem that listens for “new build” events and stores them in our database.
  • There is a “resource check” control loop that evaluates whether the currently deployed version matches the most recent eligible version.

For generalized functions, here are some larger scale building blocks we use:

  • a queue to consume build events generated by the CI system
  • a relational database to track the state of known versions
  • a workflow management system for executing the control loops

For the physical functions that realize the generalized functions:

  • SQS as our queue
  • MySQL Aurora as our relational database
  • Temporal as our workflow management system

For physical form, I would map these to:

  • source code representation (files and directory structure)
  • binary representation (e.g., container image, Debian package)
  • deployment representation (e.g., organization into clusters, geographical regions)

Consider: you don’t care about how your database is implemented, until you’re getting some sort of operational problem that involves the database, and then you really have to care about how it’s implemented to diagnose the problem.

Why is this useful?

If Rasmussen’s model is correct, then we should build operator interfaces that take the abstraction hierarchy into account. Rasmussen called this approach ecological interface design (EID), where the abstraction hierarchy is explicitly represented in the user interface, to enable operators to more easily navigate the hierarchy as they do their troubleshooting work.

I have yet to see an operator interface that does this well in my domain. One of the challenges is that you can’t rely solely on off-the-shelf observability tooling, because you need to have a model of the functional purpose and the abstract functions to build those models explicitly into your interface. This means that what we really need are toolkits so that organizations can build custom interfaces that can capture those top levels well. In addition, we’re generally lousy at building interfaces that traverse different levels: at best we have links from one system to another. I think the “single pane of glass” marketing suggests that people have some basic understanding of the problem (moving between different systems is jarring), but they haven’t actually figured out how to effectively move between levels in the same system.

Incident response is a team sport

Once upon a time, whenever I was involved in responding to an incident, and a teammate ended up diagnosing the failure mode, I would kick myself afterwards. How come I couldn’t figure out what was wrong? Why hadn’t I thought to do what they had done?

However, after enough exposure to the cognitive systems engineering literature, something finally clicked in my mind. When a group of people respond to an incident, it’s never the responsibility of a single individual to remediate. It can’t be, because we each know our own corners of the system better than our teammates. Instead, it is the responsibility of the group of incident responders as a whole to resolve the incident.

The group of incident responders, that ad-hoc team that forms in the moment, is what’s referred to as a joint cognitive system. It’s the responsibility of the individual responders to coordinate effectively so that the cognitive system can solve the problem. Often that involves dynamically distributing the workload so that individuals can focus on specific tasks.

Resolving incidents is a team effort. Go team!