Critical digital services have yet to experience their “Three-Mile Island” event. Is such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of those lessons to inform how work is done and risk is anticipated for the future?
I don’t think the software world will ever experience such an event.
The effect of TMI
The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.
Before TMI, the prevailing theory of accidents was that they were caused by issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and confusable physical controls (e.g., the switch that lowers the landing gear looks similar to the switch that lowers the flaps).
But TMI was different. It’s not that the operators were doing the wrong things; they did the right things based on their understanding of what was happening. But that understanding, which was built from the information their instruments were giving them, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).
TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.
Why there won’t be a software TMI
TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.
As operators, when the system we operate is working properly, we use a functional description of the system to reason about its behavior.
Here’s an example, taken from my work on a delivery system. If somebody asks me, “Hey, Lorin, how do I configure my deployment so that a canary runs before it deploys to production?”, then I would tell them, “In your deliver config, add a canary constraint to the list of constraints associated with your production environment, and the delivery system will launch a canary and ensure it passes before promoting new versions to production.”
This type of description is functional; it’s the sort of verbiage you’d see in a functional spec. On the other hand, if an alert fires because the environment check rate has dropped precipitously, the first question I’m going to ask is, “did something deploy a code change?” I’m not thinking about function anymore; I’m thinking at the lowest level of abstraction.
Rasmussen calls this model a means-ends hierarchy, where the “ends” are at the top (the function: what you want the system to do), and the “means” are at the bottom (how the system is physically realized). We describe the proper function of the system top-down, and when we successfully diagnose a problem with the system, we explain the problem bottom-up.
The abstraction hierarchy, explained with an example
To explain the levels of the hierarchy, I’ll use the example of a car.
The functional purpose of the car is to get you from one place to another. But to make things simpler, let’s zoom in on the accelerator. The functional purpose of the accelerator is to make the car go faster.
The abstract functions include transferring power from the car’s power source to the wheels, as well as transferring information from the accelerator to that system about how much power should be delivered. You can think of abstract functions as being functions required to achieve the functional purpose.
The generalized functions are the generic functional building blocks you use to implement the abstract functions. In the case of the car, you need a power source, a mechanism for transforming the stored energy into mechanical energy, and a mechanism for transferring the mechanical energy to the wheels.
The physical functions capture how the generalized function is physically implemented. In an electric vehicle, your mechanism for transforming stored energy to mechanical energy is an electric motor; in a traditional car, it’s an internal combustion engine.
The physical form captures the construction details of how the physical function is implemented. For example, if it’s an electric vehicle that uses an electric motor, the physical form includes details such as where the motor is located in the car, what its dimensions are, and what materials it is made out of.
Applying the abstraction hierarchy to software
Although Rasmussen had physical systems in mind when he designed the hierarchy (his focus was on process control, and he worked at a lab that focused on nuclear power plants), I think the model can map onto software systems as well.
I’ll use the deployment system that I work on, Managed Delivery, as an example.
The functional purpose is to promote software releases through deployment environments, as specified by the service owner (e.g., first deploy to test environment, then run smoke tests, then deploy to staging, wait for manual judgment, then run a canary, etc.)
Here are some examples of abstract functions in our system.
There is an “environment check” control loop that evaluates whether each pending version of code is eligible for promotion to the next environment by checking its constraints.
There is a subsystem that listens for “new build” events and stores them in our database.
There is a “resource check” control loop that evaluates whether the currently deployed version matches the most recent eligible version.
For generalized functions, here are some larger scale building blocks we use:
a queue to consume build events generated by the CI system
a relational database to track the state of known versions
a workflow management system for executing the control loops
For the physical functions that realize the generalized functions:
SQS as our queue
MySQL Aurora as our relational database
Temporal as our workflow management system
For physical form, I would map these to:
source code representation (files and directory structure)
deployment representation (e.g., organization into clusters, geographical regions)
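The five-level mapping above can be sketched as a simple data structure, which also illustrates Rasmussen’s point that diagnosis traverses the hierarchy bottom-up. This is an illustrative sketch only: the names and the lookup function below are mine, not part of the actual Managed Delivery system.

```python
# Sketch: the five levels of Rasmussen's abstraction hierarchy, applied to
# a delivery system. All names here are illustrative, not a real data model.
abstraction_hierarchy = {
    "functional_purpose": "promote releases through environments per the owner's spec",
    "abstract_functions": [
        "environment check control loop",
        "new build event listener",
        "resource check control loop",
    ],
    "generalized_functions": [
        "queue", "relational database", "workflow management system",
    ],
    # generalized function -> physical function that realizes it
    "physical_functions": {
        "queue": "SQS",
        "relational database": "MySQL Aurora",
        "workflow management system": "Temporal",
    },
    "physical_form": [
        "source tree layout",
        "cluster/region deployment topology",
    ],
}

def explain_bottom_up(physical_name):
    """Diagnosis works bottom-up: given a misbehaving physical component,
    find which generalized function it realizes, then reason upward."""
    for generalized, physical in abstraction_hierarchy["physical_functions"].items():
        if physical == physical_name:
            return generalized
    return None
```

So if SQS is having a bad day, `explain_bottom_up("SQS")` points you back up to the “queue” building block, and from there to the abstract functions that depend on it.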
Consider: you don’t care about how your database is implemented until you hit some sort of operational problem that involves the database, and then you really do have to care about how it’s implemented to diagnose the problem.
Why is this useful?
If Rasmussen’s model is correct, then we should build operator interfaces that take the abstraction hierarchy into account. This approach is known as ecological interface design (EID): the abstraction hierarchy is explicitly represented in the user interface, to enable operators to more easily navigate the hierarchy as they do their troubleshooting work.
I have yet to see an operator interface that does this well in my domain. One of the challenges is that you can’t rely solely on off-the-shelf observability tooling: you need a model of the functional purpose and the abstract functions, and you need to build those models explicitly into your interface. This means that what we really need are toolkits that organizations can use to build custom interfaces that capture those top levels well. In addition, we’re generally lousy at building interfaces that traverse different levels: at best we have links from one system to another. I think the “single pane of glass” marketing suggests that people have some basic understanding of the problem (moving between different systems is jarring), but they haven’t actually figured out how to effectively move between levels in the same system.
Once upon a time, whenever I was involved in responding to an incident, and a teammate ended up diagnosing the failure mode, I would kick myself afterwards. How come I couldn’t figure out what was wrong? Why hadn’t I thought to do what they had done?
However, after enough exposure to the cognitive systems engineering literature, something finally clicked in my mind. When a group of people respond to an incident, it’s never the responsibility of a single individual to remediate. It can’t be, because we each know our own corners of the system better than our teammates. Instead, it is the responsibility of the group of incident responders as a whole to resolve the incident.
The group of incident responders, that ad-hoc team that forms in the moment, is what’s referred to as a joint cognitive system. It’s the responsibility of the individual responders to coordinate effectively so that the cognitive system can solve the problem. Often that involves dynamically distributing the workload so that individuals can focus on specific tasks.
When an incident happens, one of the causes is invariably identified as human error: somebody along the way made a mistake, did something they shouldn’t have done. For example: that engineer shouldn’t have done that clearly risky deployment and then walked away without babysitting it. Labeling an action as human error is an unfortunately effective way of ending an investigation (root cause: human error).
Some folks try to improve on the current status quo by arguing that, since human error is inevitable (people make mistakes!), it should be the beginning of the investigation, rather than the end. I respect this approach, but I’m going to take a more extreme view here: we can gain insight into how incidents happen, even those that involve operator actions as contributing factors, without reference to human error at all.
Since we human beings are physical beings, you can think of us as machines. Specifically, we are machines that make decisions and take action based on those decisions. Now, imagine that every decision we make involves our brain trying to maximize a function: when provided with a set of options, it picks the one that has the largest value. Let’s call this function g, for goodness.
(The neuroscientist Karl Friston has actually proposed something similar as a theory: organisms make decisions to minimize model surprise, a construct that Friston calls free energy).
In this (admittedly simplistic) model of human behavior, all decision making is based on an evaluation of g. Each person’s g will vary based on their personal history and based on their current context: what they currently see and hear, as well as other factors such as time pressure and conflicting goals. “History” here is very broad, as g will vary based not only on what you’ve learned in the past, but also on physiological factors like how much sleep you had last night and what you ate for breakfast.
Under this paradigm, if one of the contributing factors in an incident was the user pushing “A” instead of “B”, we ask “how did the operator’s g function score a higher value for pushing A over B”? There’s no concept of “error” in this model. Instead, we can explore the individual’s context and history to get a better understanding of how their g function valued A over B. We accomplish this by talking to them.
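As a toy sketch of this paradigm, imagine implementing g literally: a scoring function over options, followed by an argmax. Everything here is invented for illustration (the options, the context fields, the weights); the point is only that “A over B” is a fact about the function and its inputs, not an error.

```python
# Toy model: decision-making as maximizing a "goodness" function g.
# All options, context fields, and weights are invented for illustration.
def g(option, context):
    # g depends on the person's history and current context: perceived
    # value of each option, perceived effort, time pressure, and so on.
    score = context["perceived_value"].get(option, 0.0)
    score -= context["time_pressure"] * context["perceived_effort"].get(option, 0.0)
    return score

def decide(options, context):
    # Pick the option with the largest g-value. There is no notion of
    # "error" here, only a context under which one option scored higher.
    return max(options, key=lambda option: g(option, context))

context = {
    "perceived_value": {"push A": 1.0, "push B": 0.8},
    "perceived_effort": {"push A": 0.1, "push B": 0.9},
    "time_pressure": 0.5,
}
decide(["push A", "push B"], context)  # -> "push A"
```

Asking “why did they push A?” becomes asking “what context and history made g score A higher?”, which is a question you can only answer by talking to the person.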
I think the model above is much more fruitful than the one where we identify errors or mistakes. In this model, we have to confront the context and the history that a person was exposed to, because those are the factors that determine how decisions get made.
The idea of human error is a hard thing to shake. But I think we’d be better off if we abandoned it entirely.
Some additional reading on the idea of human error:
There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.
Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.
In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.
Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms.
To take another example from the software domain: imagine that you’re doing quarterly planning, there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, and so you try to estimate information like:
the probability of hitting a problem if the work isn’t done
the cost to the organization if the problem is encountered
the amount of effort involved in doing the reliability work
But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.
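To see how little the rigorous approach buys you, try carrying the uncertainty through the calculation instead of hiding it in point estimates. All the numbers below are invented; the sketch just propagates (low, high) ranges through a simple expected-value formula.

```python
# Sketch: the "rigorous" prioritization calculation, with uncertainty made
# explicit as (low, high) ranges. All the numbers here are invented.
def expected_value_range(p_range, cost_range, effort_range):
    # Worst case: low probability times low cost avoided, minus high effort.
    # Best case: high probability times high cost avoided, minus low effort.
    low = p_range[0] * cost_range[0] - effort_range[1]
    high = p_range[1] * cost_range[1] - effort_range[0]
    return (low, high)

# A hypothetical reliability task: is it worth doing?
value = expected_value_range(
    p_range=(0.01, 0.5),             # probability of hitting the problem
    cost_range=(10_000, 2_000_000),  # cost to the org if it happens
    effort_range=(20_000, 80_000),   # cost of doing the reliability work
)
# The resulting range spans "clearly not worth it" to "obviously worth it":
# the calculation doesn't resolve the ambiguity, it relocates it.
```

When the honest output of the spreadsheet is a range from strongly negative to strongly positive, the quantification hasn’t made the decision for you; you’re still in the swamp.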
Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.
You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sana Albolino, Richard Cook, and Michael O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.
(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research shows that computer models, which eliminate the ambiguity, were less useful for this sort of work.)
As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.
Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.
I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.
As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I actually think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work spans fields such as spaceflight, aviation, maritime, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.
We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.
Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.
There’s a famous paper by Gary Klein, Paul Feltovich, and David Woods, called Common Ground and Coordination in Joint Activity. Written in 2004, this paper discusses the challenges a group of people face when trying to achieve a common goal. The authors introduce the concept of common ground, which must be established and maintained by all of the participants in order for them to reach the goal together.
I’ve blogged previously about the concept of common ground, and the associated idea of the basic compact. (You can also watch John Allspaw discuss the paper at Papers We Love). Common ground is typically discussed in the context of high-tempo activities. The most popular example in our field is an ad hoc team of engineers responding to an incident.
The book Designing Engineers was originally published in 1994, ten years before the Common Ground paper, and so Louis Bucciarelli never uses the phrase. And yet, the book calls forward to the ideas of common ground, and applies them to the lower-tempo work of engineering design. Engineering design, Bucciarelli claims, is a social process. While some design work is solitary, much of it takes place in social interactions, from formal meetings to informal hallway conversations.
But Bucciarelli does more than discover the ideas of common ground: he extends them. Klein et al. talk about the importance of an agreed-upon set of rules, and the need to establish interpredictability: for participants to communicate to each other what they’re going to do next. Bucciarelli talks about how engineering design work involves actually developing the rules, making concrete the constraints that were initially uncertain. Instead of interpredictability, Bucciarelli talks about how engineers argue for specific interpretations of requirements based on their own interests. Put simply, where Klein et al. talk about establishing, sustaining, and repairing common ground, Bucciarelli talks about constructing, interpreting, and negotiating the design.
Bucciarelli’s book is fascinating because he reveals how messy and uncertain engineering work is, and how concepts that we may think of as fixed and explicit are actually plastic and ambiguous.
For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them. Bucciarelli tells an anecdote about the design of an array of solar cells to mount on a roof. The building codes put limits on how much weight a roof can support, but the code only discusses distributed loads, and one of the proposed designs rests on four legs, which would be a concentrated load. An engineer and an architect negotiate over the design of the mounting: the engineer favors a solution that’s easier for the engineering company but more work for the architect, while the architect favors a solution that is more work and expense for the engineering company. The two must negotiate to reach an agreement on the design, and the relevant building code must be interpreted in this context.
Bucciarelli also observes that the performance requirements given to engineers are much less precise than you would expect, and so the engineers must construct more precise requirements as part of the design work. He gives the example of a company designing a cargo x-ray system for detecting contraband. The requirement is that it should be able to detect “ten pounds of explosive”. As the engineers prepare to test their prototype, a discussion ensues: what is an explosive? Is it a device with wires? A bag of plastic? The engineers must define what an explosive means, and that definition becomes a performance requirement.
Even technical terms that sound well-defined are ambiguous, and may be interpreted differently by different members of the engineering design team. The author witnesses a discussion of “module voltage” for a solar power generator. But the term can refer to open circuit voltage, maximum power voltage, operating voltage, or nominal voltage. It is only through social interactions that this ambiguity is resolved.
What Bucciarelli also notices in his study of engineers is that they do not themselves recognize the messy, social nature of design: they don’t see the work that they do establishing common ground as the design work. I mentioned this in a previous blog post. And that’s really a shame. Because if we don’t recognize these social interactions as design work, we won’t invest in making them better. To borrow a phrase from cognitive systems engineering, we should treat design work as work that’s done by a joint cognitive system.
A conscription device is something that can be used to help recruit other people to get involved in a task. Henderson observed that mechanical engineers collaborate using diagrams, and that these diagrams play such a strong role that the engineers find they can’t work effectively without them. From the paper:
If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.
A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which has different meanings for the designer (support structure) than it does for someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor):
Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop. If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.
Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.
A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.
This sort of Google doc does function like a conscription device and boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.
I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system will be like, and so I consequently got feedback from different types of stakeholders.
My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. We probably don’t think of a Google doc as a software engineering tool. But it can be an extremely powerful one.
The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
Fred Brooks, The Mythical Man-Month
We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.
And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.
Software is discrete in a way that the world isn’t
We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.
Software as a medium, and data storage in particular, encourages over-simplification of the world: we need to categorize our data, figure out which tables to store it in and what values those columns should have, and many things in the world just aren’t easy to model well that way.
As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know?” But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.
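To make the baked-in assumptions concrete, here are three plausible definitions of “up”, each reasonable for a different caller. The cluster shape and all the fields below are hypothetical, purely for illustration.

```python
# Sketch: three hard-coded definitions of "up" for a cluster. Each bakes in
# assumptions; none can answer "it depends: why do you want to know?"
# The cluster data shape here is hypothetical.
def up_for_serving(cluster):
    # "Up" = at least one healthy instance is taking traffic.
    return any(i["healthy"] for i in cluster["instances"])

def up_for_deployment(cluster):
    # "Up" = every instance is healthy and running the desired version.
    return all(i["healthy"] and i["version"] == cluster["desired_version"]
               for i in cluster["instances"])

def up_for_capacity(cluster):
    # "Up" = enough healthy instances to handle the expected load.
    healthy = sum(1 for i in cluster["instances"] if i["healthy"])
    return healthy >= cluster["min_instances"]

cluster = {
    "desired_version": "v42",
    "min_instances": 3,
    "instances": [
        {"healthy": True, "version": "v42"},
        {"healthy": True, "version": "v41"},
        {"healthy": False, "version": "v42"},
    ],
}
# The very same cluster is "up" by one definition and "down" by the others.
```

A human answering the phone can ask why you want to know; the software has to pick one of these functions in advance.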
Software systems are limited in how they integrate inputs
In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road, watching the cars go by.
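A sketch of the counting logic shows just how impoverished that input is. The pulse encoding below is my own invention, not Jackson’s notation: the software sees nothing but a stream of “sensor a fired” and “sensor b fired” events.

```python
# Sketch of the traffic-counting problem: the software only ever sees a
# stream of sensor pulses. The "a"/"b" encoding is invented for illustration.
def count_cars(pulses):
    count = 0
    saw_a = False
    for pulse in pulses:
        if pulse == "a":
            saw_a = True
        elif pulse == "b" and saw_a:
            # An a-then-b pair is *interpreted* as one car passing. A
            # bicycle, a truck with many axles, or a car backing up can
            # all confound this interpretation, and the software has no
            # other signals with which to tell them apart.
            count += 1
            saw_a = False
    return count

count_cars(["a", "b", "a", "b", "a"])  # -> 2 complete crossings
```

The human observer at the roadside integrates sight, sound, and context; the program gets two bits per moment and must commit, in advance, to one interpretation of them.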
As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.
A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”
Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.
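The contrast is visible in even the simplest alerting rule. The thresholds and fields below are invented; the point is that the inputs are fixed at design time, and anything we didn’t hard-code, like whether the operator is already engaged, simply doesn’t exist for the software.

```python
# Sketch: an alerting rule can only integrate the signals we hard-coded in
# advance. All fields and thresholds here are invented for illustration.
def should_page(signal):
    # Fixed inputs, fixed logic: no way to notice that the operator is
    # already discussing the problem in Slack, or that this blip is a
    # known artifact of a deploy that just finished.
    return (signal["error_rate"] > 0.05
            and signal["duration_seconds"] > 300)

should_page({"error_rate": 0.10, "duration_seconds": 600})  # pages
should_page({"error_rate": 0.10, "duration_seconds": 60})   # too brief: silent
```

Every additional signal we want the rule to consider has to be anticipated, plumbed in, and weighted ahead of time, which is exactly the flexibility humans get for free.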
Recognizing the limits of software
The lesson here is that there are limits to how well software systems can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate larger input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need humans in the loop to make up for these sorts of shortcomings.
The goal is not simply to build better software systems, but to build better joint cognitive systems that are made up of humans and software together.