Conscription devices, boundary objects, and GDocs

In the paper Flexible Sketches and Inflexible Data Bases: Visual Communication, Conscription Devices, and Boundary Objects in Design Engineering, Kathryn Henderson writes about the role that sketches and drawings play in the work of mechanical engineers.

A conscription device is something that can be used to help recruit other people to get involved in a task: mechanical engineers collaborate using diagrams. These diagrams play such a strong role that the engineers find that they can’t work effectively without them. From the paper:

If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.

A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which has different meanings for the designer (support structure) than it does for someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor).:

Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop.If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.

Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.

A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.

This sort of Google doc does function like a conscription device and boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.

I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system will be like, and so I consequently got feedback from different types of stakeholders.

My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. We probably don’t think of Google doc as a software engineering tool. But it can be an extremely powerful one.

Software as a limited medium

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Fred Brooks, The Mythical Man-Month

We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.

And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.

Software is discrete in a way that the world isn’t

We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.

Software as a medium, and data storage in particular, encourages over-simplification of the world, because we need to categorize our data, figure out which tables to store it in and what values those columns should have, and so many items in the world just aren’t easy to model well like that.

As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know”? But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.

And so, just like all engineers, given our time and resource constraints, we have to make over-simplifications to get our work done. William Kent wrote a whole book on this topic called Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World (h/t Hillel Wayne).

Software systems are limited in how they integrate inputs

In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when the cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road and counting the cars go by.

As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.

A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”

Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.

Recognizing the limits of software

The lesson here is that there are limits to how well software system can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate large input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need to have humans-in-the-loop to make up for these sorts of shortcomings.

The goal is not to build better software systems, but how to build better joint cognitive systems that are made up of humans and software together.

The power of functionalism

Most software engineers are likely familiar with functional programming. The idea of functionalism, focusing on the “what” rather than the “how”, doesn’t just apply to programming. I was reminded of how powerful a functionalist approach is this week as while I’ve been attending the STAMP workshop. STAMP is an approach to systems safety developed by Nancy Leveson.

The primary metaphor in STAMP is the control system: STAMP employs a control system model to help reason about the safety of a system. This is very much a functionalist approach, as it models agents in the system based only on what control actions they can take and what feedback they can receive. You can use this same model to reason about a physical component, a software system, a human, a team, an organization, even a regulatory body. As long as you can identify the inputs your component receives, and the control actions that it can perform, you can model it as a control system.

Cognitive systems engineering (CSE) uses a different metaphor: that of a cognitive system. But CSE also takes a functional approach, observing how people actually work and trying to identify what functions their actions serve in the system. It’s a bottom-up functionalism where STAMP is top-down, so it yields different insights into the system.

What’s appealing to me about these functionalist approaches is that they change the way I look at a problem. They get me to think about the problem or system at hand in a different way than I would have if I didn’t take a deliberately take a functional approach. And “it helped me look at the world in a different way” is the highest compliment I can pay to a technology.

SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the background of SRE here, but there are many other sources (including the Google books) for those who want to know more about the background.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, where researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focused expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on behalf of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", what I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and where we try to stay away from.

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the system boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that holds across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering , which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.

I’ve also started a CSE reading list.

Why you can’t just ask “why”

Today, most AI work is based on neural networks, but back in the 1980s, AI researchers were using a different approach: they built rule-based systems using mathematical logic. This was the heyday of Lisp and Prolog, which were well-suited towards implementing these systems.

One approach AI researchers used was to sit down with an expert and elicit the rules they used to perform a task. For example, an AI researcher might conduct a series of interviews with a doctor in order to determine how the doctor diagnosed illnesses based on symptoms. The researcher would then encode those rules to build an expert system: a software package that would, ideally, perform tasks as well as an expert.

Alas, the results were disappointing: these expert systems never measured up to the performance of those human experts. Two brothers: Stuart Dreyfus (a professor of industrial engineering and operations research) and Hubert Dreyfus (a professor of philosophy) published a book in 1998 titled Mind Over Machine that described why this approach to building expert systems by eliciting and encoding rules from experts could never really work. It turns out that experts don’t actually solve problems by following a set of rules. Instead, they rely more on intuition and pattern-matching based on a repertoire of cases they’ve built up from their experience1.

Yet, even though those experts didn’t solve problems by following rules, they were still able to articulate a set of rules that they claim to follow when asked. And they weren’t trying to deceive the AI researchers. Instead, something else was going on. The experts were inventing explanations without even being aware that they were doing so. Philosophers of mind use the term confabulation (technically broad confabulation) to refer to this phenomenon: how people will unknowingly fabricate explanations for their actions.

And therein lies the problem of asking “why”.

In the wake of an incident, we often want to understand why it is people did certain things: both for the people whose actions contributed to the incident (why did they make a global configuration change?) and for people whose actions mitigated the incident (why did they suspect a retry storm rather than a DDOS attack?)

The problem is, you can’t just ask people why, because people confabulate. You can, of course, simply ask people why they took the actions they did. Heck, you might even get a confident, articulated explanation. But you shouldn’t believe that the explanation they give corresponds to reality.

Yet, getting at the why is important. This is not a case of “‘Why?’ is the wrong question the way that Five Whys style questions are. There is real value in understanding how people came to the decisions they did, by learning about the signals they received at the time, and how their previous experiences shaped their perspectives. That’s where having a skilled interviewer comes in.

A skilled interviewer will increase the chances of getting an accurate response by asking questions to bring the interviewee back into the frame of mind that they were in during the incident. Instead of asking for an engineer to explain their actions (Why did you do X?), they’ll ask questions to try to jog their memory of what they were experiencing during the incident: What were you doing when the page went off? Where did you look first? What did you see? And then what did you do? Because we know that experts do pattern-matching, they’ll also ask questions like, have you ever seen this symptom before? These questions can elicit responses about previous experiences they’ve had in similar situations, which can provide context on how they made their decisions in this case.

Eliciting this sort of information from an interview is hard, and it takes real skill. We should take this sort of work seriously.

1The field of research known as naturalistic decision making studies how experts make decisions.

Asking the right “why” questions

In the [Cognitive Systems Engineering] terminology, it is more important to understand what a joint cognitive system (JCS) does and why it does it, than to explain how it does it. [emphasis in the original]

Erik Hollnagel & David D. Woods, Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, p22

In my previous post, I linked to a famous essay by John Allspaw: The Infinite Hows (or, the Dangers Of The Five Whys). The main thrust of Allspaw’s essay can be summed up in this five word excerpt:

“Why?” is the wrong question.

As illustrated by the quote from Hollnagel & Woods at the top of this post, it turns out that cognitive systems engineering (CSE) is very big on answering “why” questions. Allspaw’s perspective on incident analysis is deeply influenced by research from cognitive systems engineering. So what’s going on here?

It turns out that the CSE folks are asking different kinds of “why” questions than the root cause analysis (RCA) folks. The RCA folks ask why did this incident happen? The CSE folks ask why did the system adapt the sorts of behaviors that contributed to the incident?

Those questions may sound similar, but they start from opposite assumptions. The RCA folks start with the assumption that there’s some sort of flaw in the system, a vulnerability that was previously unknown, and then base their analysis on identifying what that vulnerability was.

The CSE folks, on the other hand, start with the assumption that behaviors exhibited by the system developed through adaptation to existing constraints. The “why” question here is “why is this behavior adaptive? What purpose does it serve in the system?” Then they base the analysis on identifying attributes of the system such as constraints and goal conflicts that would explain why this behavior is adaptive.

This is one of the reasons why the CSE folks are so interested in incidents to begin with: because it can expose these kinds of constraints and conflicts that are part of the context of a system. It’s similar to how psychologists use optical illusions to study the heuristics that the human visual system employs: you look at the circumstances under which a system fails to get some insight into how it normally functions as well as it does.

“Why” questions can be useful! But you’ve got to ask the right ones.