Conscription devices, boundary objects, and GDocs

In the paper Flexible Sketches and Inflexible Data Bases: Visual Communication, Conscription Devices, and Boundary Objects in Design Engineering, Kathryn Henderson writes about the role that sketches and drawings play in the work of mechanical engineers.

A conscription device is something that can be used to help recruit other people to get involved in a task: mechanical engineers collaborate using diagrams. These diagrams play such a strong role that the engineers find that they can’t work effectively without them. From the paper:

If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.

A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which has different meanings for the designer (support structure) than it does for someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor):

Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop.If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.

Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.

A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.

This sort of Google doc does function like a conscription device and boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.

I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system will be like, and so I consequently got feedback from different types of stakeholders.

My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. You probably don’t think of a Google doc as a software engineering tool. But it can be an extremely powerful one.

Coding as a tool of thought

With apologies to Ken Iverson.

Architects draw detailed plans before a brick is laid or a nail is hammered. Programmers and software engineers don’t. Can this be why houses seldom collapse and programs often crash?
Blueprints help architects ensure that what they are planning to build will work. “Working” means more than not collapsing; it means serving the required purpose. Architects and their clients use blueprints to understand what they are going to build before they start building it.
But few programmers write even a rough sketch of what their programs will do before they start coding.
Leslie Lamport, Why We Should Build Software Like We Build Houses

“My instinct is to go right to the board. I’m very graphic oriented. I can’t talk more than ten minutes without I [sic] start drawing pictures when we’re talking about the things I do. Even if I’m talking sports, I invariably start diagramming what’s going on. I feel comfortable with it or find it very effective.”
This designer, like the newly promoted engineer at Selco who fought to get her drafting board back, is pointing out his dependence on the visual process, of drawing both to communicate and to think out the initial design. He also states that the visual and manual thought process of drawing precedes the formulation of the written specifications for the project. Like the Selco designers and those at other sites, he emphasized the importance of drawing processes to work out ideas. [emphasis added]
Kathryn Henderson, On Line and On Paper: Visual Representations, Visual Culture, and Computer Graphics in Design Engineering

The quote above by Kathryn Henderson illustrates how mechanical engineers use the act of drawing to help them work on the design problem. By generating sketches and drawings, they develop a better understanding of the problem they trying to solve. They use drawing as a tool to help them think, to work through the problem.

As software engineers, we don’t work in a visual medium in the way that mechanical engineers do. And yet, we also use tools to help us think through the problem. It just so happens that the tool we use is code. I can’t speak for other developers, but I certainly use the process of writing code to develop a deeper understanding of the problem I’m trying to solve. As I solve parts of the problem with code, my mental model of the problem space and solution space develops throughout the process.

I think we use coding this way (I certainly do), because it feels to me like the fastest way to evolve this knowledge. If I had some sort of equivalent mechanism for sketching that was faster than coding for developing my understanding, I’d use it. But I know of no mechanism that’s actually faster than coding that will let me develop my understanding of the solution I’m working on. It just so happens that the quickest solution, code, is the same medium as the artifact that will ultimately end up in production. A mechanical engineer can never ship their sketches, but we can ship our code.

And this is a point that I think Leslie Lamport misses. I’m personally familiar with a number of different techniques for modeling in software, including TLA+, Alloy, statecharts, and decision tables. I’ve used them all, they are excellent tools for reasoning about the complexity of the system. But none of these tools really fulfill the role that sketching does for mechanical engineers (although Alloy’s fast feedback for incremental model building is a nice step in this direction).

TLA+ in particular is a wonderful tool. I’ve used it for things like understanding the linearizability paper, beating the CAP theorem, finding cycles in linked lists, solving the river crossing problem, and proving leftpad. But if you’re looking for an analogy in mechanical engineering, TLA+ is much closer to finite element analysis than it is to sketches.

Developers jump to coding not because they are sloppy, but because they have found it to be the most effective tool for sketching, for thinking about the problem and getting quick feedback as they construct their solution. And constructing a representation to develop a better understanding using the best tools they have available for the job to get quick feedback is what engineers do.

Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.

And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.

The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.

However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!

So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, to do this would require a huge code change because an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script to relink intermediate object files that avoids the need to change the code. And it works!

But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.

For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on wifi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!

There are further workarounds they can do to save space, like replace structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit: and the risk of catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.

The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment. They remove the old version of this app. They call this the “Yolo” release strategy.

They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this will decision will have a dollar of eight figures. With only a week to go before release, they drop iOS 8 and eat the cost to come get under the cellular download limit.

The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the project slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.

And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective C.

An Uber engineer in the Amsterdam office comes up with an innovative work around: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.

And yet, the risk of decompensation is ever-present: the strike team worries about their space saving wins will not be able to keep pace with the growth of the applications.

Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.

The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

Alright folks, gather round and let me tell you the story of (almost) the biggest engineering disaster I’ve ever had the misfortune of being involved in. It’s a tale of politics, architecture and the sunk cost fallacy [I’m drinking an Aberlour Cask Strength Single Malt Scotch] https://t.co/KWnZKIpycp
— McLaren Stanley (@StanTwinB) December 10, 2020

I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.

I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:

Software as a limited medium

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
Fred Brooks, The Mythical Man-Month

We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.

And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.

Software is discrete in a way that the world isn’t

We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.

Software as a medium, and data storage in particular, encourages over-simplification of the world, because we need to categorize our data, figure out which tables to store it in and what values those columns should have, and so many items in the world just aren’t easy to model well like that.

As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know”? But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.

And so, just like all engineers, given our time and resource constraints, we have to make over-simplifications to get our work done. William Kent wrote a whole book on this topic called Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World (h/t Hillel Wayne).

Software systems are limited in how they integrate inputs

In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when the cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road and counting the cars go by.

As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.

A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”

Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.

Recognizing the limits of software

The lesson here is that there are limits to how well software system can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate large input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need to have humans-in-the-loop to make up for these sorts of shortcomings.

The goal is not to build better software systems, but how to build better joint cognitive systems that are made up of humans and software together.

Top-down code reviews

I’ve long been frustrated by the task of code reviews. Often, the pull request (PR) I’m reviewing involves a part of the codebase I’m not intimately familiar with. I read it, not quite understanding it, looking to see if I can offer some sort of useful feedback, and typically that feedback would be on the micro level (e.g., “you can simplify this function by calling this other library function instead”).

I recently started experimenting with a new review approach that I’m going to call top down code review. Here’s how it works: I start by understanding the code well enough so that I can write my own version of the pull request message, describing the PR in my own words. After I’ve done this, then I provide feedback.

I call this approach “top down” because the review that I end up generating starts with a “top down” description of the PR: the problem it’s trying to solve, and the solution approach, before diving into describing notable implementation details. Here are the reviews I’ve done in this style so far:

I’ve been finding this approach useful because it forces me to come to terms with how well I really understand the PR. If I can’t explain the PR in my own words, then I don’t really understand it. It also helps me figure out what questions to ask the original author to help clarify things for me.

I also get more of a sense of closure after doing the review. Even if I had no feedback to give, I understand the changes in a way that I didn’t before.