The surprising power of a technical document written by experts

Good technical writing can have enormous influence. In my last blog post, I wrote about how technical reports written by management consultants can be used to support implementing a change program inside of an organization.

People underestimate how influential such technical documents can be. To be effective, they have to be written by experts. Management consultants are really just mercenary “experts”, but they aren’t the only kind of expert who can write an influential document.

I was recently listening to an episode of the Ezra Klein Show, in which climate scientist Kate Marvel was being interviewed by guest host David Wallace-Wells, when I heard another example of this phenomenon.

Here’s an excerpt from the transcript (emphasis added):

(Marvel) And in, I want to say 2018 because that was the release of the U.N.’s 1.5 degree Special Report — which, mea culpa, I was grouchy about.

I thought it was fan fiction. I thought, well, there’s no way we’re going to limit warming to 1.5 degrees. Why are you doing this? And oh, boy. What the world needs is another report. Great. Let’s do that again. And for reasons that I don’t understand, I was so wrong.

I was so wrong about how that was going to be received. I was so wrong about how that would land. And it started something. Now —

(Wallace-Wells) The same year that Greta started striking, the foundation of XR, the sit-in of Sunrise.

(Marvel) Sunrise. To talk about tipping points, that’s not something that I was able to anticipate. And now, I almost never get asked, is it real? I almost never get asked, well, what does climate change mean and why should I care? Instead, I get asked the really good questions about uncertainty, about what’s happening, about how we can prepare, about what we can do.

The irony here is that Marvel is a scientist, a professional whose primary output is technical documents! And yet Marvel didn’t recognize the impact that a technical report could have on the overall system. It didn’t actually matter that it’s not possible to limit warming to 1.5°C. What mattered was how the document itself ended up changing the system.

Don’t underestimate the power of a technical document. Like any effective system intervention, it has to happen at the right place and the right time. But, if it does, it can make a real difference.

On productivity metrics and management consultants

The management consulting firm McKinsey & Company recently posted a blog post titled Yes, you can measure software developer productivity. The post prompted a lot of responses, such as Kent Beck and Gergely Orosz’s Measuring developer productivity? A response to McKinsey, Dan North’s The Worst Programmer I Know, and John Cutler’s The Ultimate Guide to Developer Counter-Productivity.

Now, I’m an avowed advocate of qualitative approaches to studying software development, but I started out my academic research career on the quantitative side, doing research into developer productivity metrics. And so I started to read the McKinsey post with the intention of writing a response on why qualitative approaches are better for gaining insight into productivity issues. And I hope to write that post soon. But something jumped out at me that changed what I wanted to write about today. It was this line in particular (emphasis mine):

For example, one company that had previously completed a successful agile transformation learned that its developers, instead of coding, were spending too much time on low-value-added tasks such as provisioning infrastructure, running manual unit tests, and managing test data. Armed with that insight, it launched a series of new tools and automation projects to help with those tasks across the software development life cycle.

I realized that I had missed the whole point of the post. The goal isn’t to gain insight; it’s to justify funding a new program inside an organization.

To effect change in an organization, you need political capital, even if you’re an executive. That’s because organizational change is hard: programs are expensive and don’t bear fruit for a long time, so you need buy-in to make things happen.

McKinsey is a management consulting firm. One of the services that management consulting firms provide is that they will sell you political capital. They provide a report generated by external experts that their customers can use as leverage within their organizations to justify change programs.

As Lee Clarke describes in his book Mission Improbable: Using Fantasy Documents to Tame Disaster, technical reports written by experts have rhetorical, symbolic power, even if the empirical foundations of the reports are weak. Clarke’s book focuses on the unverified nature of disaster recovery documents, but the same holds true for reports based on software productivity metrics.

If you want to institute a change in a software development organization, and you don’t have the political capital to support it, then building a metrics program to justify your project is a pretty good strategy, if you can pull it off: define metrics that will support the outcome you want, get the metrics program in place, and then use it as ammunition for the new plan. (“We’re spending too much time on toil, we should build out a system to automate X”).

Of course, this sounds extremely cynical. You’re creating a metrics program where you know in advance what the metrics are going to show, with the purpose of justifying a new program you’ve already thought of? You’re claiming that you want to study a problem when you already have a proposed solution in the wings! But this is just how organizations work.

And, so, it makes perfect sense that McKinsey & Company would write a blog post like this. They are, effectively, a political-capital-as-a-service (PCaaS?) company. Helping executives justify programs inside of companies is what they do for a living. But they can’t simply state explicitly how the magic trick actually works, because then it won’t work anymore.

The danger is when the earnest folks, the ones who are seeking genuine insight into the nature of software productivity issues in their organization, read a post like this. Those are the ones I want to talk to about the value of a qualitative approach for gaining insight.

Operating effectively in high surprise mode

When you deploy a service into production, you need to configure it with enough resources (e.g., CPU, memory) so that it can handle the volume of requests you expect it to receive. You’ll want to provision it so that it can service 100% of the requests when receiving the typical amount of traffic, and you probably want some buffer in there as well.
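As a back-of-the-envelope illustration of “typical traffic plus a buffer” (the numbers below are invented for the example, not taken from any real service), the sizing arithmetic might look something like this:

    public class CapacityEstimate {
        public static void main(String[] args) {
            int expectedPeakRps   = 1_000;  // typical peak requests per second (assumed)
            int rpsPerInstance    = 200;    // measured capacity of one instance (assumed)
            double headroomFactor = 1.3;    // ~30% buffer on top of the typical peak

            // 1,000 * 1.3 / 200 = 6.5, rounded up to 7 instances
            int instances = (int) Math.ceil(expectedPeakRps * headroomFactor / rpsPerInstance);
            System.out.println("Provision " + instances + " instances");
        }
    }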

However, as a good operator, you also know that sometimes your service will receive an unexpected increase in traffic that’s large enough to push it beyond the resources you’ve provisioned for it, even with that extra buffer.

When your service is overloaded, even though it can’t service 100% of the requests, you want to design it so that it doesn’t simply keel over and service 0% of them. There are well-known patterns for designing a service to degrade gracefully in the face of overload, so that it can still service some requests and doesn’t get so overloaded that it can’t recover once the traffic abates. These patterns include rate limiters and circuit breakers. Michael Nygard’s book Release It! is a great source for this, and the concepts he describes have been implemented in libraries such as Hystrix and Resilience4j.
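To make the circuit-breaker idea concrete, here’s a minimal, hand-rolled sketch in Java. This is not the Hystrix or Resilience4j API, and the class name and thresholds are invented for illustration; it just shows the core mechanism: after enough consecutive failures, stop sending work to the struggling dependency and fail fast until a cooldown has elapsed.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    // Illustrative circuit breaker: shed load instead of piling up requests on an unhealthy dependency.
    public class SimpleCircuitBreaker {
        private enum State { CLOSED, OPEN }

        private final int failureThreshold;  // consecutive failures before the breaker opens
        private final Duration cooldown;     // how long to fail fast before trying again
        private State state = State.CLOSED;
        private int consecutiveFailures = 0;
        private Instant openedAt;

        public SimpleCircuitBreaker(int failureThreshold, Duration cooldown) {
            this.failureThreshold = failureThreshold;
            this.cooldown = cooldown;
        }

        public synchronized <T> T call(Supplier<T> request, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (Duration.between(openedAt, Instant.now()).compareTo(cooldown) < 0) {
                    return fallback.get();   // fail fast: don't hammer the struggling dependency
                }
                state = State.CLOSED;        // cooldown elapsed: resume sending real requests
                consecutiveFailures = 0;
            }
            try {
                T result = request.get();
                consecutiveFailures = 0;     // a success resets the failure count
                return result;
            } catch (RuntimeException e) {
                if (++consecutiveFailures >= failureThreshold) {
                    state = State.OPEN;      // too many failures in a row: open the breaker
                    openedAt = Instant.now();
                }
                return fallback.get();
            }
        }
    }

In practice you’d reach for a library like Resilience4j, which packages circuit breakers alongside rate limiters, retries, and bulkheads, rather than rolling your own, but this small state machine is the essence of the pattern.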

You can think of “expected number of requests” and “too many requests” as two different modes of operation of your service: you want to design it so that it performs well in both modes.

A service switching operational modes from “normal amount of requests” to “too many requests”

Now, imagine in the graph above, instead of the y-axis being “number of requests seen by the service”, it’s “degree of surprise experienced by the operators”.

As we humans navigate the world, we are constantly taking in sensory input. Imagine if I asked you, at regular intervals, “On a scale of 1-10, how surprised are you about your current observations of the world?”, and I plotted it on a graph like the one above. During a typical day, the way we experience the world isn’t too surprising. However, every so often, our observations of the world just don’t make sense to us. The things we’re seeing just shouldn’t be happening, given our mental models of how the world works. When it’s the software’s behavior that’s surprising, and that surprising behavior has a significant negative impact on the business, we call it an incident.

And, just like a software service behaves differently under a very high rate of inbound requests than it does under the typical rate, your socio-technical system (which includes your software and your people) is going to behave differently under high levels of surprise than it does under typical levels.

Similarly, just like you can build your software system to deal more effectively with overload, you can also influence your socio-technical system to deal more effectively with surprise. That’s really what the research field of resilience engineering is about: understanding how some socio-technical systems are more effective than others when working in high surprise mode.

It’s important to note that being more effective in high surprise mode is not the same as trying to eliminate surprises in the future. Adding more capacity to your software service enables it to handle more traffic, but it doesn’t help deal with the situation where the traffic exceeds even those extra resources. Rather, your system needs to be able to change what it does under overload. Similarly, saying “we are going to make sure we handle this scenario in the future” does nothing to improve your system’s ability to function effectively in high surprise mode.

I promise you, your system is going to enter high surprise mode in the future. The number of failure modes that you have eliminated does nothing to improve your ability to function well when that happens. While root cause analysis (RCA) will eliminate a known failure mode, learning from incidents (LFI) will help your system function better in high surprise mode.

Normal incidents

In 1984, the late sociologist Charles Perrow published the book Normal Accidents: Living with High-Risk Technologies. In it, he proposed that accidents are unavoidable in systems that have certain properties (interactive complexity and tight coupling), and that nuclear power plants have these properties. In such systems, accidents will inevitably occur during the normal course of operations.

You don’t hear much about Perrow’s Normal Accident Theory these days, as it has been superseded by other theories in safety science, such as High Reliability Organizations and Resilience Engineering (although see Hopkins’s 2013 paper Issues in safety science for criticisms of all three theories). But even rejecting the specifics of Perrow’s theory, the idea of a normal accident or incident is a useful one.

An incident is an abnormal event. Because of this, we assume, reasonably, that an incident must have an abnormal cause: something must have gone wrong in order for the incident to have happened. And so we look to find where the abnormal work was, where it was that someone exercised poor judgment that ultimately led to the incident.

But incidents can happen as a result of normal work, when everyone whose actions contributed to the incident was actually exercising reasonable judgment at the time they committed those actions.

This concept, that all actions and decisions that contributed to an incident were reasonable in the moment they were made, is unintuitive. It requires a very different conceptual model of how incidents happen. But, once you adopt this conceptual model, it completely changes the way you understand incidents. You shift from asking “what was the abnormal work?” to “how did this incident happen even though everyone was doing normal work?” And this yields very different insights into how the system actually works, how it is that incidents don’t usually happen due to normal work, and how it is that they occasionally do.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA

I think the value of the LFI approach rests on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model of how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software and what it does, and the humans who do the work to develop and operate the system.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).
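As an illustration of that last kind of decision, here’s a minimal sketch using Java’s standard java.net.http client (the endpoint and the specific timeout values are placeholders, not recommendations from the post), showing timeouts being set explicitly rather than left at their defaults:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class ExplicitTimeouts {
        public static void main(String[] args) throws Exception {
            // Set the connection timeout explicitly instead of relying on the default.
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))  // fail fast if we can't even connect
                    .build();

            // Bound how long we're willing to wait for the whole request.
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/health"))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }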

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

Active knowledge

Existential Comics is an extremely nerdy webcomic about philosophers, written and drawn by Corey Mohler, a software engineer(!). My favorite Existential Comics strip is titled Is a Hotdog a Sandwich? A Definitive Study. The topic is… exactly what you would expect.

At the risk of explaining a joke: the punchline is that we can conclude that a hotdog isn’t a sandwich because people don’t generally refer to hotdogs as sandwiches. In Wittgenstein’s view, the meaning of a phrase isn’t determined by a set of formal criteria. Instead, language is use.

In a similar spirit, in his book Designing Engineers, Louis Bucciarelli proposed that we should understand “knowing how something works” to mean knowing how to work it. He begins with an anecdote about telephones:

A few years ago, I attended a national conference on technological literacy… One of the main speakers, a sociologist, presented data he had gathered in the form of responses to a questionnaire. After a detailed statistical analysis, he had concluded that we are a nation of technological illiterates. As an example, he noted how few of us (less than 20 percent) know how our telephone works.

This statement brought me up short. I found my mind drifting and filling with anxiety. Did I know how my telephone works?

Bucciarelli tries to get at what the speaker actually intended by “knowing how a telephone works”.

I squirmed in my seat, doodled some, then asked myself, What does it mean to know how a telephone works? Does it mean knowing how to dial a local or long-distance number? Certainly I knew that much, but this does not seem to be the issue here.

He dives down a level of abstraction into physical implementation details.

No, I suspected the question to be understood at another level, as probing the respondent’s knowledge of what we might call the “physics of the device.”

I called to mind an image of a diaphragm, excited by the pressure variations of speaking, vibrating and driving a coil back and forth within a magnetic field… If this was what the speaker meant, then he was right: Most of us don’t know how our telephone works.

But then Bucciarelli continues to elaborate this scenario:

Indeed, I wondered, does [the speaker] know how his telephone works? Does he know about the heuristics used to achieve optimum routing for long distance calls? Does he know about the intricacies of the algorithms used for echo and noise suppression? Does he know how a signal is transmitted to and retrieved from a satellite in orbit? Does he know how AT&T, MCI, and the local phone companies are able to use the same network simultaneously? Does he know how many operators are needed to keep this system working, or what those repair people actually do when they climb a telephone pole? Does he know about corporate financing, capital investment strategies, or the role of regulation in the functioning of this expansive and sophisticated communication system?

Does anyone know how their telephone works?

At this point, I couldn’t help thinking of that classic tech interview question, “What happens when you type a URL into the address bar of your web browser and hit enter?” It’s a fun question to ask precisely because there are so many different aspects of the overall system that you could potentially dig into (Do you know how your operating system services keyboard interrupts? How your local Wi-Fi protocol works?). Can anyone really say that they understand everything that happens after hitting enter?

Because no individual possesses this type of comprehensive knowledge of engineered systems, Bucciarelli settles on a definition that relies on active knowledge: knowing-how-it-works as knowing-how-to-use-it.

No, the “knowing how it works” that has meaning and significance is knowing how to do something with the telephone—how to act on it and react to it, how to engage and appropriate the technology according to one’s needs and responsibilities.

I thought of Bucciarelli’s definition while reading Andy Clark’s book Surfing Uncertainty. In Chapter 6, Clark claims that our brain does not need to account for all of its sensory input to build a model of what’s happening in the world. Instead, it relies on simpler models that are sufficient for determining how to act (emphasis mine):

This may well result … in the use of simple models whose power resides precisely in their failing to encode every detail and nuance present in the sensory array. For knowing the world, in the only sense that can matter to an evolved organism, means being able to act in that world: being able to respond quickly and efficiently to salient environmental opportunities.

The through line that connects Wittgenstein, Bucciarelli, and Clark is the idea of knowledge as an active thing. Knowing implies using and acting. To paraphrase David Woods, knowledge is a verb.

Resilience requires helping each other out

A common failure mode in complex systems is that some part of the system hits a limit and falls over. In the software world, we call this phenomenon resource exhaustion, and a classic example of this is running out of memory.

The simplest solution to this problem is to “provision for peak”: to build out the system so that it always has enough resources to handle the theoretical maximum load. Alas, this solution isn’t practical: it’s too expensive. Even if you manage to overprovision the system, over time, it will get stretched to its limits. We need another way to mitigate the risk of overload.

Fortunately, it’s rare for every component of a system to reach its limit simultaneously: while one component might get overloaded, there are likely other components that have capacity to spare. That means that if one component is in trouble, it can borrow resources from another one.

Indeed, we see this sort of behavior in biological systems. In the paper Allostasis: A Model of Predictive Regulation, the neuroscientist Peter Sterling explains why allostasis is a better theory than homeostasis. Readers are probably familiar with the term homeostasis: it refers to how your body maintains factors in a narrow range, like keeping your body temperature around 98.6°F. Allostasis, on the other hand, is about how your body predicts where these sorts of levels should be, based on anticipated need. Your body then takes action to modify the current state of these levels. Here’s Sterling explaining why he thinks allostasis is superior, referencing the idea of borrowing resources across organs (emphasis mine):

A second reason why homeostatic control would be inefficient is that if each organ self-regulated independently, opportunities would be missed for efficient trade-offs. Thus each organ would require its own reserve capacity; this would require additional fuel and blood, and thus more digestive capacity, a larger heart, and so on – to support an expensive infrastructure rarely used. Efficiency requires organs to trade-off resources, that is, to grant each other short-term loans.

The systems we deal with are not individual organisms, but organizations that are made up of groups of people. In organization-style systems, this sort of resource borrowing becomes more complex. Incentives in the system might make me less inclined to lend you resources, even if doing so would lead to better outcomes for the overall system. In his paper The Theory of Graceful Extensibility: Basic rules that govern adaptive systems, David Woods borrows the term reciprocity from Elinor Ostrom to describe this property, where one agent in a system is willing to lend resources to another, as a necessary ingredient for resilience (emphasis mine):

Will the neighboring units adapt in ways that extend the [capacity for maneuver] of the adaptive unit at risk? Or will the neighboring units behave in ways that further constrict the [capacity for maneuver] of the adaptive unit at risk? Ostrom (2003) has shown that reciprocity is an essential property of networks of adaptive units that produce sustained adaptability.

I couldn’t help thinking of the Sterling and Woods papers when reading the latest issue of Nat Bennett’s Simpler Machines newsletter, titled What was special about Pivotal? Nat’s answer is reciprocity:

This isn’t always how it went at Pivotal. But things happened this way enough that it really did change people’s expectations about what would happen if they co-operated – in the game theory, Prisoner’s Dilemma sense. Pivotal was an environment where you could safely lead with co-operation. Folks very rarely “defected” and screwed you over if you led by trusting them.

People helped each other a lot. They asked for help a lot. We solved a lot of problems much faster than we would have otherwise, because we helped each other so much. We learned much faster because we helped each other so much.

And it was generally worth it to do a lot of things that only really work if everyone’s consistent about them. It was worth it to write tests, because everyone did. It was worth it to spend time fixing and removing flakes from tests, because everyone did. It was worth it to give feedback, because people changed their behavior. It was worth it to suggest improvements, because things actually got better.

There was a lot of reciprocity.

Nat’s piece is a good illustration of the role that culture plays in enabling a resilient organization. I suspect it’s not possible to impose this sort of culture; it has to be fostered. I wish this were more widely appreciated.

If you can’t tell a story about it, it isn’t real

We use stories to make sense of the world. What that means is that when events occur that don’t fit neatly into a narrative, we can’t make sense of them. As a consequence, these sorts of events are less salient, which means they’re less real.

In The Invisible Victims of American Anti-Semitism, Yair Rosenberg wrote in The Atlantic about the kinds of attacks targeting Jews that don’t get much attention in the larger media. His claim is that this happens when the attacks don’t fit into existing narratives about anti-Semitism (emphasis mine):

What you’ll also notice is that all of the very real instances of anti-Semitism discussed above don’t fall into either of these baskets. Well-off neighborhoods passing bespoke ordinances to keep out Jews is neither white supremacy nor anti-Israel advocacy gone awry. Nor can Jews being shot and beaten up in the streets of their Brooklyn or Los Angeles neighborhoods by largely nonwhite assailants be blamed on the usual partisan bogeymen.

That’s why you might not have heard about these anti-Semitic acts. It’s not that politicians or journalists haven’t addressed them; in some cases, they have. It’s that these anti-Jewish incidents don’t fit into the usual stories we tell about anti-Semitism, so they don’t register, and are quickly forgotten if they are acknowledged at all.

In The 1918 Flu Faded in Our Collective Memory: We Might ‘Forget’ the Coronavirus, Too, Scott Hershberger speculated in Scientific American along similar lines about why historians paid little attention to the Spanish Flu epidemic, even though it killed more people than World War I (emphasis mine):

For the countries engaged in World War I, the global conflict provided a clear narrative arc, replete with heroes and villains, victories and defeats. From this standpoint, an invisible enemy such as the 1918 flu made little narrative sense. It had no clear origin, killed otherwise healthy people in multiple waves and slinked away without being understood. Scientists at the time did not even know that a virus, not a bacterium, caused the flu. “The doctors had shame,” Beiner says. “It was a huge failure of modern medicine.” Without a narrative schema to anchor it, the pandemic all but vanished from public discourse soon after it ended.

I’m a big believer in the role of interactions, partial information, uncertainty, workarounds, tradeoffs, and goal conflicts as contributors to systems failures. I think the way to convince other people to treat these entities as first-class is to weave them into the stories we tell about how incidents happen. If we want people to see these things as real, we have to integrate them into narrative descriptions of incidents.

Because if we can’t tell a story about something, it’s as if it didn’t happen.

When there’s no plan for this scenario, you’ve got to improvise

An incident is happening. Your distributed system has somehow managed to get itself stuck in a weird state. There’s a runbook, but because the authors didn’t foresee this failure mode ever happening, the runbook isn’t actually helpful here. To get the system back into a healthy state, you’re going to have to invent a solution on the spot.

In other words, you’re going to have to improvise.

“We gotta find a way to make this fit into the hole for this using nothing but that.” – scene from Apollo 13

Like uncertainty, improvisation is an aspect of incident response that we typically treat as a one-off, rather than as a first-class skill that we should recognize and cultivate. Not every incident requires improvisation to resolve, but the hairiest ones will. And it’s these most complex of incidents that are the ones we need to worry about the most, because they’re the ones that are costliest to the business.

One of the criticisms of resilience engineering as a field is that it isn’t prescriptive. Often, when I talk about resilience engineering research, the response I hear is “OK, Lorin, that’s interesting, but what should I actually do?” I think resilience engineering is genuinely helpful, and in this case it teaches us that improvisation requires local expertise, autonomy, and effective coordination.

To improvise a solution, you have to be able to effectively use the tools and technologies you have on hand in the situation, what Claude Lévi-Strauss referred to as bricolage. That means you have to know what those tools are, and you have to be skilled in their use. That’s the local expertise part. You’ll often need to leverage what David Woods calls a generic capability in order to solve the problem at hand: some element of technology that wasn’t explicitly designed to do what you need, but is generic enough that you can use it anyway.

Improvisation also requires that the people with the expertise have the authority to take required actions. They’re going to need the ability to do risky things, which could potentially end up making things worse. That’s the autonomy part.

Finally, because of the complex nature of incidents, you will typically need to work with multiple people to resolve things. It may be that you don’t have the requisite expertise or autonomy, but somebody else does. Or it may be that the improvised strategy requires coordination across a group of people. I remember one incident where I was the incident commander: a problem was affecting a large number of services, and the only remediation strategy was to restart or re-deploy the affected services; we effectively had to “reboot the fleet”. The deployment tooling at the time didn’t support that sort of bulk activity, so we had to do it manually. A group of us, sitting in the war room (this was in pre-COVID days), divvied up the work of reaching out to all of the relevant service owners. We coordinated using Google Sheets. (In general, I’m opposed to writing automation scripts during an incident if doing the task manually is just as quick, because the blast radius of that sort of script is huge, and those scripts generally don’t get tested well before use because of the urgency.)

While we don’t know exactly what we’ll be called on to do during an incident, we can prepare to improvise. For more on this topic, check out Matt Davis’s piece on Site Reliability Engineering and the Art of Improvisation.