Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.
We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.
Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.
Yet, while we recognize the value of experienced candidates when we do hiring, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these sorts of experiences is what increases our expertise. But you have to reflect on them to maximize the value, and you have to share this information out to the organization so that it isn’t just the incident responders that can benefit from the experience.
To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.
Back when I was an engineering student, I wanted to know “How do the big companies develop software? How does it happen in the real world?”
Now that I work at a company that has to do large-scale software development, I understand better why it’s not something you can really teach effectively in a university setting. It’s not that companies doing large-scale software development are somehow better at writing software than companies that work on smaller-scale software projects. It’s that large-scale projects face challenges that small-scale projects don’t.
The biggest challenge at large-scale is coordination. My employer provides a single service, which means that, in theory, any project that anyone is working on inside of the company could potentially impact what anybody else is working on. In my specific case, I work on delivery tools, so we might be called upon to support some new delivery workflow.
You can take a top-down command-and-control style approach to the problem, by having the people at the top attempting to filter all of the information to just what they need, and them coordinating everyone hierarchically. However, this structure isn’t effective in dynamic environments: as the facts on the ground change, it takes too long for information to work its way up the hierarchy, adapt, and then change the orders downwards.
You can take a bottoms-up approach to the problem where you have a collection of teams that work autonomously. But the challenge there is getting them aligned. In theory, you hire people with good judgment, and provide them with the right context. But the problem is that there’s too much context! You can’t just firehose all of the available information to everyone, that doesn’t scale: everyone will spend all of their time reading docs. How do you get the information into the heads of the people that need it? becomes the grand challenge in this context.
It’s hard to convey the nature of this problem in a university classroom, if you’ve never worked in a setting like this before. The flurry of memos, planning documents, the misunderstandings, the sync meetings, the work towards alignment, the “One X” initiatives, these are all things that I had to experience viscerally, on a first-hand basis, to really get a sense of the nature of the problem.
During World War II, the British and the Americans were actively investing in developing radar technology in support of the war effort, with applications such as radar-based air defense. It turned out that developing the technology itself wasn’t enough: the new tech had to be deployed and operated effectively in the field to actually serve its purpose. Operating these systems required coordination between machines, human operators, and the associated institutions.
The British sent out scientists and engineers into the field to study and improve how these new systems were used. To describe this type of work, the physicist Albert Rowe coined the term operational research (OR), to contrast it with developmental research.
After the surprise attack on Pearl Harbor, the U.S. Secretary of War tapped the British radar pioneer Robert Wattson-Watt to lead a study on the state of American air defense systems. In the report, Wattson-Watt describe U.S. air defense as “insufficient organization applied to technically inadequate equipment used in exceptionally difficult conditions“. The report suggested the adoption of the successful British technology and techniques, which included OR.
At this time, the organization responsible for developmental research into weapons systems was the Office of Scientific Research and Development (OSRD), headed by Vannevar Bush. While a research function like OR seemed like it should belong under OSRD, there was a problem: Bush didn’t want it there. He wanted to protect the scientific development work of his organization from political interference by the military, and so he sought to explicitly maintain a boundary between the scientists and engineers that were developing the technology, and the military that was operating it.
[The National Defense Research Committee] is concerned with the development of equipment for military use, whereas these [OR] groups are concerned with the analysis of its performance, and the two points of view do not, I believe, often mix to advantage.
Vannevar Bush, quoted on p69
In the end, the demand for OR was too great, and Bush relented, creating the Office of Field Service within the OSRD.
Two things struck me reading this chapter. The first was that operational research was a kind of proto-DevOps, a recognition of the need to create a cultural shift in how development and operations work related to each other, and the importance of feedback between the two groups. It was fascinating to see the resistance to it from Bush. He wasn’t opposed to OR itself, he was opposed to unwanted government influence, which drove his efforts to keep development and operations separate.
The second thing that struck me was this idea of OR being doing research on operations. I had always thought of OR as being about things like logistics, basically graph theory problems addressed by faculty who work in business schools instead of computer science departments. But, here, OR was sending researchers into the field to study how operations was done in order to help improve development and operations. This reminded me very much of the aims of the learning from incidents (LFI) movement.
Managers rarely speak of objective criteria for achieving success because once certain crucial points in one’s career are passed, success and failure seem to have little to do with one’s accomplishments.
Almost all management books are prescriptive: they’re self-help books for managers. Moral Mazesis a very different kind of management book. Where most management books are written by management gurus, this one is written by a sociologist. This book is the result of a sociological study that the author conducted at three U.S. companies in the 1980s: a large textile firm, a chemical company, and a large public relations agency. He was interested in understanding the ethical decision-making process of American managers. And the picture he paints is a bleak one.
American corporations are organized into what Jackall calls patrimonial bureaucracies. Like the Prussian state, a U.S. company is organized as a hierarchy, with a set of bureaucratic rules that binds all of the employees. However, like a monarchy, people are loyal to individuals rather than offices. Effectively, it is a system of patronage, where leadership doles out privileges. Like in the court of King Louis XIV, factions within the organization jockey to gain favor.
With the exception of the CEO, all of the managers are involved in both establishing the rules of the game, and are bound by the rules. But, because the personalities of leadership play a strong role, and because leadership often changes over time, the norms are always contingent. When the winds change, the standards of behavior can change as well.
Managers are also in a tough spot because they largely don’t have control over the outcomes on which they are supposed to be judged. They are typically held responsible for hitting their numbers, but luck and timing play an enormous role over whether they are able to actually meet their objectives. As a result, managers are in a constant state of anxiety, since they are forever subject to the whims of fate. Failure here is socially defined, and the worst outcome is to be in the wrong place at the wrong time, and to have your boss say “you failed”.
Managers therefore focus on what they can control, which is the image they project. They put in a lot of hours, so they appear to be working hard. They strive to be seen as someone who is a team player, who behaves predictably and makes other managers feel comfortable. To stand out from their peers, they have to have the right style: the ability to relate to other people, to sell ideas, appear in command. To succeed in this environment, a manager needs social capital as well as the ability to adapt quickly as the environment changes.
Managers commonly struggle with decision making. Because the norms of behavior are socially defined, and because these norms change over time, they are forever looking to their peers to identify what the current norms are. Compounding the problem is the tempo of management work: because a manager’s daily schedule is typically filled with meetings and interrupts, with only fragmented views of problems being presented, there is little opportunity to gain a full view of problems and reflect deeply on them.
Making decisions is dangerous, and managers will avoid it when possible, even if this costs the organization in the long run. Jackall tells an anecdote about a large, old battery at a plant. The managers did not want to be on the hook for the decision to replace it, and so problems with it were patched up. Eventually, it failed completely, and the resulting cost to replace it and to deal with costs related to EPA violations and lawsuits was over $100M in 1979 dollars. And yet, this was still rational decision-making on behalf of the managers, because it was a risk for them in the short-term to make the decision to replace the battery.
Ethical decision making is particularly fraught here. Leadership wants success without wanting to be bothered with the messy details of how that success is achieved: a successful middle manager shields leadership from the details. Managers don’t have a professional ethic in the way that, say, doctors or lawyers do. Ethical guidelines are situational, they vary based on changing relationships. Expediency is a virtue, and a good manager is one who is pragmatic about decision making.
All moral issues are transmuted into practical concerns. Arguing based on morality rather than pragmatism is frowned upon, because moral arguments compel managers to act, and they need to be able to take stock of the social environment in order to judge whether a decision would be appropriate. Effective managers use social cues to help make decisions. They conform to what Jackall calls institutional logic: the ever-changing set of rules and incentives that the culture creates and re-creates to keep people’s perspectives and behaviors consistent and predictable.
There comes a time in every engineer’s career when you ask yourself, “do I want to go into management?” I’ve flirted with the idea in the past, but ultimately came down on the “no” side. After reading Moral Mazes, I’m more confident than ever that I made the right decision.
Here’s a scenario I frequently encounter: I’m working on writing up an incident or an OOPS. I’ve already interviewed key experts on the system, and based on those interviews, I understand the implementation details well enough to explain the failure mode in writing.
But, when I go to write down the explanation, I discover that I don’t actually have a good understanding of all of the relevant details. I could go back and ask clarifying questions, but I worry that I’ll have to do this multiple times, and I want to avoid taking up too much of other people’s time.
I’m now faced with a choice when describing the failure mode. I can either:
(a) Be intentionally vague about the parts that I don’t understand well.
(b) Make my best guess about the implementation details for the parts I’m not sure about.
Whenever I go with option (b), I always get some of the details incorrect. This becomes painfully clear to me when I show a draft to the key experts, and they tell me straight-out, “Lorin, this section is wrong.”
I call choosing option (b) taking the hit because, well, I hate the feeling of being wrong about something. However, I always try to go with this approach because this maximizes both my own learning and (hopefully) the learning of the readers. I take the hit. When you know that your work will be reviewed by an expert, it’s better to be clear and wrong than vague.
Imagine you’re walking around a university campus. It’s a couple of weeks after the spring semester has ended, and so there aren’t many people about. You enter a building and walk into one of the rooms. It appears to be some kind of undergraduate lab, most likely either physics or engineering.
In the lab, you come across a table. On the table, someone has balanced a rectangular block on its smallest end.
You nudge the top of the block. As expected, it falls over with a muted plonk. You look around to see if you might have gotten in trouble, but nobody’s around.
You come across another table. This table has some sort of track on it. The table also has a block on it that’s almost identical to the one on the other table, except that the block has a pin in it that connects it to some sort of box that is mounted on the track.
Not being able to resist, you nudge the top of this block, and it starts to fall. Then, the little box on the track whirs to life, moving along the track in the same direction that you nudged the block in. Because of the motion of the box, the block stays upright.
The ancient philosopher Aristotle believed that there were four distinct types of causes that explained why things happened. One of these is what Aristotle called the efficient cause: “Y behaved the way it did because X acted on Y“. For example, the red billiard ball moved because it was struck by the white ball. This is the most common way we think about causality today, and it’s sometimes referred to as linear causality.
Efficient cause does a good job of explaining the behavior of the first rectangular block in the anecdote: it fell over because we nudged it with our finger. But it doesn’t do a good job of explaining the observed behavior of the second rectangular block: we nudged it, and it started to fall, but it righted itself, and ended up balanced again.
Another type of cause Aristotle talked about was what he called the final cause. This is a teleological view of cause, which explains the behavior of objects in terms of their purpose or goal.
Final cause sounds like an archaic, unscientific view of the world. And, yet, reasoning with final cause enables us to explain the behavior of the second block more effectively than efficient cause does. That’s because the second block is being controlled by a negative-feedback system that was designed to keep the block balanced. The system will act to compensate for disturbances that could lead to the block falling over (like somebody walking over and nudging it). Because the output, a sensor that reads the angle of the block, is fed back into the input of the control system, the relationship between external disturbance and system behavior isn’t linear. This is sometimes referred to as circular causality, because of the circular nature of feedback loops.
The systems that we deal with contain goals and feedback loops, just like the inverted pendulum control system that keeps the block balanced. If you try to use linear causality to understand a closed-loop system, you will be baffled by the resulting behavior. Only when you understand the goals that the system is trying to achieve, and the feedback loops that it uses to adjust its behavior to reach those goals, will the resulting behavior make sense.
I’ve been on a bit of a control systems kick lately, and, serendipitously, I happened to see this tweet, which referenced a paper by Henry Yin at Duke University titled The crisis in neuroscience.
In the paper, Yin argues that neuroscience has failed to make progress in modeling human behavior because it tries to model the brain as a linear system, where you can study it by generating inputs and observing outputs.
Yin proposes an alternative model, that you need to view the brain as composed of a collection of hierarchical, closed-loop control systems in order to understand behavior from a neurological perspective.
Now, the cybernetics folks have long argued that you should model human brains as control systems. But Yin argues that the cyberneticists got an important thing wrong in their control models: their models were too close to engineering applications to be directly applicable to organisms.
In an engineered control system, a human operator specifies the set point. For example, for a cruise control system, you’d set desired speed. In the block diagram above, this set point is provided as the input to the system.
The output of the “Plant” block diagram is the current state of the variable you’re trying to control (e.g., current speed). The controller takes as input the difference between the set point and current state, and uses that to determine how to drive the plant (e.g., input to the motor).
Here’s a block diagram of everyone’s favorite control systems example, the thermostat:
I’ve used a double-arrow to indicate signals that propagate through the environment, and single-arrows to indicate signals that propagate through wires. I’ve put a red box around where Yin claims the cyberneticists hold as their model for control in animals.
The variable under control is the temperature. A human sets the desired temperature, and a temperature sensor reads the current temperature. The controller takes as input the difference between the desired temperature and the current temperature, and uses that to determine whether or not to turn on the furnace.
The actual temperature in the house is determined both by the output of the furnace, and by other factors (e.g., temperature outside, how good the insulation is, whether someone has opened a door), which I’ve labeled disturbance.
The problem with this, Yin argues, is that the red box is not a good model for the control that happens in the brain. As an alternative, he proposes the following model:
You can think of the red box as being the stuff inside some aspect of the brain, the “plant” as being the things that this aspect controls (e.g., other parts of the brain, muscles).
The difference in Yin’s model is that thecontroller determines the set point. There’s no external agent specifying the desired value as an input. Instead, the controller generates its own set point, which Yin calls the reference value.
Also note that Yin’s model includes the input function inside the red box. This takes sensory input at calculates the variable that’s under control. The difference between this model and the thermostat is that, in the thermostat model, you know from the outside that temperature is the variable being controlled. In Yin’s model, you can’t see from the outside what the variable is that’s being controlled for: the variable is internal to the control system.
Despite knowing nothing about neuroscience, and only knowing a bit about control systems, I still found this paper surprisingly accessible. I recommend it. There’s a lot more here than what I’ve touched on in this post.
There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.
Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.
In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.
Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms
To take another example from the software domain. Imagine that you’re doing quarterly planning, and there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, and so you try to estimate information like:
the probability of hitting a problem if the work isn’t done
the cost to the organization if the problem is encountered
the amount of effort involved in doing the reliability work
But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.
Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.
You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sana Albolino, Richard Cook, and Micahel O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.
(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts are that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research shows that computer models, which eliminate the ambiguity, were less useful for this sort of work).
As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.
Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.
There simply weren’t any publications about how to get started with doing incident investigations which were targeted at the infotech industry. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.
Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here“, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team).
Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at it, what questions you ask, will depend on what you’ve learned so far. But there are also commonalities across all investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.
When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.