Incident response is a team sport

Once upon a time, whenever I was involved in responding to an incident, and a teammate ended up diagnosing the failure mode, I would kick myself afterwards. How come I couldn’t figure out what was wrong? Why hadn’t I thought to do what they had done?

However, after enough exposure to the cognitive systems engineering literature, something finally clicked in my mind. When a group of people respond to an incident, it’s never the responsibility of a single individual to remediate. It can’t be, because we each know our own corners of the system better than our teammates. Instead, it is the responsibility of the group of incident responders as a whole to resolve the incident.

The group of incident responders, that ad-hoc team that forms in the moment, is what’s referred to as a joint cognitive system. It’s the responsibility of the individual responders to coordinate effectively so that the cognitive system can solve the problem. Often that involves dynamically distributing the workload so that individuals can focus on specific tasks.

Resolving incidents is a team effort. Go team!

You’re just going to sit there???

Here’s a little story about something that happened last year.

A paging alert fires for a service that a sibling team manages. I’m the support on-call, meaning that I answered support questions about the delivery engineering tooling. That means my only role here is to communicate with internal users about an ongoing issue. Since I don’t know this service at all, there isn’t much else for me to do: I’m just a bystander, watching the Slack messages from the sidelines.

The operations on-call he acknowledges the page and starts digging to figure out what’s gone wrong. As he’s investigating, he’s providing updates about his progress by posting Slack messages to the on-call channel. At one point, he types this message:

Anyway… we’re dead in the water until this figures itself out.

I’m… flabbergasted. He’s just going to sit there and hope that the system becomes healthy again on its own? He’s not even going to try and remediate? Much to my relief, after a few minutes, the service recovered.

Talking to him the next day, I discovered that he had taken a remediation action: he failed over a supporting service from the primary to the secondary. His comment was referring to the fact that the service was going to be down until the failover completed. Once the secondary became the new primary, things went back to normal.

When I looked back at the Slack messages, I noticed that he had written messages to communicate that he was failing over the primary. But he had also mentioned that his initial attempt at failover didn’t work, as the operational UX was misleading. What happened was that I had misinterpreted the Slack message. I thought his attempt to fail over had simply failed entirely, and he was out of ideas.

Communicating effectively over Slack during a high-tempo event like an incident is challenging. It can be especially difficult if you don’t have a prior working relationship with the people in the ad-hoc incident response team, which can happen when an incident spans multiple teams. Getting better at communicating during an incident is a skill, both for individuals and organizations as a whole. It’s one I think we don’t pay enough attention to.

Imagine there’s no human error…

When an incident happens, one of the causes is invariably identified as human error: somebody along the way made a mistake, did something they shouldn’t have done. For example: that engineer shouldn’t have done that clearly risky deployment and then walked away without babysitting it. Labeling an action as human error is an unfortunately effective way at ending an investigation (root cause: human error).

Some folks try to make progress on the current status quo by arguing that, since human error is inevitable (people make mistakes!), it should be the beginning of the investigation, rather than the end. I respect this approach, but I’m going to take a more extreme view here: we can gain insight into how incidents happen, even those that involve operator actions as contributing factors, without reference to human error at all.

Since we human beings are physical beings, you can think of us as machines. Specifically, we are machines that make decisions and take action based on those decisions. Now, imagine that every decision we make involves our brain trying to maximize a function: when provided with a set of options, it picks the one that has the largest value. Let’s call this function g, for goodness.

Pushing button A has a higher value of g than pushing button B.

(The neuroscientist Karl Friston has actually proposes something similar as a theory: organisms make decisions to minimize model surprise, a construct that Friston calls free energy).

In this (admittedly simplistic) model of human behavior, all decision making is based on an evaluation of g. Each person’s g will vary based on their personal history and based on their current context: what they currently see and hear, as well as other factors such as time pressure and conflicting goals. “History” here is very broad, as g will vary based not only on what you’ve learned in the past, but also on physiological factors like how much sleep you had last night and what you ate for breakfast.

Under this paradigm, if one of the contributing factors in an incident was the user pushing “A” instead of “B”, we ask “how did the operator’s g function score a higher value for pushing A over B”? There’s no concept of “error” in this model. Instead, we can explore the individual’s context and history to get a better understanding of how their g function valued A over B. We accomplish this by talking to them.

I think the model above is much more fruitful than the one where we identify errors or mistakes. In this model, we have to confront the context and the history that a person was exposed to, because those are the factors that determine how decisions get made.

The idea of human error is a hard thing to shake. But I think we’d be better off if we abandoned it entirely.

Some additional reading on the idea of human error:

She Blinded Me With Science: A review of Galileo’s Middle Finger

The science we are taught in school is nice and neat. However, the realities of scientific research, like all human endeavors, is messy, and has its share of controversies. There are two flavors of scientific controversy. There’s the political type of controversy, where people who are not part of the scientific community feel very strongly about the implications of the scientific theories: think climate change, or the Scopes Trial. Then there are controversies within a scientific community about theories. For example, the theory of plate tectonics was so controversial among geologists when it was proposed that it was considered pseudo-science.

Alice Dreger plants herself firmly in the intersection of political and scientific controversy. The book is a first-hand account of her experiences as an activist among various episodes of controversy. Here she’s defending an anthropologist from false accusations of deliberately harming the native Yanomamö people of South America, there she’s crusading against a medical researcher treating pregnant women with an off-label drug, as part of experimental research, without properly gathering informed consent.

The tragedy is that Dreger, a trained historian, isn’t able to tell a story effectively. Reading the book feels like listening to a teenager recounting interpersonal dramas going on at school. Her style is a strict linear account of the events from her perspective, but that doesn’t help the reader make sense of the events that’s going on. It’s too much chronology rather than narrative: “this happened, then that happened, then the other thing happened.” She loses the forest for the trees.

The result is a book about a fascinating topic, scientific controversies that intersect with politics, turns out to be a slog.

Bad Religion: A review of Work Pray Code

When I worked as a professor at the University of Nebraska—Lincoln, after being there for a few months, during a conversation with the chair of the computer science department he asked me “have you found a church community yet?” I had not. I had, however, found a synagogue. The choice wasn’t difficult: there were only two. Nobody asked me a question like that after I moved to San Jose, which describes itself as the heart of Silicon Valley.

Why is Silicon Valley so non-religious is the question that sociologist Carolyn Chen seeks to answer here. As a tenured faculty member at UC Berkeley, Chen is a Bay Area resident herself. Like so many of us here, she’s a transplant: she grew up in Pennsylvania and Southern California, and first moved to the area in 2013 to do research on Asian religions in secular spaces.

Chen soon changed the focus of her research from Asian religions to the work culture of tech companies. She observes that people tend to become less religious when they move to the area, and are less engaged in their local communities. Tech work is totalizing, absorbing employees entire lives. Tech companies care for many of the physical needs of their employees in a way that companies in other sectors do not. Tech companies provide meditation/mindfulness (the companies use these terms interchangeably) to help their employees stay productive, but it is a neutered version of the meditation of its religious, Buddhist roots. Tech companies push up the cost of living, and provide private substitutes for public infrastructure, like shuttle busses.

Chen tries to weave these threads together into a narrative about how work substitutes for religion in the lives of tech workers in Silicon Valley. But the pieces just don’t fit together. Instead, they feel shoehorned in to support her thesis. And that’s a shame, because, as a Silicon Valley tech worker, many of the observations themselves ring true to my personal experience. Unlike Nebraska, Silicon Valley really is a very secular place, so much so that it was a plot point in an episode of HBO’s Silicon Valley. As someone who sends my children to religious school, I’m clearly in the minority at work. My employer provides amenities like free meals and shuttles. They even provide meditation rooms, access to guided meditations provided by the Mental Health Employee Resource Group, and subscriptions to the Headspace meditation app. The sky-high cost of living in Silicon Valley is a real problem for the area.

But Chen isn’t able to make the case that her thesis is the best explanation for this grab bag of observations. And her ultimate conclusion, that tech companies behave more and more like cults, just doesn’t match my own experiences working at a large tech company in Silicon Valley.

Most frustratingly, Chen doesn’t ever seem to ask the question, “are there other domains where some of these observations also hold?” Because so much of the description of the secular and insular nature of Silicon Valley tech workers applies to academics, the culture that Chen herself is immersed in!

Take this excerpt from Chen:

Workplaces are like big and powerful magnets that attract the energy of individuals away from weaker magnets such as families, religious congregations, neighborhoods, and civic associations—institutions that we typically associate with “life” in the “work-life” binary. The magnets don’t “rob” or “extract”—words that we use to describe labor exploitation. Instead they attract the filings, monopolizing human energy by exerting an attractive rather than extractive force. By creating workplaces that meet all of life’s needs, tech companies attract the energy and devotion people would otherwise devote to other social institutions, ones that, traditionally and historically, have been sources of life fulfillment.

Work Pray Code, p197

Compare this to an excerpt from a very different book: Robert Sommer’s sardonic 1963 book Expertland (sadly, now out of print), which describes itself as “an unrestricted inside view of the world of scientists, professors, consultants, journals, and foundations, with particular attention to the quaint customs, distinctive dilemmas, and perilous prospects”.

Experts know very few real people. Except for several childhood friends or close relatives, the expert does not know anybody who drives a truck, runs a grocery store, or is vice-president of the local Chamber of Commerce. His only connection with these people is in some kind of service relationship; they are not his friends, colleagues, or associates. The expert feel completely out of place at Lion’s or Fish and Game meeting. If he is compelled to attend such gatherings, he immediately gravitates to any other citizen of Expertland who is present… He has no roots, no firm allegiances, and nothing to gain or lose in local elections… Because he doesn’t vote in local elections, join service clubs, or own the house he lives in, outsiders often feel that the expert is not a good citizen.

Expertland pp 2-3

Chen acknowledges that work is taking over the lives of all high-skilled professionals, not just tech workers. But I found work-life balance to be much worse in academia than at a Silicon Valley tech company! To borrow a phrase from the New Testament, And why beholdest thou the mote that is in thy brother’s eye, but considerest not the beam that is in thine own eye?

We value possession of experience, but not its acquisition

Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.

We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.

Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.

Yet, while we recognize the value of experienced candidates when we do hiring, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these sorts of experiences is what increases our expertise. But you have to reflect on them to maximize the value, and you have to share this information out to the organization so that it isn’t just the incident responders that can benefit from the experience.

To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.

Software engineering in-the-large: the coordination challenge

Back when I was an engineering student, I wanted to know “How do the big companies develop software? How does it happen in the real world?”

Now that I work at a company that has to do large-scale software development, I understand better why it’s not something you can really teach effectively in a university setting. It’s not that companies doing large-scale software development are somehow better at writing software than companies that work on smaller-scale software projects. It’s that large-scale projects face challenges that small-scale projects don’t.

The biggest challenge at large-scale is coordination. My employer provides a single service, which means that, in theory, any project that anyone is working on inside of the company could potentially impact what anybody else is working on. In my specific case, I work on delivery tools, so we might be called upon to support some new delivery workflow.

You can take a top-down command-and-control style approach to the problem, by having the people at the top attempting to filter all of the information to just what they need, and them coordinating everyone hierarchically. However, this structure isn’t effective in dynamic environments: as the facts on the ground change, it takes too long for information to work its way up the hierarchy, adapt, and then change the orders downwards.

You can take a bottoms-up approach to the problem where you have a collection of teams that work autonomously. But the challenge there is getting them aligned. In theory, you hire people with good judgment, and provide them with the right context. But the problem is that there’s too much context! You can’t just firehose all of the available information to everyone, that doesn’t scale: everyone will spend all of their time reading docs. How do you get the information into the heads of the people that need it? becomes the grand challenge in this context.

It’s hard to convey the nature of this problem in a university classroom, if you’ve never worked in a setting like this before. The flurry of memos, planning documents, the misunderstandings, the sync meetings, the work towards alignment, the “One X” initiatives, these are all things that I had to experience viscerally, on a first-hand basis, to really get a sense of the nature of the problem.

OR, DevOps, and LFI

I’m currently reading Systems, Experts, and Computers: the Systems Approach in Management and Engineering, World War II and After. The book is really more of a collection of papers, each written by a different author. This post is about the second chapter, The Adoption of Operations Research in the United States During World War II, written by Erik Rau.

During World War II, the British and the Americans were actively investing in developing radar technology in support of the war effort, with applications such as radar-based air defense. It turned out that developing the technology itself wasn’t enough: the new tech had to be deployed and operated effectively in the field to actually serve its purpose. Operating these systems required coordination between machines, human operators, and the associated institutions.

The British sent out scientists and engineers into the field to study and improve how these new systems were used. To describe this type of work, the physicist Albert Rowe coined the term operational research (OR), to contrast it with developmental research.

After the surprise attack on Pearl Harbor, the U.S. Secretary of War tapped the British radar pioneer Robert Wattson-Watt to lead a study on the state of American air defense systems. In the report, Wattson-Watt describe U.S. air defense as “insufficient organization applied to technically inadequate equipment used in exceptionally difficult conditions“. The report suggested the adoption of the successful British technology and techniques, which included OR.

At this time, the organization responsible for developmental research into weapons systems was the Office of Scientific Research and Development (OSRD), headed by Vannevar Bush. While a research function like OR seemed like it should belong under OSRD, there was a problem: Bush didn’t want it there. He wanted to protect the scientific development work of his organization from political interference by the military, and so he sought to explicitly maintain a boundary between the scientists and engineers that were developing the technology, and the military that was operating it.

[The National Defense Research Committee] is concerned with the development of equipment for military use, whereas these [OR] groups are concerned with the analysis of its performance, and the two points of view do not, I believe, often mix to advantage.

Vannevar Bush, quoted on p69

In the end, the demand for OR was too great, and Bush relented, creating the Office of Field Service within the OSRD.

Two things struck me reading this chapter. The first was that operational research was a kind of proto-DevOps, a recognition of the need to create a cultural shift in how development and operations work related to each other, and the importance of feedback between the two groups. It was fascinating to see the resistance to it from Bush. He wasn’t opposed to OR itself, he was opposed to unwanted government influence, which drove his efforts to keep development and operations separate.

The second thing that struck me was this idea of OR being doing research on operations. I had always thought of OR as being about things like logistics, basically graph theory problems addressed by faculty who work in business schools instead of computer science departments. But, here, OR was sending researchers into the field to study how operations was done in order to help improve development and operations. This reminded me very much of the aims of the learning from incidents (LFI) movement.

Prussia meets Versailles: a review of Moral Mazes

Managers rarely speak of objective criteria for achieving success because once certain crucial points in one’s career are passed, success and failure seem to have little to do with one’s accomplishments.

p44

Almost all management books are prescriptive: they’re self-help books for managers. Moral Mazes is a very different kind of management book. Where most management books are written by management gurus, this one is written by a sociologist. This book is the result of a sociological study that the author conducted at three U.S. companies in the 1980s: a large textile firm, a chemical company, and a large public relations agency. He was interested in understanding the ethical decision-making process of American managers. And the picture he paints is a bleak one.

American corporations are organized into what Jackall calls patrimonial bureaucracies. Like the Prussian state, a U.S. company is organized as a hierarchy, with a set of bureaucratic rules that binds all of the employees. However, like a monarchy, people are loyal to individuals rather than offices. Effectively, it is a system of patronage, where leadership doles out privileges. Like in the court of King Louis XIV, factions within the organization jockey to gain favor.

With the exception of the CEO, all of the managers are involved in both establishing the rules of the game, and are bound by the rules. But, because the personalities of leadership play a strong role, and because leadership often changes over time, the norms are always contingent. When the winds change, the standards of behavior can change as well.

Managers are also in a tough spot because they largely don’t have control over the outcomes on which they are supposed to be judged. They are typically held responsible for hitting their numbers, but luck and timing play an enormous role over whether they are able to actually meet their objectives. As a result, managers are in a constant state of anxiety, since they are forever subject to the whims of fate. Failure here is socially defined, and the worst outcome is to be in the wrong place at the wrong time, and to have your boss say “you failed”.

Managers therefore focus on what they can control, which is the image they project. They put in a lot of hours, so they appear to be working hard. They strive to be seen as someone who is a team player, who behaves predictably and makes other managers feel comfortable. To stand out from their peers, they have to have the right style: the ability to relate to other people, to sell ideas, appear in command. To succeed in this environment, a manager needs social capital as well as the ability to adapt quickly as the environment changes.

Managers commonly struggle with decision making. Because the norms of behavior are socially defined, and because these norms change over time, they are forever looking to their peers to identify what the current norms are. Compounding the problem is the tempo of management work: because a manager’s daily schedule is typically filled with meetings and interrupts, with only fragmented views of problems being presented, there is little opportunity to gain a full view of problems and reflect deeply on them.

Making decisions is dangerous, and managers will avoid it when possible, even if this costs the organization in the long run. Jackall tells an anecdote about a large, old battery at a plant. The managers did not want to be on the hook for the decision to replace it, and so problems with it were patched up. Eventually, it failed completely, and the resulting cost to replace it and to deal with costs related to EPA violations and lawsuits was over $100M in 1979 dollars. And yet, this was still rational decision-making on behalf of the managers, because it was a risk for them in the short-term to make the decision to replace the battery.

Ethical decision making is particularly fraught here. Leadership wants success without wanting to be bothered with the messy details of how that success is achieved: a successful middle manager shields leadership from the details. Managers don’t have a professional ethic in the way that, say, doctors or lawyers do. Ethical guidelines are situational, they vary based on changing relationships. Expediency is a virtue, and a good manager is one who is pragmatic about decision making.

All moral issues are transmuted into practical concerns. Arguing based on morality rather than pragmatism is frowned upon, because moral arguments compel managers to act, and they need to be able to take stock of the social environment in order to judge whether a decision would be appropriate. Effective managers use social cues to help make decisions. They conform to what Jackall calls institutional logic: the ever-changing set of rules and incentives that the culture creates and re-creates to keep people’s perspectives and behaviors consistent and predictable.

There comes a time in every engineer’s career when you ask yourself, “do I want to go into management?” I’ve flirted with the idea in the past, but ultimately came down on the “no” side. After reading Moral Mazes, I’m more confident than ever that I made the right decision.