StaffEng podcast

I had fun being a guest on the StaffEng podcast.

What’s allowed to count as a cause?

Imagine a colleague comes to you and says, “I’m doing the writeup for a recent incident. I have to specify causes, but I’m not sure which ones to put. Which of these do you think I should go with?”

Engineer entered incorrect configuration value. The engineer put in the wrong config value, which led to the critical foo system to return error responses.
Non-actionable alerts. The engineer who specified the config had just come off an on-call shift where they had to deal with multiple alerts that fired the night before. All of those alerts turned out to be non-actionable. The engineer was tired the day they put in the configuration change. Had the engineer not had to deal with these alerts, they would have been sharper the next day, and likely would have spotted the config problem.
Work prioritization. An accidentally incorrect configuration value was a known risk to the team, and they had been planning to build in some additional verification to guard against these sorts of configuration values. But this work was de-prioritized in favor of work that supported a high-priority feature to the business, which involved coordination across multiple teams. Had the work not been de-prioritized, there would have been guardrails in place that would have prevented the config change from taking down the system.
Power dynamics. The manager of the foo team had asked leadership for additional headcount, to enable the team to do both the work that was high priority to the business and to work on addressing known risks. However, the request was denied, and the available headcount was allocated to other teams, based on the perceived priorities of the business. If the team manager had had more power in the org, they would have been able to acquire the additional resources and address the known risks.

There’s a sense in which all of these can count as causes. If any of them weren’t present, the incident wouldn’t have happened. But we don’t see them the same way. I can guarantee that you’re never going to see power dynamics listed as a cause in an incident writeup, public or internal.

The reason is not that “incorrect configuration value” is somehow objectively more causal than power dynamics. Rather, the sorts of things that are allowed to be labelled as causes depends on the cultural norms of an organization. This is what people mean when they say that causes are socially constructed.

And who gets to determine what’s allowed to be labelled as a cause and what isn’t is itself a property of power dynamics. Because the things that are allowed to be called causes are things that an organization is willing to label as a problem, which means that it’s something that can receive organizational attention and resources in order to be addressed.

Remember this the next time you identify a contributing factor in an incident and somebody responds with, “that’s not why the incident happened.” That isn’t an objective statement of fact. It’s a value judgment about what’s permitted to be identified as a cause.

The greedy exec trap

Just listening to this experience was so powerful. It taught me to challenge the myth of “commercial pressure”. We tend to think that every organizational problem is the result of cost-cutting. Yet, the cost of a new drain pump was only $90… [that’s] nothing when you are running a ship. As it turned out, the purchase order had landed in the wrong department.
Nippin Anand, Deep listening — a personal journey

Whenever we read about a public incident, a common pattern in the reporting is that the organization under-invested in some area, and that can explain why the incident happened. “If only the execs hadn’t been so greedy”, we think, “if they had actually invested some additional resources, this wouldn’t have happened!”

What was interesting as well around this time is that when the chemical safety board looked at this in depth, they found a bunch of BP organizational change. There were a whole series of reorganizations of this facility over the past few years that had basically disrupted the understandings of who was responsible for safety and what was that responsibility. And instead of safety being some sort of function here, it became abstracted away into the organization somewhere. A lot of these conditions, this chained-closed outlet here, the failure of the sensors, problems with the operators and so on… all later on seemed to be somehow brought on by the rapid rate of organizational change. And that safety had somehow been lost in the process.
Richard Cook, Process tracing, Texas City BP explosion, 2005, Lectures on the study of cognitive work

Production pressure is an ever-present risk. his is what David Woods calls faster/better/cheaper pressure, a nod to NASA policy. In fact, if you follow the link to the Richard Cook lecture, right after talking about the BP reorgs, he discusses the role of production pressure in the Texas City BP explosion. However, production pressure is never the whole story.

Remember the Equifax breach that happened a few years ago? Here’s a brief timeline I extracted from the report:

Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners should patch within 48 hours
???
Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app

The Equifax vulnerability management team sent out a notification about the Struts vulnerability a day after they received notice about it. But, as in the two cases above, something got lost in the system. What I wondered reading the report was: How did that notification get lost? Were the engineers who operate the ACIS app not on that mailing list? Did they receive the email, but something kept them from acting on it? Perhaps there was nobody responsible for security patches of the app at the time the notification went out? Maddeningly, the report doesn’t say. After reading that report, I still feel like I don’t understand what it was about the system that enabled the notification to get lost.

It’s so easy to explain an incident by describing how management could have prevented it from investing additional resources. This is what Nippin Anand calls the myth of commercial pressure. It’s all too satisfying for us to identify short-sighted management decisions as the reason that an incident happened.

I’m calling this tendency the greedy exec trap, because once we identify the cause of an incident as greedy executives, we stop asking questions. We already know why the incident happened, so what more do we need to look into? What else is there to learn, really?

Trouble during startup

I asked this question on twitter today:

Have you ever encountered an incident where one of the contributing factors was service behavior on startup?
— @norootcause@hachyderm.io on mastodon (@norootcause) July 11, 2021

I received a lot of great responses.

My question was motivated by this lecture by Dr. Richard Cook about an explosion at a BP petroleum processing plant in Texas in 2005 that killed fifteen plant workers. Here’s a brief excerpt, starting at 16:31 of the video:

Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.

This is yet another reminder of how similar we are to other fields that control processes.

Controlling a process we don’t understand

I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.

As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection I actually think that’s the wrong question to ask. Instead, we should consider the fields there where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work involves fields such as spaceflight, aviation, maritime, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.

We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.

Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.

Month: July 2021