Resilience requires helping each other out

A common failure mode in complex systems is that some part of the system hits a limit and falls over. In the software world, we call this phenomenon resource exhaustion, and a classic example of this is running out of memory.

The simplest solution to this problem is to “provision for peak”: to build out the system so that it always has enough resources to handle the theoretical maximum load. Alas, this solution isn’t practical: it’s too expensive. Even if you manage to overprovision the system, over time, it will get stretched to its limits. We need another way to mitigate the risk of overload.

Fortunately, it’s rare for every component of a system to reach its limit simultaneously: while one component might get overloaded, there are likely other components that have capacity to spare. That means that if one component is in trouble, it can borrow resources from another one.

Indeed, we see this sort of behavior in biological systems. In the paper Allostasis: A Model of Predictive Regulation, the neuroscientist Peter Sterling explains why allostasis is a better theory than homeostasis. Readers are probably familiar with the term homeostasis: it refers to how your body keeps certain variables within a narrow range, like holding your body temperature around 98.6°F. Allostasis, on the other hand, is about how your body predicts where these levels should be, based on anticipated need, and then acts to bring them there. Here’s Sterling explaining why he thinks allostasis is superior, referencing the idea of borrowing resources across organs (emphasis mine):

A second reason why homeostatic control would be inefficient is that if each organ self-regulated independently, opportunities would be missed for efficient trade-offs. Thus each organ would require its own reserve capacity; this would require additional fuel and blood, and thus more digestive capacity, a larger heart, and so on – to support an expensive infrastructure rarely used. Efficiency requires organs to trade-off resources, that is, to grant each other short-term loans.

The systems we deal with are not individual organisms, but organizations made up of groups of people. In organizations, this sort of resource borrowing becomes more complex: incentives in the system might make me less inclined to lend you resources, even if doing so would lead to better outcomes for the overall system. In his paper The Theory of Graceful Extensibility: Basic rules that govern adaptive systems, David Woods borrows the term reciprocity from Elinor Ostrom to describe this property of a system, the willingness of one agent to lend resources to another, and identifies it as a necessary ingredient for resilience (emphasis mine):

Will the neighboring units adapt in ways that extend the [capacity for maneuver] of the adaptive unit at risk? Or will the neighboring units behave in ways that further constrict the [capacity for maneuver] of the adaptive unit at risk? Ostrom (2003) has shown that reciprocity is an essential property of networks of adaptive units that produce sustained adaptability.

I couldn’t help thinking of the Sterling and Woods papers when reading the latest issue of Nat Bennett’s Simpler Machines newsletter, titled What was special about Pivotal? Nat’s answer is reciprocity:

This isn’t always how it went at Pivotal. But things happened this way enough that it really did change people’s expectations about what would happen if they co-operated – in the game theory, Prisoner’s Dilemma sense. Pivotal was an environment where you could safely lead with co-operation. Folks very rarely “defected” and screwed you over if you led by trusting them.

People helped each other a lot. They asked for help a lot. We solved a lot of problems much faster than we would have otherwise, because we helped each other so much. We learned much faster because we helped each other so much.

And it was generally worth it to do a lot of things that only really work if everyone’s consistent about them. It was worth it to write tests, because everyone did. It was worth it to spend time fixing and removing flakes from tests, because everyone did. It was worth it to give feedback, because people changed their behavior. It was worth it to suggest improvements, because things actually got better.

There was a lot of reciprocity.

Nat’s piece is a good illustration of the role that culture plays in enabling a resilient organization. I suspect it’s not possible to impose this sort of culture; it has to be fostered. I wish this were more widely appreciated.

Treating uncertainty as a first-class concern

One of the things that complex incidents have in common is the uncertainty that responders experience while the incident is happening. Something is clearly wrong; that’s why an incident is happening. But it’s hard to make sense of the failure mode, to pin down precisely what the problem is, based only on the signals we can observe directly.

Eventually, we figure out what’s going on, and how to fix it. By the time the incident review rolls around, while we might not have a perfect explanation for every symptom that we witnessed during the incident, we understand what happened well enough that the in-the-moment uncertainty is long gone.

Cooperative Advocacy: An Approach for Integrating Diverse Perspectives in Anomaly Response is a paper by Jennifer Watts-Englert and David Woods that compares two incidents involving NASA space shuttle missions: one successful and the other tragic. The paper discusses strategies for dealing with what the authors call anomaly response, the work of responding when something unexpected has happened. The authors describe a process they observed, which they call cooperative advocacy, for effectively dealing with uncertainty during an incident. They document how cooperative advocacy was applied in the successful NASA mission, and how it was not applied in the tragic one.

It’s a good paper (it’s on my list!), and SREs and anyone else who deals with incidents will find it highly relevant to their work. For example, here’s a quote from the paper that I immediately connected with:

For anomaly response to be robust given all of the difficulties and complexities that can arise, all discrepancies must be treated as if they are anomalies to be wrestled with until their implications are understood (including the implications of being uncertain or placed in a difficult trade-off position). This stance is a kind of readiness to re-frame and is a basic building block for other aspects of good process in anomaly response. Maintaining this as a group norm is very difficult because following up on discrepancies consumes resources of time, workload, and expertise. Inevitably, following up on a discrepancy will be seen as a low priority for these resources when a group or organization operates under severe workload constraints and under increasing pressure to be “faster, better, cheaper”.

(See one of my earlier blog posts, chasing down the blipperdoodles).

But the point of this blog post isn’t to summarize this specific paper. Rather, it’s to call attention to the fact that anomaly response is a problem that we will face over and over again. Too often, we dismiss the anomaly we just faced in an incident as a weird, one-off occurrence. And while that specific failure mode likely will be a one-off, we’ll be faced with new anomalies in the future.

This paper treats anomaly response as a first-class entity, as a thing we need to worry about on an ongoing basis, as something we need to be able to get better at. We should do the same.

Missing the forest for the trees: the component substitution fallacy

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.

Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt from his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so we make tradeoffs, judgments about where to best spend our time as we build, test, and roll out our software. The riskier we perceive a change to be, the more effort we’ll spend on validating and rolling it out.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components, then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.

Good category, bad category (or: tag, don’t bucket)

The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion; and to the very end of life, our location of all things in one space is due to the fact that the original extents or bignesses of all the sensations which came to our notice at once, coalesced together into one and the same space.

William James, The Principles of Psychology (1890)

I recently gave a talk at the Learning from Incidents in Software conference. On the one hand, I talked about the importance of finding patterns in incidents.

But I also had some… unkind words about categorizing incidents.

We humans need categories to make sense of the world, to slice it up into chunks that we can handle cognitively. Otherwise, the world would just be, as James put it in the quote above, one great blooming, buzzing confusion. So, categorization is essential to humans functioning in the world. In particular, if we want to identify meaningful patterns in incidents, we need to do categorization work.

But there are different techniques we can use to categorize incidents, and some techniques are more productive than others.

The buckets approach

An incident must be placed in exactly one bucket

One technique is what I’ll call the buckets approach of categorization. That’s when you define a set of categories up front, and then assign each incident that happens to exactly one bucket. For example, you might have categories such as:

  • bug (new)
  • bug (latent)
  • deployment problem
  • third party
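
To make this concrete, here is a minimal sketch in Python (purely hypothetical; the Bucket and Incident names are mine, not from any real tooling) of what the bucket approach amounts to as a data model: each incident carries exactly one value drawn from a fixed, predefined set.

```python
# Minimal sketch of the buckets approach: one predefined category per incident.
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    BUG_NEW = "bug (new)"
    BUG_LATENT = "bug (latent)"
    DEPLOYMENT_PROBLEM = "deployment problem"
    THIRD_PARTY = "third party"


@dataclass
class Incident:
    title: str
    bucket: Bucket  # exactly one bucket, chosen up front


# An incident that involves both a new bug and a deployment problem still has
# to land in a single bucket, discarding one of those dimensions.
incident = Incident(
    title="New bug deployed; rollback made things worse",
    bucket=Bucket.DEPLOYMENT_PROBLEM,
)
```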

I have seen two issues with the bucketing approach. The first is that I’ve never actually seen it yield any additional insight. It can’t reveal new patterns, because the patterns have already been predefined as the buckets. The best it can do is give you a filter for drilling down into a subset of related incidents in more detail. There’s some genuine value in that, but in practice I’ve rarely seen anybody actually do the harder work of examining those subsets.

The second issue is that incidents, being messy things, often don’t fall cleanly into exactly one bucket. Sometimes they fall into multiple buckets, sometimes they don’t fall into any, and sometimes it’s just really unclear. For example, an incident may involve both a new bug and a deployment problem (as anyone who has accidentally deployed a bug to production and then gotten into even more trouble when trying to roll it back can tell you). By requiring you to put the incident into exactly one bucket, the bucket approach forces you to discard information that is potentially useful for identifying patterns. This inevitably leads to arguments about whether an incident should be classified into bucket A or bucket B, and that sort of argument is a symptom that the approach is throwing away useful information, and that the incident really shouldn’t go into a single bucket at all.

The tags approach

You may hang multiple tags on an incident

Another technique is what I’ll call the tags approach of categorization. With tags, you annotate an incident with zero, one, or multiple tags. The idea behind tagging is that you let the details of the incident help you come up with meaningful categories: as incidents happen, you may come up with entirely new categories, coalesce existing ones, or split them apart. Tags also let you examine incidents across multiple dimensions. Perhaps you’re interested in attributes of the responders (maybe there’s a hero tag for the frequent hero who swoops in on many incidents), or there’s production pressure related to some new feature being developed (in which case you may want to label the incident with both production-pressure and feature-name), or it’s related to migration work (migration). There are many possible dimensions. Here are some examples of potential tags:

  • query-scope-accidentally-too-broad
  • people-with-relevant-context-out-of-office
  • unforeseen-performance-impact

Those example tags may seem weirdly specific, but that’s OK! The tags might be very high level (e.g., production-pressure) or very low level (e.g., pipeline-stopped-in-the-middle), or anywhere in between.
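
For contrast, here is the same kind of minimal, hypothetical sketch for the tags approach (again, the names are mine): an incident carries a set of zero or more tags, new tags can be invented at any time, and counting tags across incidents is one simple way to start hunting for patterns.

```python
# Minimal sketch of the tags approach: zero or more labels per incident,
# and the set of labels can grow and change as new incidents come in.
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    tags: set[str] = field(default_factory=set)


incidents = [
    Incident("New bug deployed; rollback made things worse",
             {"bug-new", "deployment", "misleading-operational-ux"}),
    Incident("Report query scanned far more data than intended",
             {"query-scope-accidentally-too-broad",
              "unforeseen-performance-impact"}),
    Incident("Migration cutover during a feature crunch",
             {"migration", "production-pressure"}),
]

# Revisiting an old writeup and adding a newly invented tag is cheap.
incidents[0].tags.add("people-with-relevant-context-out-of-office")

# Tag frequencies across incidents are one simple starting point for
# spotting patterns worth a closer look.
print(Counter(tag for i in incidents for tag in i.tags).most_common())
```

The point of the sketch is only that nothing in the data model forces a single choice: the categorization can stay as messy and multi-dimensional as the incidents themselves.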

Top-down vs circular

The bucket approach is strictly top-down: you impose a categorization on incidents from above. The tags approach is a mix of top-down and bottom-up. When you start tagging, you’ll always begin with some prior model of the types of tags you think are useful. As you go through the details of incidents, new ideas for tags will emerge, and you’ll end up revising your set of tags over time. Someone might revisit the writeup for an incident that happened years ago and add a new tag to it. This process of tagging incidents and identifying potential new tags will help you identify interesting patterns.

The tag-based approach is messier than the bucket-based one, because your collection of tags may be very heterogeneous, and you’ll still encounter situations where it’s not clear whether a tag applies or not. But it will yield a lot more insight.

Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk:

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context is Woods describing the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident has to process a large number of incoming signals. One example of such signals is the stream of Slack messages appearing in the incident channel as responders and others provide updates and ask questions. Another example is additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to go through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication might be officially frowned upon (“all incident-related communication should happen in the incident response channel!”), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.

Southwest Airlines: a case study in brittleness

What happens to a complex system when it gets pushed just past its limits?

In some cases, the system in question is able to change its own limits so that it can handle the new stressors thrown at it. When a system is pushed beyond its design limit, it has to change the way that it works: it needs to adapt its own processes to operate in a different way.

We use the term resilience to describe the ability of a system to adapt how it does its work, and this is what resilience engineering researchers study. These researchers have identified multiple factors that foster resilience. For example, people on the front lines of the system need autonomy to change the way they work, and they also need to coordinate effectively with others in the system. A system under stress inevitably needs access to additional resources, which means there needs to be extra capacity held in reserve. People need to be able to anticipate trouble ahead, so that they can prepare to change how they work and deploy that extra capacity.

However, there are cases when systems fail to adapt effectively when pushed just beyond their limits. These systems face what Woods and Branlat call decompensation: they exhaust their ability to keep up with the demands placed on them, and their performance falls sharply off a cliff. This behavior is the opposite of resilience, and researchers call it brittleness.

The ongoing problems facing Southwest Airlines provide us with a clear example of brittleness. External factors such as the large winter storm pushed the system past its limit, and it was not able to compensate effectively in the face of these stressors.

There are many reports in the media now about different factors that contributed to Southwest’s brittleness. I think it’s too early to treat these as definitive; a proper investigation will likely take weeks, if not months. When the investigation is finally complete, I’m sure additional factors will be identified that haven’t been reported on yet.

But one thing we can be sure of at this point is that Southwest Airlines fell over when pushed beyond its limits. It was brittle.

You’re just going to sit there???

Here’s a little story about something that happened last year.

A paging alert fires for a service that a sibling team manages. I’m the support on-call, meaning that I answer support questions about the delivery engineering tooling. That means my only role here is to communicate with internal users about the ongoing issue. Since I don’t know this service at all, there isn’t much else for me to do: I’m just a bystander, watching the Slack messages from the sidelines.

The operations on-call acknowledges the page and starts digging to figure out what’s gone wrong. As he investigates, he provides updates about his progress by posting Slack messages to the on-call channel. At one point, he types this message:

Anyway… we’re dead in the water until this figures itself out.

I’m… flabbergasted. He’s just going to sit there and hope that the system becomes healthy again on its own? He’s not even going to try to remediate? Much to my relief, after a few minutes, the service recovers.

Talking to him the next day, I discovered that he had taken a remediation action: he failed over a supporting service from the primary to the secondary. His comment was referring to the fact that the service was going to be down until the failover completed. Once the secondary became the new primary, things went back to normal.

When I looked back at the Slack messages, I noticed that he had written messages communicating that he was failing over the primary. He had also mentioned that his initial attempt at failover didn’t work because the operational UX was misleading. What had happened was that I misinterpreted his Slack message: I thought his attempt to fail over had simply failed outright, and that he was out of ideas.

Communicating effectively over Slack during a high-tempo event like an incident is challenging. It can be especially difficult if you don’t have a prior working relationship with the people in the ad-hoc incident response team, which can happen when an incident spans multiple teams. Communicating well during an incident is a skill, for individuals and for organizations as a whole, and it’s one I think we don’t pay enough attention to.

We value possession of experience, but not its acquisition

Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.

We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.

Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.

Yet, while we recognize the value of experienced candidates when hiring, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these sorts of experiences is what increases our expertise. But you have to reflect on them to maximize the value, and you have to share what you learn with the rest of the organization, so that it isn’t only the incident responders who benefit from the experience.

To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.

The Howie Guide: How to get started with incident investigations

Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.

There simply weren’t any publications targeted at the infotech industry about how to get started with incident investigations. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.

The folks at Jeli have stepped up to the challenge. They just released Howie: The Post-Incident Guide.

Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here”, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team.)

Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at and what questions you ask will depend on what you’ve learned so far. But there are also commonalities across investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.

When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.