Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk:

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context of it is describing the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, when I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident have to process a large number of signals that are coming in. One example of such signals is the large number of Slack messages appearing in the incident channel as incident responders and others provide updates and ask questions. Another example would be additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to mediate through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication would be officially sanctioned (“all incident-related communication should happen in the incident response channel!“), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.

Southwest airlines: a case study in brittleness

What happens to a complex system when it gets pushed just past its limits?

In some cases, the system in question is able to actually change its limits, so it can handle the new stressors that get thrown at it. When a system is pushed beyond its design limit, it has to change the way that it works. The system needs to adapt its own processes to work in a different way.

We use the term resilience to describe the ability of a system to adapt how it does its work, and this is what resilience engineering researchers study. These researchers have identified multiple factors that foster resilience. For example, people on the front lines of the system need autonomy to be able to change the way they work, and they also need to coordinate effectively with others in the system. A system under stress inevitably need access to additional resources, which means that there needs to be extra capacity that was held in reserve. People need to be able to anticipate trouble ahead, so that they can prepare to change how they work and deploy the extra capacity.

However, there are cases when systems fail to adapt effectively when pushed just beyond their limits. These systems face what Woods and Branlat call decompensation: they exhaust their ability to keep up with the demands placed on them, and their performance falls sharply off of a cliff. This behavior is the opposite of resilience, and researchers call it brittleness.

The ongoing problems facing Southwest Airlines provides us with a clear example of brittleness. External factors such as the large winter storm pushed the system past its limit, and it was not able to compensate effectively in the face of these stressors.

There are many reports coming out of the media now about different factors that contributed to Southwest’s brittleness. I think it’s too early to treat these as definitive. A proper investigation will likely take weeks if not months. When the investigation finally gets completed, I’m sure there will be additional factors identified that haven’t been reported on yet.

But one thing we can be sure of at this point is that Southwest Airlines fell over when pushed beyond its limits. It was brittle.

You’re just going to sit there???

Here’s a little story about something that happened last year.

A paging alert fires for a service that a sibling team manages. I’m the support on-call, meaning that I answered support questions about the delivery engineering tooling. That means my only role here is to communicate with internal users about an ongoing issue. Since I don’t know this service at all, there isn’t much else for me to do: I’m just a bystander, watching the Slack messages from the sidelines.

The operations on-call he acknowledges the page and starts digging to figure out what’s gone wrong. As he’s investigating, he’s providing updates about his progress by posting Slack messages to the on-call channel. At one point, he types this message:

Anyway… we’re dead in the water until this figures itself out.

I’m… flabbergasted. He’s just going to sit there and hope that the system becomes healthy again on its own? He’s not even going to try and remediate? Much to my relief, after a few minutes, the service recovered.

Talking to him the next day, I discovered that he had taken a remediation action: he failed over a supporting service from the primary to the secondary. His comment was referring to the fact that the service was going to be down until the failover completed. Once the secondary became the new primary, things went back to normal.

When I looked back at the Slack messages, I noticed that he had written messages to communicate that he was failing over the primary. But he had also mentioned that his initial attempt at failover didn’t work, as the operational UX was misleading. What happened was that I had misinterpreted the Slack message. I thought his attempt to fail over had simply failed entirely, and he was out of ideas.

Communicating effectively over Slack during a high-tempo event like an incident is challenging. It can be especially difficult if you don’t have a prior working relationship with the people in the ad-hoc incident response team, which can happen when an incident spans multiple teams. Getting better at communicating during an incident is a skill, both for individuals and organizations as a whole. It’s one I think we don’t pay enough attention to.

We value possession of experience, but not its acquisition

Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.

We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.

Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.

Yet, while we recognize the value of experienced candidates when we do hiring, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these sorts of experiences is what increases our expertise. But you have to reflect on them to maximize the value, and you have to share this information out to the organization so that it isn’t just the incident responders that can benefit from the experience.

To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.

The Howie Guide: How to get started with incident investigations

Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.

There simply weren’t any publications about how to get started with doing incident investigations which were targeted at the infotech industry. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.

The folks at Jeli have stepped up to the challenge. They just released Howie: The Post-Incident Guide.

Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here“, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team).

Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at it, what questions you ask, will depend on what you’ve learned so far. But there are also commonalities across all investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.

When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.

“What could we have done differently?”

During incident retrospective meetings, I’ve often heard someone ask: “What could we have done differently?” I don’t like this question, and so I never ask it.

A world that never was

I am a firm believer in the idea that the best way to get better at dealing with incidents is to understand how incidents actually happen. After an incident happens, I focus all of my energies on the understanding aspect, because the window of opportunity for studying the incident closes quickly.

Asking “what could we have done differently?” can’t teach us anything about how the incident happened, because it’s asking us to imagine an alternate reality where events unfolded differently. You can’t get a better understanding of why an incident responder took action X by imagining a world where the responder took action Y.

Instead of asking how it could have unfolded differently, you’ll learn a lot more about the incident if you try to understand the frame of mind of the incident responders. What did they see? What did they know at the time? What was confusing to them?

The future, not the past

I believe the question is well-intended, to help us prevent the incident from recurring. In that case, I think a better question would be something along the lines of: “If we encounter similar symptoms in a future incident, what actions should we take?” This sounds like the same question, but it’s not:

“If we encounter similar symptoms” introduces uncertainty into the exercise – the future incident may look like the last one, but it might be different with the same symptoms! When we ask about doing things differently in the past, it’s all too easy to forget about this uncertainty.

Uncertainty is one of the defining characteristics of an incident. The system is behaving in an unexpected way, and we don’t understand why! When we look back on an incident, we should focus on this uncertainty rather than elide it.

Another reason that imagining future scenarios is better that counterfactuals about past scenarios is that our system in the future is different from the one in the past. For example:

  • You may have made changes to the system in the wake of the last incident that prevents the incident from recurring in exactly the same way as before, so the question turns out to be moot.
  • You may have improved the operability of your system in some way (e.g., added an admin interface so you can make an API call instead of poking at the database), so that you have new actions you can take in the future that you couldn’t take in the past.

While I still probably wouldn’t ask this question (I want to spend all of my energy understanding the incident), I think it’s a much better question, because it gives us practice at anticipating future incidents.

OOPS writeups

A couple of people have asked me to share how I structure my OOPS write-ups. Here’s what they look like when I write them. This structure in this post is based on the OOPS template that has evolved over time inside of Netflix, with contributions from current and former members of the CORE team.

My personal outline looks like this (the bold sections are the ones that I include in every writeup)

  • Title
  • Executive summary
  • Background
  • Narrative description
    • Prologue
    • The trigger
    • Impact
    • Epilogue
  • Contributors/enablers
  • Mitigators
  • Risks
  • Challenges in handling

Title: OOPS-NNN: How we got here

Every OOPS I write up has the same title, “how we got here”. However, the name of the Google doc itself (different from the title) is a one-line summary, for example: “Server groups stuck in ‘deploying’ state”.

Executive summary

I start each write-up with a summary section that’s around three paragraphs. I usually try to capture:

  • When it happened
  • The impact
  • Explanation of the failure mode
  • Aspects about this incident that were particularly difficult

On <date>, from <start time> to <end time>, users were unable to <symptom>

The failure mode was triggered by an unintended change in <service> that led to <surprising behavior>.

The issue was made more difficult to diagnose/remediate due to a number of factors:

  • <first factor>
  • <second factor>

I’ll sometimes put the trigger in the summary, as in the example above. It’s important not to think of the trigger as the “root cause”. For example, if an incident involves TLS certificates expiring, then the trigger is the passage of time. I talk more about the trigger in the “narrative description” section below.

Background

It’s almost always the case that the reader will need to have some technical knowledge about the system in order to make sense of the incident. I often put in a background section where I provide just enough technical details to help the reader understand the rest of the writeup. Here’s an example background section:

Managed Delivery (MD) supports a GitOps-style workflow. For apps that are on Managed Delivery, engineers can make delivery-related changes to the app by editing a file in their app’s Stash repository called the delivery config.  

To support this workflow, Managed Delivery must be able to identify when a new commit has happened to the default branch of a managed app, and read the delivery config associated with that commit.

The initial implementation of this functionality used a custom Spinnaker pipeline for doing these imports. When an application was onboarded to Managed Delivery, newt would create a special pipeline named import-delivery-config. This pipeline was triggered by commits to the default branch, and would execute a custom pipeline stage that would retrieve the delivery config from Stash and push it to keel, the service that powers Managed Delivery.


This solution, while functional, was inelegant: it exposed an implementation detail of Managed Delivery to end-users, and made it more difficult for users to identify import errors. A better solution would be to have keel identify when commits happen to the repositories of managed apps and import the delivery config directly. This solution was implemented recently, and all apps previously using pipelines were automatically migrated to the native git integration. As will be revealed in the narrative, an unexpected interaction involving the native git integration functionality contributed to this OOPS.

Narrative description

The narrative is the heart of the writeup. If I don’t have enough time to do a complete writeup, then I will just do an executive summary and a narrative description, and skip all of the other sections.

Since the narrative description is often quite long (over ten pages, sometimes many more), I break it up into sections and sub-sections. I typically use the following top-level sections.

  • Prologue
  • The trigger
  • Impact
  • Epilogue

Prologue

In every OOPS I’ve ever written up, implementation decisions and changes that happen well before the incident play a key role in understanding how the system got into a dangerous state. I use the Prologue section to document these, as well as describing how those decisions were reasonable when they happened.

I break the prologue up into subsections, and I include timeline information in the subsection headers. Here are some examples of prologue subsection headers I’ve used (note: these are from different OOPS writeups).

  • New apps with delivery configs, but aren’t on MD (5 months before impact)
  • Implementing the git integration (4 months before impact)
  • Always using the latest version of a platform library (4 months before impact)
  • A successful <foo> plugin deployment test (8 days before impact)
  • A weekend fix is deployed to staging (4 days before impact)
  • Migrating existing apps (3-4 days before impact)
  • A dependency update eludes dependency locking (1 day before impact)

I often use foreshadowing in my prologue section writeups. Her are some examples:

It will be several months before keel launches its first Titus Run Job orca task. Until one of those new tasks fails, nobody will know that a query against orca for task status can return a payload that keel is incapable of deserializing.


The scope of the query in step 2 above will eventually interact with another part of the system, which will broaden the blast radius of the operational surprise. But that won’t happen for another five months.


Unknown at the time, this PR introduced two bugs:
1. <description of first bug>
2. <description of second bug>
Note that the first bug masks the second. The first bug will become apparent as soon as the code is deployed to production, which will happen in three days. The second bug will lay unnoticed for eleven days.

The trigger

The “trigger” section is the shortest one, but I like to have it as a separate section because it acts as what my colleague J. Paul Reed calls a “pivot point”, a crucial moment in the story of the incident. This section should describe how the system transitions into a state where there is actual customer impact. I usually end the trigger section with some text in red that describes the hazardous state that the system is now in.

Here’s an example of a trigger section:

Trigger: a submitted delivery config

On <date>, at <time>, <name> commits a change to their delivery config that populates the artifacts section. With the delivery config now complete, they submit it to Spinnaker, then point their browser at the environments view of the <app> app, where they can observe Spinnaker manage the app’s deployment.

When <name> submits their delivery config, keel performs the following events:

  1. receives the delivery config via REST API.
  2. deserializes the delivery config from YAML into POJOs.
  3. serializes the config into JSON objects.
  4. writes the JSON objects to the database.

At this point, keel has entered a bad state: it has written JSON objects into the resource table that it will not be able to deserialize. 

Impact

The impact section is the longest part of the narrative: it covers everything from the trigger until the system has returned to a stable state. Like the prologue section, I chunk it into subsections. These act as little episodes to make it easier for the reader to follow what’s happening.

Here are examples of some titles for impact subsections I’ve used:

  • User reports access denied on unpin
  • Pinning the library back
  • Maybe it’s gate?
  • Deploying the version with the library pinned back
  • Let’s try rolling back staging
  • Staging is good, let’s do prod
  • Where did the <X> headers go?
  • Rollback to main is complete
  • We’re stable, but why did it break?

For some incidents, I’ll annotate these headers with the timing, like I did in the prologue (e.g., “45 minutes after impact”).

Because so much of our incident coordination is over Slack these days, my impact section will typically have pasted screeenshots of Slack conversation snippets, interspersed with text. I’ll typically write some text that summarizes the interaction, and then paste a screenshot, e.g.:

<name> notes something strange in keel’s gradle.properties: it has multiple version parameters where it should only have one:

[Slack screenshot here]

The impact section is mostly written chronologically. However, because it is chunked into episodic subsections, sometimes it’s not strictly in chronological order. I try to emphasize the flow of the narrative over being completely faithful to the ordering of the events. The subsections often describe activities that are going on in parallel, and so describing the incident in the strict ordering of the events would be too difficult to follow.

Epilogue

I’ll usually have an epilogue section that documents work done in the wake of the incident. I split this into subsections as well. An example of a subsection: Fixing the dependency locking issue

Contributors/enablers

Here’s the guidance in the template for the contributors and enablers section:

Various contributors and enablers create vulnerabilities that remain latent in the system (sometimes for long periods of time). Think of these as things that had to be true in order for the incident to take place, or somehow made it worse.

This section is broken up into subsections, one subsection for each contributor. I typically write these at a very low-level of abstraction, where my colleague J. Paul Reed writes these at a higher level.

I think it’s useful to call the various contributors out explicitly because it brings home how complex the incident really was.

Here are some example subsection titles:

  • Violated assumptions about version strings
  • Scope of SQL query
  • Beans not scanned at startup after Titus refactor
  • Incomplete TitusClusterSpecDeserializer
  • Metadata field not populated for PublishedArtifact objects
  • Resilience4J annotations and Kotlin suspend functions
  • Transient errors immediately before deploying to staging
  • Artifact versioning complexity
  • Production pinned for several days
  • No attempts to deploy to production for several days
  • Three large-ish changes landed at about the same time 
  • Holidays and travel
  • Alerts focus on keel errors and resource checks

Mitigators

The guidance we give looks like this:

Which factors helped reduce the impact of this operational surprise?

Like the contributors/enablers section, this is broken up into subsections. Here are some examples of subsection titles:

  • RADAR alerts caught several issues in staging
  • <name> recognized Titus API refactor as a trigger for an issue in production
  • <name> quickly diagnoses artifact metadata issue
  • <name>’s hypothesis about transactions rolling back due to error
  • <name> recognized query too broad
  • <name> notices spike in actuations

Risks

Here’s the guidance for this section from the template:

Risks are items (technical architecture or coordination/team related) that created danger in the system. Did the incident reveal any new risks or reinforce the danger of any known risks? (Avoid hindsight bias when describing risks.)

The risks section is where I abstract up some of the contributors to identify higher-level patterns. Here are some example risk subsection titles:

  • Undesired mass actuation
  • Maintaining two similar things in the codebase
  • Problems with dynamic configuration that are only detectable at runtime
  • Plugins that violate assumptions in the main codebase
  • Not deploying to prod for a while

Challenges in handling

Here’s the guidance for this section from the template:

Highlight the obstacles we had to overcome during handling. Was there anything particularly novel, confusing, or otherwise difficult to deal with? How did we figure out what to do? What decisions were made? (Capturing this can be helpful for teaching others how we troubleshoot and improvise). 

In particular, were there unproductive threads of action? Capture avenues that people explored and mitigations that were attempted that did not end up being fruitful.

Sometimes it’s not clear what goes into a contributor and what goes into a challenge. You could put all of these into “contributors” and not write this section at all. However, I think it’s useful to call out what explicitly made the incident difficult to handle. Here are some example subsection headers:

  • Long time to diagnose and remediate
  • Limited signals for making sense of underlying problem
  • Error checking task status as red herring

Other sections

The template has some other sections (incident artifacts, follow-up items, timeline and links), but I often don’t include those in my own writeups. I’ll always do a timeline document as input for writing up the OOPS, and I will typically link it for reference, but I don’t expect anybody to read it. I don’t see the OOPS writeup as the right vehicle for tracking follow-up work, so I don’t put a section in it.

The danger of hidden functional roles

There’s a collection of friends that I have a standing videochat with every couple of weeks. We had been meeting at 8am, but several people developed conflicts at that time, including me. I have a teenager that starts school at 8am, and I’m responsible for getting them to school in the morning (I like to leave the house around 7:40am), which prevented me from participating.

As a group, we decided to reschedule the chat to 7am. This works well for me, because I get up at 6. Today was the first day meeting at the new scheduled time. I got up as I normally do, and was sure to be quiet so as not to wake my wife, Stacy; she sleeps later than I do, but she gets up early enough to rouse our kids for school. I even closed the bedroom door so that any noise I made from the videochat wouldn’t disturb her.

I was on the videochat, taking part in the conversation in hushed tones, when I looked over at the time. I saw it was 7:25am, which is about fifteen minutes before I start getting ready to leave the house. Usually, the rest of the household is up, showering, eating breakfast. But I hadn’t heard a peep from anyone. I went upstairs to discover that nobody else had gotten up yet.

It turns out that my typical morning routine was acting as a natural alarm clock for Stacy. My alarm goes off at 6am every weekday, and I get up, but Stacy stays in bed. However, the noises from my normal morning routine are the thing that rouse her, which is typically around 7am. Today, I was careful to be very quiet, and so she didn’t wake up. I didn’t know that I was functioning as an alarm clock for her! That’s why I was careful to be quiet, and why I didn’t even think to mention to her about the new videochat time.

I suspect this is failure mode is more common than we realize: there is a process inside a system, and over time the process comes to fulfill some unintended, ancillary functional role, and there are people who participate in this process that aren’t even aware of this function.

As an example, consider Chaos Monkey. Chaos Monkey’s intended function is to ensure that engineers design their services to withstand a virtual machine instance failing unexpectedly, by increasing their exposure to this failure mode. But Chaos Monkey also has the unintended effect of recycling instances. For teams that deploy very infrequently, their service might exhibit problems with long-lived instances that they never notice because Chaos Monkey tends to terminate instances before they hit those problems. Now imagine declaring an extended period of time where, in the interest of reducing risk(!), no new code is deployed, and Chaos Monkey is disabled.

When you turn something off, you never know what might break. In some cases, nobody in the system knows.

Plus c’est la même chose, plus ça change

I’m re-reading a David Woods’s paper titled the theory of graceful extensibility: basic rules that govern adaptive systems. The paper proposes a theory to explain how certain types of systems are able to adapt over and over again to changes in their environment. He calls this phenomenon sustained adaptability, which we contrasts with systems that can initially adapt to an environment but later collapse when some feature of the environment changes and they fail to adapt to the new change.

Woods outlines six requirements that any explanatory theory of sustained adaptability must have. Here’s the fourth one (emphasis in the original):

Fourth, a candidate theory needs to provide a positive means for a unit at any scale to adjust how it adapts in the pursuit of improved fitness (how it is well matched to its environment), as changes and challenges continue apace. And this capability must be centered on the limits and perspective of that unit at that scale.

The phrase adjust how it adapts really struck me. Since adaptation is a type of change, this is referring to a second-order change process: these adaptive units have the ability to change the mechanism by which they change themselves! This notion reminded me of Chris Argyris’s idea of double-loop learning.

Woods’s goal is to determine what properties a system must have, what type of architecture it needs, in order to achieve this second-order change process. He outlines in the paper that any such system must be a layered architecture of units that can adapt themselves and coordinate with each other, which he calls a tangled, layered network.

Woods believes there are properties that are fundamental to systems that exhibit sustained adaptability, which implies that these fundamental properties don’t change! A tangled, layered network may reconfigure itself in all sorts of different ways over time, but it must still be tangled and layered (and maintain other properties as well).

The more such systems stay the same, the more they change.