netflix – Surfing Complexity

A couple of people have asked me to share how I structure my OOPS write-ups. Here’s what they look like when I write them. This structure in this post is based on the OOPS template that has evolved over time inside of Netflix, with contributions from current and former members of the CORE team.

My personal outline looks like this (the bold sections are the ones that I include in every writeup)

Title
Executive summary
Background
Narrative description
- Prologue
- The trigger
- Impact
- Epilogue
Contributors/enablers
Mitigators
Risks
Challenges in handling

Title: OOPS-NNN: How we got here

Every OOPS I write up has the same title, “how we got here”. However, the name of the Google doc itself (different from the title) is a one-line summary, for example: “Server groups stuck in ‘deploying’ state”.

Executive summary

I start each write-up with a summary section that’s around three paragraphs. I usually try to capture:

When it happened
The impact
Explanation of the failure mode
Aspects about this incident that were particularly difficult

On <date>, from <start time> to <end time>, users were unable to <symptom>

The failure mode was triggered by an unintended change in <service> that led to <surprising behavior>.

The issue was made more difficult to diagnose/remediate due to a number of factors:

<first factor>
<second factor>
…

I’ll sometimes put the trigger in the summary, as in the example above. It’s important not to think of the trigger as the “root cause”. For example, if an incident involves TLS certificates expiring, then the trigger is the passage of time. I talk more about the trigger in the “narrative description” section below.

Background

It’s almost always the case that the reader will need to have some technical knowledge about the system in order to make sense of the incident. I often put in a background section where I provide just enough technical details to help the reader understand the rest of the writeup. Here’s an example background section:

Managed Delivery (MD) supports a GitOps-style workflow. For apps that are on Managed Delivery, engineers can make delivery-related changes to the app by editing a file in their app’s Stash repository called the delivery config.

To support this workflow, Managed Delivery must be able to identify when a new commit has happened to the default branch of a managed app, and read the delivery config associated with that commit.

The initial implementation of this functionality used a custom Spinnaker pipeline for doing these imports. When an application was onboarded to Managed Delivery, newt would create a special pipeline named import-delivery-config. This pipeline was triggered by commits to the default branch, and would execute a custom pipeline stage that would retrieve the delivery config from Stash and push it to keel, the service that powers Managed Delivery.

This solution, while functional, was inelegant: it exposed an implementation detail of Managed Delivery to end-users, and made it more difficult for users to identify import errors. A better solution would be to have keel identify when commits happen to the repositories of managed apps and import the delivery config directly. This solution was implemented recently, and all apps previously using pipelines were automatically migrated to the native git integration. As will be revealed in the narrative, an unexpected interaction involving the native git integration functionality contributed to this OOPS.

Narrative description

The narrative is the heart of the writeup. If I don’t have enough time to do a complete writeup, then I will just do an executive summary and a narrative description, and skip all of the other sections.

Since the narrative description is often quite long (over ten pages, sometimes many more), I break it up into sections and sub-sections. I typically use the following top-level sections.

Prologue
The trigger
Impact
Epilogue

Prologue

In every OOPS I’ve ever written up, implementation decisions and changes that happen well before the incident play a key role in understanding how the system got into a dangerous state. I use the Prologue section to document these, as well as describing how those decisions were reasonable when they happened.

I break the prologue up into subsections, and I include timeline information in the subsection headers. Here are some examples of prologue subsection headers I’ve used (note: these are from different OOPS writeups).

New apps with delivery configs, but aren’t on MD (5 months before impact)
Implementing the git integration (4 months before impact)
Always using the latest version of a platform library (4 months before impact)
A successful <foo> plugin deployment test (8 days before impact)
A weekend fix is deployed to staging (4 days before impact)
Migrating existing apps (3-4 days before impact)
A dependency update eludes dependency locking (1 day before impact)

I often use foreshadowing in my prologue section writeups. Her are some examples:

It will be several months before keel launches its first Titus Run Job orca task. Until one of those new tasks fails, nobody will know that a query against orca for task status can return a payload that keel is incapable of deserializing.

The scope of the query in step 2 above will eventually interact with another part of the system, which will broaden the blast radius of the operational surprise. But that won’t happen for another five months.

Unknown at the time, this PR introduced two bugs:
1. <description of first bug>
2. <description of second bug>
Note that the first bug masks the second. The first bug will become apparent as soon as the code is deployed to production, which will happen in three days. The second bug will lay unnoticed for eleven days.

The trigger

The “trigger” section is the shortest one, but I like to have it as a separate section because it acts as what my colleague J. Paul Reed calls a “pivot point”, a crucial moment in the story of the incident. This section should describe how the system transitions into a state where there is actual customer impact. I usually end the trigger section with some text in red that describes the hazardous state that the system is now in.

Here’s an example of a trigger section:

Trigger: a submitted delivery config

On <date>, at <time>, <name> commits a change to their delivery config that populates the artifacts section. With the delivery config now complete, they submit it to Spinnaker, then point their browser at the environments view of the <app> app, where they can observe Spinnaker manage the app’s deployment.

When <name> submits their delivery config, keel performs the following events:

receives the delivery config via REST API.
deserializes the delivery config from YAML into POJOs.
serializes the config into JSON objects.
writes the JSON objects to the database.

At this point, keel has entered a bad state: it has written JSON objects into the resource table that it will not be able to deserialize.

Impact

The impact section is the longest part of the narrative: it covers everything from the trigger until the system has returned to a stable state. Like the prologue section, I chunk it into subsections. These act as little episodes to make it easier for the reader to follow what’s happening.

Here are examples of some titles for impact subsections I’ve used:

User reports access denied on unpin
Pinning the library back
Maybe it’s gate?
Deploying the version with the library pinned back
Let’s try rolling back staging
Staging is good, let’s do prod
Where did the <X> headers go?
Rollback to main is complete
We’re stable, but why did it break?

For some incidents, I’ll annotate these headers with the timing, like I did in the prologue (e.g., “45 minutes after impact”).

Because so much of our incident coordination is over Slack these days, my impact section will typically have pasted screeenshots of Slack conversation snippets, interspersed with text. I’ll typically write some text that summarizes the interaction, and then paste a screenshot, e.g.:

<name> notes something strange in keel’s gradle.properties: it has multiple version parameters where it should only have one:

[Slack screenshot here]

The impact section is mostly written chronologically. However, because it is chunked into episodic subsections, sometimes it’s not strictly in chronological order. I try to emphasize the flow of the narrative over being completely faithful to the ordering of the events. The subsections often describe activities that are going on in parallel, and so describing the incident in the strict ordering of the events would be too difficult to follow.

Epilogue

I’ll usually have an epilogue section that documents work done in the wake of the incident. I split this into subsections as well. An example of a subsection: Fixing the dependency locking issue

Contributors/enablers

Here’s the guidance in the template for the contributors and enablers section:

Various contributors and enablers create vulnerabilities that remain latent in the system (sometimes for long periods of time). Think of these as things that had to be true in order for the incident to take place, or somehow made it worse.

This section is broken up into subsections, one subsection for each contributor. I typically write these at a very low-level of abstraction, where my colleague J. Paul Reed writes these at a higher level.

I think it’s useful to call the various contributors out explicitly because it brings home how complex the incident really was.

Here are some example subsection titles:

Violated assumptions about version strings
Scope of SQL query
Beans not scanned at startup after Titus refactor
Incomplete TitusClusterSpecDeserializer
Metadata field not populated for PublishedArtifact objects
Resilience4J annotations and Kotlin suspend functions
Transient errors immediately before deploying to staging
Artifact versioning complexity
Production pinned for several days
No attempts to deploy to production for several days
Three large-ish changes landed at about the same time
Holidays and travel
Alerts focus on keel errors and resource checks

Mitigators

The guidance we give looks like this:

Which factors helped reduce the impact of this operational surprise?

Like the contributors/enablers section, this is broken up into subsections. Here are some examples of subsection titles:

RADAR alerts caught several issues in staging
<name> recognized Titus API refactor as a trigger for an issue in production
<name> quickly diagnoses artifact metadata issue
<name>’s hypothesis about transactions rolling back due to error
<name> recognized query too broad
<name> notices spike in actuations

Risks

Here’s the guidance for this section from the template:

Risks are items (technical architecture or coordination/team related) that created danger in the system. Did the incident reveal any new risks or reinforce the danger of any known risks? (Avoid hindsight bias when describing risks.)

The risks section is where I abstract up some of the contributors to identify higher-level patterns. Here are some example risk subsection titles:

Undesired mass actuation
Maintaining two similar things in the codebase
Problems with dynamic configuration that are only detectable at runtime
Plugins that violate assumptions in the main codebase
Not deploying to prod for a while

Challenges in handling

Here’s the guidance for this section from the template:

Highlight the obstacles we had to overcome during handling. Was there anything particularly novel, confusing, or otherwise difficult to deal with? How did we figure out what to do? What decisions were made? (Capturing this can be helpful for teaching others how we troubleshoot and improvise).

In particular, were there unproductive threads of action? Capture avenues that people explored and mitigations that were attempted that did not end up being fruitful.

Sometimes it’s not clear what goes into a contributor and what goes into a challenge. You could put all of these into “contributors” and not write this section at all. However, I think it’s useful to call out what explicitly made the incident difficult to handle. Here are some example subsection headers:

Long time to diagnose and remediate
Limited signals for making sense of underlying problem
Error checking task status as red herring

Other sections

The template has some other sections (incident artifacts, follow-up items, timeline and links), but I often don’t include those in my own writeups. I’ll always do a timeline document as input for writing up the OOPS, and I will typically link it for reference, but I don’t expect anybody to read it. I don’t see the OOPS writeup as the right vehicle for tracking follow-up work, so I don’t put a section in it.

I often struggle to describe the project that I work on at my day job, even though it’s an open-source project that even has its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”. But that explanation does not yield much insight.

I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.

With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.

When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.

A common pattern we see is that service owners have a notion of an environment (e.g., test, staging, prod). For example, maybe they’ll deploy the code to test, and maybe run some automated tests against it, and if it looks good, they’ll promote to staging, and maybe they’ll do some manual tests, and if they’re happy, they’ll promote out to prod.

Imagine test, staging, and prod all have version v23 of the code running in it. After version v24 is cut, it will first be deployed in test, then staging, then prod. That’s how each version will propagate through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, human makes a judgment).

You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments changes over time. And you can describe this pattern declaratively, rather than imperatively like you would with traditional pipelines.

And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.

If you think of Kubernetes as a system for specifying desired state, Managed Delivery is a system for specifying how desired state evolves over time

With Managed Delivery, you can say express concepts like:

for a code version to be promoted to the staging environment, it must
- be successfully deployed to the test environment
- pass a suite of end-to-end automated tests specified by the app owner

and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.

And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting change from one from environment to the next.

Category: netflix

OOPS writeups