Contributors, mitigators & risks: Stripe 2019-07-10 outage

Stripe’s CTO, David Singleton, did a detailed narrative writeup of the incident they had on 2019-07-10. I love narrative descriptions of incidents, and there’s a ton of great detail here.

As an exercise, using the writeup, I collected some aspects of the incident into the following sections:

Contributing factors: what were all of the conditions that had to be present for the outage to happen, or for it to be as severe as it was? It’s important not to think of these as causes, or mistakes, or even bad things.

Mitigators: What kept the incident from being worse than it was?

Risks: What are the more general risks that this incident reveals?

In an ideal world, I’d talk to the people involved directly to get more details, but we work with what we have. I don’t summarize the incident here, so I recommend reading the Stripe writeup first.

Text in italics is copied verbatim from the writeup.

Contributing factors

Minor database version upgrade

Three months ago, we upgraded our databases to a new minor version. … [T]he new version … introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes.

One shard had multiple stalled nodes

Two nodes became stalled for yet-to-be-determined reasons. … a subtle fault in the database’s failover system … only manifested in the presence of multiple stalled nodes. On the day of the events, one shard was in the specific state that triggered this fault, and the shard was unable to elect a new primary.

Stalled nodes reported as healthy

 These [stalled] nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The database nodes support health checks, but these health checks did not detect the problem. We aren’t provided with any more details about the health check failure mode.

Database writes time out when shard has no elected primary

Without a primary, the shard was unable to accept writes. Applications that write to the shard began to time out. 

Problem manifested on a critical shard

Stripe splits data by kind into different database clusters and by quantity into different shards. Because of widespread use of this shard across applications, including the API, the unavailability of this shard … cascaded into a severe API degradation.

Based on this description, it sounds like this incident would have been much less severe if the problem had manifested on a shard other than this one.

We don’t know how it was that this particular shard was the one where the nodes stalled. It might just be bad luck. Sometimes, that’s the only difference between an incident and a surprise.

Timeouts lead to compute resource starvation

Applications that write to the shard began to time out. Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API

Novel, complex failure mode

  • [2019-07-10 16:36 UTC] Our team was alerted and we began incident response.
  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.

Because this was a complex failure mode that we had not previously experienced, we needed to diagnose the underlying cause and determine the steps to remediate.

The language in the writeup suggests that the complexity and novelty of the failure mode made it more difficult for them to diagnose the problem. However, the timeline suggests that it took them about 14 minutes to figure out that the database was in a bad state. That sounds pretty good to me(!).

Remediation required restarting database cluster

  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.
  • [2019-07-10 17:00 UTC] We restarted all nodes in the database cluster, resulting in a successful election.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.

Our team identified forcing the election of a new primary as the fastest remediation available, but this required restarting the database cluster. 

The “but” in the sentence above suggests that restarting the database cluster was not an ideal remediation strategy, but there isn’t a rationale given for why it isn’t. It’s not clear from the timeline how long it took to reboot all of the database nodes: it looks like it could be 2 minutes, which sounds pretty quick to me.

Rolling back database version as remediation strategy

  • [2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

After mitigating user impact, we investigated the root cause and identified a likely code path in a new version of the database’s election protocol. We decided to revert to the previous known stable version for all shards of the impacted cluster.  We deployed this change within four minutes, and until 21:14 UTC the cluster was healthy.

In the moment, rolling back the database version was clearly the rational action to take. Unfortunately for Stripe… well, see the next contributing factor.

(Also, 4 minutes sounds pretty quick for reverting that database version!)

Recent configuration change to affected shards

 [T]he second period of degradation had a different cause: our revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. This interaction resulted in CPU starvation on all affected shards.

We don’t have additional information about this configuration change: presumably it happened after the new version of the database had been deployed.

Second, novel failure mode had same symptoms as the first failure mode

We initially assumed that the same issue had reoccurred on multiple shards, as the symptoms appeared the same as the earlier event. We therefore followed the same mitigation playbook that succeeded earlier.

Mitigators

Unfortunately, there’s not much detail in the writeup to identify the mitigating factors that were in play here, which is a shame, because it sounds like Stripe was able to employ a lot of expertise in order to effectively diagnose and remediate the problems they encountered. And there’s just as much that we can learn from what went right.

Monitoring quickly detected a database problem

Automated monitoring detected the failed election within a minute.

Quick engagement

We began incident response within two minutes.

Risks

Gray failure (sensor problem)

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The stalled database nodes passed their health checks. This is a classic example of a gray failure, where there’s some internal failure but it isn’t detected by the internal failure detector.

Gray failures are pernicious because it’s very difficult (perhaps impossible?) to design a system that can handle failures that it cannot detect. These can also be hard to diagnose, because some of the sensors we are depending on to tell us about the state of the world are not giving us the complete story. We have to depend on integrating multiple sources of data, none of which are ever completely reliable.
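To make the idea of integrating multiple signals a bit more concrete, here is a minimal sketch of a health evaluation that treats a missing replication-lag metric as a signal in its own right, rather than trusting the active check alone. The names (NodeSignals, classify) and the 60-second staleness threshold are hypothetical; the Stripe writeup doesn't describe how their health checks actually work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeSignals:
    """Hypothetical per-node observations; not Stripe's actual data model."""
    active_check_ok: bool                       # did the node answer its health check?
    seconds_since_lag_metric: Optional[float]   # None if no lag metric was ever seen


def classify(signals: NodeSignals, max_metric_age: float = 60.0) -> str:
    """Combine two imperfect sensors into one verdict.

    The 60-second default is an arbitrary assumption for illustration.
    """
    metric_fresh = (
        signals.seconds_since_lag_metric is not None
        and signals.seconds_since_lag_metric <= max_metric_age
    )
    if signals.active_check_ok and metric_fresh:
        return "healthy"
    if signals.active_check_ok and not metric_fresh:
        # The sensors disagree: the node answers checks but has gone quiet.
        return "suspect"
    return "unhealthy"
```

A node that keeps answering active checks but has stopped emitting lag metrics, like the stalled nodes described here, would land in the "suspect" bucket. This doesn't make the underlying risk go away: the staleness sensor is just one more data source that can itself be wrong.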

Service with many dependents goes latent

Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API and cascaded into a severe API degradation.

It’s very difficult to reason about how a distributed system behaves when one of the services goes latent (that’s one of the value propositions of chaos engineering approaches, like ChAP). When a service has many dependents, a latency increase can ripple across the system with dire consequences.
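One way to build intuition for why latency is so dangerous is Little’s law: the average number of in-flight requests equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies it with made-up numbers. The request rate, latencies, and worker count are all assumptions for illustration, not figures from the writeup.

```python
def in_flight(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return requests_per_second * seconds_per_request


# All numbers below are invented for illustration.
WORKERS = 200                      # hypothetical pool of API workers
normal = in_flight(1000, 0.05)     # 50 ms per request
degraded = in_flight(1000, 5.0)    # requests now wait on a 5 s timeout

print(f"normal: about {normal:.0f} busy workers out of {WORKERS}")
print(f"degraded: about {degraded:.0f} workers needed, {WORKERS} available")
```

With the same traffic, a hundredfold increase in time per request means a hundredfold increase in the concurrency needed to serve it. Once that demand exceeds the pool, requests that never touch the slow dependency start queueing behind the ones that do.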

Interaction vulnerabilities that involve rare events

As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers. However, the new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. 

Phased rollouts are a great way to build confidence when rolling out new changes. However, in some cases the condition that will trigger a failure mode doesn’t occur often enough to be caught during a phased rollout process.

In this case, the triggering event was when multiple nodes were stalled. That was an uncommon enough event that it didn’t happen during the phased deployment.
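A back-of-the-envelope calculation shows how easily a rare trigger can slip through. Assume, purely for illustration, that the triggering condition occurs independently on any given day with some small probability p; then the chance of never seeing it over a window of n days is (1 - p) to the power n. The probability used below is invented, and real triggers are unlikely to be independent from one day to the next.

```python
def chance_of_missing(daily_probability: float, days: int) -> float:
    """Chance that an event never occurs in `days` days, assuming independent days."""
    return (1.0 - daily_probability) ** days


# Hypothetical: the trigger shows up about once a year on average.
p = 1 / 365
for days in (14, 90):
    chance = chance_of_missing(p, days)
    print(f"{days} days: {chance:.0%} chance of never seeing the trigger")
```

Under these assumptions, even three months in production leaves a better-than-even chance that the trigger never fires, which fits a rollout that looked healthy throughout.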

Remediations can introduce new failure modes

Incident reviews generally produce action items that are intended to ensure that the same problem doesn’t recur. The risk with these remediations is that they introduce entirely new problems. In this particular case, the database rollback, a remediation action item, introduced a new failure mode.

We should remediate known problems! But we should also always be mindful that focusing too much on “let’s make sure this failure mode can never happen again” can crowd out questions like “how might these proposed remediations lead to new failure modes?”

(I don’t fault Stripe for their actions in this case: I’m quite certain I would have taken the same action as they did in rolling back the database version to the last known good one.)

Stripe outlined the following remediation actions going forward.

We are also introducing several changes to prevent failures of individual shards from cascading across large fractions of API traffic. This includes additional circuit-breaking on failed operations to particular clusters, including the one implicated in these events. We will also pursue additional fault isolation techniques to contain the impact of a single failed shard and limit resource consumption by clients attempting repeated retries of failed requests.

It’s not hard to imagine that these new circuit breaker, fault isolation, and resource consumption limiting strategies may introduce new and even more complex failure modes.
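For readers unfamiliar with the pattern, here is a minimal sketch of a circuit breaker. It is a generic textbook version with hypothetical names and thresholds, not Stripe’s implementation, which the writeup doesn’t describe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency marked down")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Even this toy version adds state (closed, open, half-open) and two tunable parameters that the system didn’t have before. A threshold set too low can cut off a healthy dependency after a brief blip, and one set too high may never trip: exactly the kind of new behavior that is hard to reason about ahead of time.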

Same symptoms, different problem

  • [2019-07-10 16:35 UTC] The first period of degradation started when the primary node for the database cluster failed.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
  • [2019-07-10 21:14 UTC] We observed high CPU usage in the database cluster. The Stripe API started returning errors for users, marking the start of a second period of severe degradation.

Anyone who has done operations work before will tell you that if you get paged the same day with the same symptoms, you are going to assume it is a reoccurrence of the issue you just remediated. And, usually, it is. But sometimes it isn’t, and that’s what happened in this case.

Rollback leads to unexpected interaction

You can never really roll back a distributed system to a previous state. A rollback, like any other kind of change, can have unexpected consequences due to interactions with other parts of the system that have since changed. It’s easy to forget this, especially since rollbacks are usually effective as a remediation strategy!

Unanswered questions

Here are some questions I had that aren’t addressed in the writeup.

What was the rationale for the original migration?

The writeup doesn’t describe the rationale for upgrading the databases to a new minor version in the first place. Was it to fix an ongoing issue? To leverage a new feature? Good hygiene in keeping versions up to date?

How did they identify that nodes were stalled?

Did the identification of stalled nodes happen in-the-moment, or was this part of post-incident investigation? How did they diagnose that nodes were stalled?

How did they diagnose the failure mode?

How did they figure out that the failure mode was an interaction between stalled nodes and the new version of the database?

Were there risks associated with restarting the entire database cluster?

In hindsight, restarting the database cluster was the right thing to do. How did things look in the moment?

What database are they using?

The writeup doesn’t say which database they were using, and which versions they were running and upgraded to.

What was the problem with the database node health checks?

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

What led to the database node health checks reporting healthy for stalled nodes?

How did the two database nodes stall?

Two nodes became stalled for yet-to-be-determined reasons.

Stripe didn’t know the answer to this question at the time of the writeup.

What was the nature of the configuration change?

[O]ur revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. 

Stripe doesn’t provide any details on the nature of this configuration change. What was changed? What was the rationale for the change? When did it happen?

How did the configuration change interact with the database version rollback?

This interaction resulted in CPU starvation on all affected shards.

The writeup doesn’t provide any details into the nature of the CPU starvation other than what is written above.

How did they diagnose the configuration change as a contributing factor?

 Once we observed the CPU starvation, we were able to investigate and identify the root cause. 

How did they investigate this? Where did they look? How did they trace it back to the config change?

Final thought

The database upgrade was normal work

Three months ago, we upgraded our databases to a new minor version. As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers.

It doesn’t sound like there was anything atypical about the way they rolled out the new database version. Incidents often happen as a result of normal work! Even more often, though, incidents don’t happen as a result of normal work.

For the rollback, my impression from the writeup is that it was rolled out more quickly than normal database version changes, as a remediation for a known problem for a critical database shard.

What Deming got wrong

One of my Father’s Day presents this year was The Essential Deming, an anthology of Deming’s shorter writings. I thoroughly enjoyed Deming’s Out of the Crisis and was looking to read more from him.

Reading this book, I was surprised to discover that Deming was opposed to workers training workers, which he considered a faulty practice. The most effective way to become an expert is through the apprenticeship model, where a novice works alongside an expert and directly observes how the expert does their work. That Deming would reject this model, and would believe that an outsider could more effectively train a worker than someone who actually does the day-to-day work is, frankly, bizarre.

Deming also asserted that the most effective teachers at a university were the professors that were the best researchers. This also struck me as an odd claim, and one I’m extremely skeptical of.

Deming had a deep understanding of systems thinking, and the importance of holistic, expert judgment (he used the term leadership) over chasing metrics, and that shines through in this book. However, while Deming seemed to recognize the value of expertise, he did not seem to have a good understanding of how people acquire it.

Why incidents can’t be monocausal

When an incident happens, the temptation is strong to identify a single cause. It’s as if the system is a chain, and we’re looking for the weak link that was responsible for the chain breaking. But, in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system, how many opportunities there are for problems (bugs, typo in config, bad data, …). If any one of these was enough to take down the system, then it would be like a house of cards, falling down all of the time.

What happens in successful organizations is that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.

Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “what was the root cause for the success of this feature?”

So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.

For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.