Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

John Graham-Cumming, Cloudflare’s CTO, wrote a detailed writeup of a Cloudflare incident that happened on 2019-07-02. Here’s a categorization similar to the one I did for the Stripe outage.

Note that Graham-Cumming has a “What went wrong” section in his writeup where he explicitly enumerates 11 different contributing factors; I’ve sliced things a little differently here: I’ve taken some of those verbatim, reworded some of them, and left out some others.

All quotes from the original writeup are in italics.

Contributing factors

Remember not to think of these as “causes” or “mistakes”. They are merely all of the things that had to be true for the incident to manifest, or for it to be as severe as it was.

Regular expression led to catastrophic backtracking

A regular expression used in a firewall engine rule resulted in catastrophic backtracking:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Simulated rules run on same nodes as enforced rules

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.
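The key detail is that “simulate” changes what happens to a rule’s verdict, not whether the rule executes. Here’s a hypothetical sketch of the idea (the structure and names are mine, not Cloudflare’s):

# Hypothetical sketch of simulate-mode evaluation; not Cloudflare's code.
# A simulated rule still runs against live traffic -- only enforcement differs --
# so a pathological regex burns CPU either way.
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    simulate: bool  # True: record the match, take no action

def should_block(rules: list[Rule], request_body: str) -> bool:
    for rule in rules:
        if rule.pattern.search(request_body):  # executes regardless of mode
            if rule.simulate:
                print(f"[simulate] {rule.name} would have matched")
            else:
                return True  # enforced rule: block the request
    return False

rules = [Rule("xss-candidate", re.compile(r"<script"), simulate=True)]
print(should_block(rules, "q=<script>alert(1)</script>"))  # False, but the regex ran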

Failure mode prevented access to internal services

But getting to the global WAF [web application firewall] kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel … And we couldn’t get to other internal services like Jira or the build system.

Security feature disables credentials after infrequent use of an operator interface

[O]nce we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently

Bypass mechanisms not frequently used

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

WAF changes are deployed globally

The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats … Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

The fact that WAF changes can only be done globally exacerbated the incident by increasing the size of the blast radius.

WAF implemented in Lua, which uses PCRE

Cloudflare makes use of Lua extensively in production … The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression.

The regular expression engine being used didn’t have complexity guarantees.

Based on the writeup, it sounds like they used the PCRE regular expression library because PCRE is the regex library available in their Lua environment, and Lua is the language they use to implement the WAF.
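For what it’s worth, here’s a sketch of one generic way to guard against a runaway expression when the engine itself offers no limits: run the match under a hard time budget in a process you can kill. This is just an illustration, not the protection that the writeup says was removed; a real WAF would want an engine-level limit rather than a process per match.

# Sketch of a runaway-regex guard: run the match in a killable process with a
# time budget. Illustrative only; not the protection described in the writeup.
import multiprocessing as mp
import re

def _match(pattern, text, q):
    q.put(bool(re.search(pattern, text)))

def search_with_timeout(pattern, text, timeout=0.1):
    q = mp.Queue()
    worker = mp.Process(target=_match, args=(pattern, text, q))
    worker.start()
    worker.join(timeout)
    if worker.is_alive():       # still backtracking after the budget expired
        worker.terminate()
        worker.join()
        return None             # "gave up" rather than hanging the caller
    return q.get()

if __name__ == "__main__":
    print(search_with_timeout(r"(a+)+b", "a" * 40))       # None: runaway, killed
    print(search_with_timeout(r"a+b", "a" * 40 + "b"))    # True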

Protection accidentally removed by a performance improvement refactor

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

Cloudflare dashboard and API are fronted by the WAF

Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

Mitigators

Paging alert quickly identified a problem

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.
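As a rough idea of what a synthetic test like this might look like, here’s a hypothetical sketch: an external probe sends a request that the WAF should block and treats any other outcome as a failure. The URL, payload, and expected status code are made up, not Cloudflare’s.

# Hypothetical synthetic WAF check, sketched from the description in the writeup.
# The target URL, payload, and expected status code are illustrative only.
import urllib.error
import urllib.parse
import urllib.request

PAYLOAD = urllib.parse.quote("<script>alert(1)</script>")
PROBE_URL = f"https://example.com/?q={PAYLOAD}"  # placeholder, not a real endpoint

def waf_probe_ok(url=PROBE_URL, timeout=5.0):
    """A healthy WAF should block an obviously malicious request (assume 403)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except urllib.error.HTTPError as e:
        return e.code == 403   # blocked as expected
    except urllib.error.URLError:
        return False           # timeout or connection failure: page the on-call
    return False               # request sailed through: the WAF isn't doing its job

if __name__ == "__main__":
    print("WAF probe healthy:", waf_probe_ok())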

Engineers recognized high severity based on alert pattern

This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

Existence of a kill switch

At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

Risks

Declarative program performance is hard to reason about

Regular expressions are examples of declarative programs (SQL is another good example of a declarative programming language). Declarative programs are elegant because you can specify what the computation should do without needing to specify how the computation can be done.

The downside is that it’s impossible to look at a declarative program and understand the performance implications, because there isn’t enough information in a declarative program to let you know how it will be executed! You have to be familiar with how the interpreter/compiler works to understand the performance implications of a declarative program. Most programmers probably don’t know how regex libraries are implemented.
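As a concrete example, the two patterns below describe exactly the same set of strings, but a backtracking engine treats them very differently, and nothing in the declarative specification tells you which behavior you’re going to get. (A small Python illustration; the same caveat applies to PCRE.)

# Two patterns that match exactly the same language ("one or more a's then b"),
# with wildly different behavior under a backtracking engine.
import re
import time

def time_match(pattern, text):
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

text = "a" * 22  # no trailing "b", so both match attempts fail

print("a+b     :", time_match(r"a+b", text))      # fails almost instantly
print("(a|a)+b :", time_match(r"(a|a)+b", text))  # ~2^22 backtracking paths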

Simulating in production environment

For rule-based systems, it’s enormously valuable for an engineer to be able to simulate what effect the rules will have before they’re put into effect, as it is generally impossible to reason about their impacts without doing simulation.

The more realistic the simulation is, the more confidence we have that the results of the simulation will correspond to the actual results when the rules are enabled in production.

However, doing the simulation in the production environment always carries some risk, because the simulation is itself a type of change, and all changes carry some risk.

Severe outages happen infrequently

[S]ome members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently.

To get to [internal services] we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

The irony is that when we only encounter severe outages infrequently, we don’t have the opportunity to exercise the muscles we need to use when these outages do happen.

Large blast radius

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

In the future, it sounds like non-emergency rule changes will be staged at Cloudflare. But the ability to push a change globally will still exist, because it’s needed for emergency rule changes. They can reduce the number of changes that have to be pushed globally, but they can’t drive it down to zero. This is an inevitable risk tradeoff.
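Here’s a sketch of that tradeoff, with the deployment path chosen by urgency. This is my illustration of the idea, not Cloudflare’s pipeline, and all of the names are made up.

# Hypothetical sketch of the rollout tradeoff: emergency rules keep the global
# fast path; everything else goes through progressively larger stages.
import time

STAGES = ["canary-pop", "one-region", "half-of-fleet", "global"]  # made-up stages

def push(rule_id, scope):
    print(f"pushing {rule_id} to {scope}")      # stand-in for the config store

def rollback(rule_id):
    print(f"rolling back {rule_id}")

def deploy_rule(rule_id, emergency, healthy):
    if emergency:
        push(rule_id, scope="global")           # seconds everywhere; big blast radius
        return
    for stage in STAGES:                        # staged: blast radius grows slowly
        push(rule_id, scope=stage)
        time.sleep(1)                           # stand-in for a real soak period
        if not healthy(stage):
            rollback(rule_id)
            raise RuntimeError(f"rollout halted at {stage}")

deploy_rule("xss-rule-42", emergency=False, healthy=lambda stage: True)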

Questions

Why hasn’t this happened before?

You’re not generally supposed to ask “why” questions, but I can’t resist this one. Why did it take so long for this failure mode to manifest? Hadn’t any of the engineers at Cloudflare previously written a rule that used a regex with pathological backtracking behavior? Or had they, and the protection against excessive CPU load kept it from mattering until that refactor removed it?

What was the motivation for the refactor?

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

What was the reason they were trying to make the WAF use less CPU? Were they trying to reduce the cost by running on fewer nodes? Were they just trying to run cooler to reduce the risk of running out of CPU? Was there some other rationale?

What’s the history of WAF rule deployment implementation?

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout. [emphasis added]

The WAF rule system is designed to support quickly pushing rules out globally to protect against new attacks. However, not all rule changes require quick global deployment. Yet this was the only deployment mechanism the WAF supported, even though code changes go through a staged rollout.

The writeup simply mentions this as a contributing factor, but I’m curious as to how the system came to be that way. For example, was it originally designed with only quick rule deployment in mind? Were staged code deploys introduced into Cloudflare only after the WAF system was built?

Other interesting notes

The WAF rule update was normal work

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. 

Based on the writeup, it sounds like this was a routine change. It’s important to keep in mind that incidents often occur as a result of normal work.

Multiple sources of evidence to diagnose CPU issue

The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble.

It was interesting to read how they triangulated on high CPU usage using multiple data sources.

Normative language in the writeup

Emphasis added in bold.

We know how much this hurt our customers. We’re ashamed it happened.

The rollback plan required running the complete WAF build twice taking too long.

The first alert for the global traffic drop took too long to fire.

We didn’t update our status page quickly enough.

Normative language is one of the three analytical traps in accident investigation. If this were an internal writeup, I would avoid the language criticizing the rollback plan, the alert configuration, and the status page decision, and instead ask questions about how these came to be, such as:

Was this the first time the rollback plan was carried out? (If so, that may explain why it wasn’t known how long it would take.)

Is the global traffic drop alert configured like the other alerts, or differently? If it’s different (e.g., do other alerts fire faster?), what led to it being different? If it’s similar to the other alert configurations, that would explain why it was configured in a way that turned out to be “too long”.

Work to reduce CPU usage contributed to excessive CPU usage

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

These sorts of unintended consequences are endemic when making changes within complex systems. It’s an important reminder that the interventions we implement to prevent yesterday’s incidents from recurring may contribute to completely new failure modes tomorrow.
