Poor Cloudflare! It was less than a month ago that they suffered a major outage (I blogged about that here), and then, yesterday, they had another outage. This one was much shorter (~25 minutes versus ~190 minutes), but was significant enough for them to do a public write-up. Let’s dive in!
Trying to make it better made it worse
As part of our ongoing work to protect customers who use React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.
The work that triggered this incident was a change they were deploying to protect users against a recently disclosed vulnerability. This is exactly the type of change you would hope your protection-as-a-service vendor would make!
Burned by a system designed to improve reliability
We have a killswitch subsystem as part of the rulesets system which is intended to allow a rule which is misbehaving to be disabled quickly…
We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident. However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied … an error was … encountered while processing the overall results of evaluating the ruleset… Lua returned an error due to attempting to look up a value in a nil value.
(emphasis added)
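To make the failure mode concrete, here's a toy sketch of the kind of bug being described. It's Python rather than the Lua that Cloudflare's FL1 proxy actually runs, and every name in it is made up: the killswitch skips a rule during evaluation, but the code that processes the overall results still assumes every “execute” rule produced a result, and dereferencing the missing entry blows up much like indexing a nil value does in Lua.

```python
# Hypothetical sketch of a ruleset evaluator -- NOT Cloudflare's actual code.
from dataclasses import dataclass

@dataclass
class Outcome:
    blocked: bool

def run_ruleset(name: str, request: str) -> Outcome:
    # Stand-in for evaluating a child ruleset referenced by an "execute" rule.
    return Outcome(blocked="attack" in request)

RULES = [
    {"id": "simple-block", "action": "block", "killswitched": False},
    {"id": "managed-waf", "action": "execute", "ruleset": "managed", "killswitched": True},
]

def evaluate(rules, request):
    child_results = {}
    for rule in rules:
        if rule["killswitched"]:
            continue  # the killswitch skips the rule entirely...
        if rule["action"] == "execute":
            child_results[rule["ruleset"]] = run_ruleset(rule["ruleset"], request)

    # ...but this result-processing step assumes every "execute" rule produced
    # a result. For the killswitched rule the lookup yields None, and the
    # attribute access below fails -- the Python analogue of Lua's
    # "attempt to index a nil value".
    for rule in rules:
        if rule["action"] == "execute":
            outcome = child_results.get(rule["ruleset"])
            if outcome.blocked:  # AttributeError when outcome is None
                return "block"
    return "allow"

try:
    evaluate(RULES, "GET /innocuous")
except AttributeError as err:
    print(f"request fails with an internal error (HTTP 500): {err}")
```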
I am a huge fan of killswitches as a reliability feature. Systems are always eventually going to behave in ways you didn’t expect, and when that happens, the more knobs you have available to alter the behavior of the system at runtime, the better positioned you will be to deal with an unexpected situation. This blog post calls out how the killswitch has helped them in the past.
But all reliability features add complexity, and that additional complexity can create new and completely unforeseen failure modes. Unfortunately, such is the nature of complex systems.
We don’t eliminate the risks, we trade them off
We made an unrelated change that caused a similar, longer availability incident two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base. (emphasis added)
In all of our decisions and actions, we’re making risk trade-offs, even if we’re not explicitly aware of it. We’ve all faced this scenario when applying a security patch for an upstream dependency: this sort of take-action-to-block-a-vulnerability work is a good example of a security vs. reliability trade-off.
“Fail-Open” Error Handling: As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests.
There are many examples of the security vs. reliability trade-off, such as “do we fail closed or fail open?” (explicitly called out in this post!) or “who gets access by default to the systems that we might need to make changes to during an incident”?
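To illustrate the fail-open idea with a sketch (entirely hypothetical config-loading code, with made-up names and limits, nothing to do with Cloudflare's actual implementation): when handed a corrupt or out-of-range configuration, the loader logs the problem and keeps serving with the last known-good configuration rather than refusing traffic.

```python
# Hypothetical illustration of fail-open config handling -- not Cloudflare's logic.
import json
import logging

log = logging.getLogger("config-loader")

MAX_BUFFER_BYTES = 1 * 1024 * 1024  # illustrative feature cap

def load_config(raw: str, known_good: dict) -> dict:
    """Return the new config if it's valid, otherwise fall back (fail open)."""
    try:
        candidate = json.loads(raw)
        buffer_size = candidate["buffer_size_bytes"]
        if not (0 < buffer_size <= MAX_BUFFER_BYTES):
            raise ValueError(f"buffer_size_bytes {buffer_size} exceeds cap")
        return candidate
    except (KeyError, ValueError) as err:  # JSON decode errors are ValueErrors
        # Fail open: log loudly, keep the last known-good configuration,
        # and keep passing traffic rather than returning errors.
        log.error("rejecting new config, keeping known-good: %s", err)
        return known_good

known_good = {"buffer_size_bytes": 128 * 1024}
print(load_config('{"buffer_size_bytes": 10485760}', known_good))  # over the cap -> falls back
print(load_config('{"buffer_size_bytes": 1048576}', known_good))   # valid -> adopted
```

A fail-closed loader would instead stop serving requests on a bad config, which is safer against a malicious configuration but is exactly the kind of choice that turns a bad push into a global outage.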
Even more generally, there are the twin risks of “taking action” and “not taking action”. People like to compare doing software development on a running system to rebuilding the plane while we’re flying it, a simile which infuriates my colleague (and amateur pilot) J. Paul Reed. I think performing surgery is actually a better analogy. We’re making changes to a complex system, but the reason we’re making changes is that we believe the risk of not making changes is even higher.
I don’t think it’s a coincidence that the famous treatise How complex systems fail was written by an anesthesiologist, and that one of his observations in that paper was that all practitioner actions are gambles.
The routine as risky
The triggering change here was an update to the ruleset in their Web Application Firewall (WAF) product. Now, we can say with the benefit of hindsight that there was risk in making this change, because it led to an incident! But we can also say that there was risk in making this change, because there is always risk in making any change. As I’ve written previously, any change can break us, but we can’t treat every change the same.
How risky was this change understood to be in advance? There isn’t enough information in the writeup to determine this; I don’t know how frequently they make these sorts of rule changes to their WAF product. But, given the domain that they work in, I suspect rule changes happen on a regular basis, and presumably this was seen as a routine sort of ruleset change. We all do our best to assess the risk of a given change, but our own internal models of risk are always imperfect. We can update them over time, but there will always be gaps, and those gaps will occasionally bite us.
Saturation rears its ugly head once again
Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB...
…we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.
During rollout, we noticed that our internal WAF testing tool did not support the increased buffer size. As this internal test tool was not needed at that time and had no effect on customer traffic, we made a second change to turn it off...
Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
(emphasis added)
I’ve written frequently about saturation as a failure mode in complex systems, where some part of the system gets overloaded or otherwise reaches some sort of limit. Here, the failure mode was not saturation: it was a logic error in a corner case that led to the Lua equivalent of a null pointer exception, which resulted in 500 errors being served.
Fascinatingly, saturation still managed to play a role in this incident. Here, there was an internal testing tool that became, in some sense, saturated: it couldn’t handle the 1MB buffer size that was needed for this analysis.
As a consequence, Cloudflare engineers turned the tool off, and it was that second change, disabling the WAF rule testing tool, that triggered the error state.
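Here's a toy illustration of that kind of limit, again with made-up names and numbers rather than anything from Cloudflare's codebase: a helper built around a 128KB assumption suddenly being asked to handle 1MB.

```python
# Hypothetical sketch of a tool "saturating" on a baked-in limit.
TOOL_MAX_BODY_BYTES = 128 * 1024  # an assumption from years ago

def test_rule_against_body(rule: str, body: bytes) -> bool:
    if len(body) > TOOL_MAX_BODY_BYTES:
        raise ValueError(
            f"body of {len(body)} bytes exceeds supported {TOOL_MAX_BODY_BYTES}"
        )
    return rule in body.decode(errors="ignore")

try:
    test_rule_against_body("<script>", b"x" * (1024 * 1024))  # new 1MB buffer
except ValueError as err:
    print(f"testing tool can't keep up: {err}")
```

The limit isn't reached by a surge in traffic, it's reached by a configuration change, but the effect is the same: a component hits a ceiling nobody was thinking of as a ceiling.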
Addressing known reliability risks takes time
In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur…
We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet. (emphasis added)
I wrote above that our risk models are never perfectly accurate. However, even in the areas where our risk models are accurate, accuracy alone isn’t sufficient for mitigating risk! Addressing known risks can involve substantial engineering effort. Rolling out the new tech takes time, and, as I’ve pointed out here, the rollout itself also carries risk of triggering an outage!
This means we should expect to frequently encounter incidents involving a known risk, where work to address that risk was in flight but had not yet landed before the risk manifested once again as an incident.
I wrote about this previously in a post titled Cloudflare and the infinite sadness of migrations.
There’s nothing more random than clusters
This is a straightforward error in the code, which had existed undetected for many years. (emphasis added)
Imagine I told you that there was an average of 12 plane crashes a year, and that crash events were independent, and that they were uniformly distributed across the year (i.e., they were just as likely to happen on one day as any other day). What is the probability of there being exactly one plane crash each month?
Let’s simplify the math by assuming each month is of equal length. It turns out that this probability is 12!/12^12 ≈ 0.005%.
The numerator is the number of different ways you can order 12 crashes so that there’s one in each month, and the denominator is all possible ways you can distribute 12 crashes across 12 months.
This means that the probability that there’s at least one month with multiple crashes is about 99.995%. You’ll almost never get a year with exactly one crash a month, even though one crash a month is the average. Now, we humans will tend to look at one of these months with multiple crashes and assume that something has gotten worse, but having multiple crashes in at least one month of the year is actually the overwhelmingly likely outcome.
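If you want to check the arithmetic, here's a quick Python snippet that computes the exact probability and sanity-checks it with a simulation:

```python
import math
import random

# Exact probability that 12 independent, uniformly placed crashes land
# exactly one per month: 12! orderings with one crash per month, out of
# 12^12 equally likely assignments of crashes to months.
exact = math.factorial(12) / 12**12
print(f"P(exactly one crash per month) = {exact:.6%}")    # ~0.0054%
print(f"P(some month has multiple)     = {1 - exact:.4%}")  # ~99.995%

# Monte Carlo check: assign each of 12 crashes to a random month and
# count how often every month gets exactly one.
trials = 1_000_000
hits = 0
for _ in range(trials):
    months = [random.randrange(12) for _ in range(12)]
    if len(set(months)) == 12:
        hits += 1
print(f"Simulated estimate             = {hits / trials:.6%}")
```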
I bring this up because Cloudflare just had two major incidents within about a month of each other. This will almost certainly lead people to speculate about “what’s going on with reliability at Cloudflare???” Now, I don’t work at Cloudflare, and so I have no insight into whether there are genuine reliability issues there or not. But I do know that clusters of incidents are very likely to happen by random chance alone, and I think it would be a mistake to read too much into the fact that they had back-to-back incidents.