Quick takes on the recent OpenAI public incident write-up

OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:

Saturation

With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters.

The term saturation describes the condition where a system has reached the limit of what it can handle; this is sometimes referred to as overload or resource exhaustion. In the OpenAI incident, the Kubernetes API servers became saturated because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly, and as a consequence, their DNS-based service discovery mechanism ultimately failed.

Saturation is an extremely common failure mode in incidents, and here OpenAI provides us with yet another example. You can also read some previous posts about public incident writeups involving saturation: Cloudflare, Rogers, and Slack.
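Saturation has a simple mathematical core: once the arrival rate exceeds the service rate, the backlog grows without bound. Here's a minimal sketch of that dynamic as a toy queueing model (the function and all the numbers are hypothetical, not from the incident):

```python
# Toy model of saturation: a server that can process `capacity` requests
# per tick. Below capacity the queue stays empty; above it, the backlog
# (and therefore latency) grows without bound.

def simulate(arrival_rate, capacity, ticks):
    """Return the queue depth at the end of each tick."""
    backlog = 0
    depths = []
    for _ in range(ticks):
        backlog += arrival_rate            # new requests arrive
        backlog -= min(backlog, capacity)  # server drains what it can
        depths.append(backlog)
    return depths

# Below saturation: the queue drains completely every tick.
print(simulate(arrival_rate=80, capacity=100, ticks=5))   # [0, 0, 0, 0, 0]

# Past saturation: the backlog grows linearly and never recovers.
print(simulate(arrival_rate=120, capacity=100, ticks=5))  # [20, 40, 60, 80, 100]
```

The nasty property of the second regime is that there's no graceful degradation: the system doesn't get a little slower, it falls behind forever.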

All tests pass

The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue.

One reason it’s difficult to prevent saturation-related incidents is that all of the software can be functionally correct, in the sense that it passes all of the functional tests, while the failure mode only rears its ugly head once the system is exposed to conditions that occur solely in the production environment. Even canarying with production traffic can’t prevent problems that only occur under full load.
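The scale-dependence here is easy to see with some back-of-the-envelope arithmetic. Per-node load is identical in staging and production, but aggregate load on the API servers scales with cluster size, so only the large clusters cross the limit. All numbers below are hypothetical, purely to illustrate the shape of the problem:

```python
# Hypothetical numbers: why a staging cluster passes while a large
# production cluster saturates the control plane, even though every
# node generates exactly the same per-node load.

API_SERVER_CAPACITY_RPS = 5_000  # assumed control-plane limit
REQUESTS_PER_NODE_RPS = 2        # assumed per-node telemetry load

def aggregate_load(node_count):
    return node_count * REQUESTS_PER_NODE_RPS

def cluster_survives(node_count):
    return aggregate_load(node_count) <= API_SERVER_CAPACITY_RPS

print(cluster_survives(50))      # staging-sized cluster: True
print(cluster_survives(10_000))  # large production cluster: False
```

No functional test on the small cluster can reveal the second result; the bug lives in the multiplication, not in any single node's behavior.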

Our main reliability concern prior to deployment was resource consumption of the new telemetry service. Before deployment, we evaluated resource utilization metrics in all clusters (CPU/memory) to ensure that the deployment wouldn’t disrupt running services. While resource requests were tuned on a per cluster basis, no precautions were taken to assess Kubernetes API server load. This rollout process monitored service health but lacked sufficient cluster health monitoring protocols.

It’s worth noting that the engineers did validate the change in resource utilization on the clusters where the new telemetry configuration was deployed. The problem was an interaction: it increased load on the API servers, which brings us to the next point.

Complex, unexpected interactions

This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways.

When we look at system failures, we often look for problems in individual components. But in complex systems, identifying the complex, unexpected interactions can yield better insights into how failures happen. You don’t just want to look at the boxes, you also want to look at the arrows.

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

So, we rolled out the new telemetry service, and, yada yada yada, our services couldn’t call each other anymore.

In this case, the surprising interaction was between a failure of the Kubernetes API and the resulting failure of services running on top of Kubernetes. Normally, if the Kubernetes API becomes unhealthy, services running on top of Kubernetes should keep running normally; you just can’t make changes to your current deployment (e.g., deploy new code, change the number of pods). In this incident, however, a failure in the Kubernetes API (control plane) ultimately led to failures in the behavior of running services (data plane).
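The control-plane/data-plane split can be sketched in a few lines. In this toy model (all names hypothetical), pods that are already running keep serving requests when the API is down; only operations that need the control plane fail:

```python
# Toy model of the control-plane/data-plane split: existing pods keep
# serving traffic when the Kubernetes API is down; only operations that
# change desired state (e.g. scaling) require a healthy control plane.

class Cluster:
    def __init__(self, pods):
        self.pods = pods          # data plane: running workloads
        self.api_healthy = True   # control plane health

    def serve_request(self):
        # The data plane does not consult the control plane per request.
        return "ok" if self.pods > 0 else "no capacity"

    def scale(self, replicas):
        # Changing desired state requires a healthy control plane.
        if not self.api_healthy:
            raise RuntimeError("control plane unavailable")
        self.pods = replicas

cluster = Cluster(pods=3)
cluster.api_healthy = False
print(cluster.serve_request())  # "ok" -- existing pods still serve
# cluster.scale(5) would raise: no deployment changes while the API is down
```

The surprise in this incident was that the clean separation sketched above didn't hold: there was a hidden arrow from control plane to data plane.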

The coupling between the two? It was DNS.

DNS

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

Impact of a change is spread out over time

DNS caching added a delay between making the change and when services started failing.

One of the things that makes DNS-related incidents difficult to deal with is the nature of DNS caching.

When the effect of a change is spread out over time, this can make it more difficult to diagnose what the breaking change was. This is especially true when the critical service that stopped working (in this case, service discovery) was not the thing that was changed (telemetry service deployment).

DNS caching made the issue far less visible until the rollouts had begun fleet-wide.

In this case, the effect was spread out over time because of the nature of DNS caching. But often we intentionally spread out a change over time because we want to reduce the blast radius if the change we are rolling out turns out to be a breaking change. This works well if we detect the problem during the rollout. However, this can also make it harder to detect the problem, because the error signal is smaller (by design!). And if we only detect the problem after the rollout is complete, it can be harder to correlate the change with the effect, because the change was smeared out over time.
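A sketch of the caching dynamic: after the control plane breaks, cached DNS records keep resolving until their TTL expires, so the rollout looks healthy for a while. The resolver class and the names below are hypothetical, purely illustrative:

```python
# Sketch of how a DNS cache delays visible failure: cached records keep
# resolving after the upstream (control-plane-backed service discovery)
# has already broken, until the TTL expires.

class CachingResolver:
    def __init__(self, upstream, ttl):
        self.upstream = upstream  # callable: name -> address (may fail)
        self.ttl = ttl
        self.cache = {}           # name -> (address, expiry_time)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]                # fresh cache hit: no upstream call
        address = self.upstream(name)      # fails once discovery is down
        self.cache[name] = (address, now + self.ttl)
        return address

control_plane_up = True
def upstream(name):
    if not control_plane_up:
        raise RuntimeError("service discovery unavailable")
    return "10.0.0.1"

resolver = CachingResolver(upstream, ttl=300)
resolver.resolve("payments.internal", now=0)           # populate the cache
control_plane_up = False                               # discovery breaks
print(resolver.resolve("payments.internal", now=100))  # still "10.0.0.1"
# At now=400 the cached record has expired and resolution fails.
```

The gap between `now=100` (everything looks fine) and `now=400` (everything fails) is exactly the window in which the rollout continued fleet-wide.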

Failure mode makes remediation more difficult

 In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Sometimes the failure mode that breaks systems that production depends upon also breaks systems that operators depend on to do their work. I think James Mickens said it best when he wrote:

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS

Facebook encountered similar problems when they experienced a major outage back in 2021:

And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this. 

This type of problem often requires that operators improvise a solution in the moment. The OpenAI engineers pursued multiple strategies to get the system healthy again.

We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:

  1. Scaling down cluster size: Reduced the aggregate Kubernetes API load.
  2. Blocking network access to Kubernetes admin APIs: Prevented new expensive requests, giving the API servers time to recover.
  3. Scaling up Kubernetes API servers: Increased available resources to handle pending requests, allowing us to apply the fix.

By pursuing all three in parallel, we eventually restored enough control to remove the offending service.
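All three workstreams attack the same inequality from the saturation model: get the arrival rate back below capacity so the backlog can drain. A toy sketch of remediation #2, blocking expensive requests (the numbers are illustrative, not from the incident):

```python
# Toy sketch of load shedding as remediation: dropping part of the
# incoming load lets a saturated server drain its backlog. If arrivals
# still exceed capacity, it never recovers.

def drain_time(backlog, arrival_rate, capacity):
    """Ticks until the backlog reaches zero, or None if it never will."""
    if arrival_rate >= capacity:
        return None                  # still saturated: never recovers
    ticks = 0
    while backlog > 0:
        backlog = max(0, backlog + arrival_rate - capacity)
        ticks += 1
    return ticks

# Saturated: the backlog only grows.
print(drain_time(backlog=1_000, arrival_rate=120, capacity=100))  # None

# Block the expensive admin-API traffic (arrivals drop to 40/tick):
print(drain_time(backlog=1_000, arrival_rate=40, capacity=100))   # 17
```

Scaling down the clusters (workstream 1) lowers `arrival_rate`; scaling up the API servers (workstream 3) raises `capacity`. Different levers, same inequality.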

Their interventions were successful, but it’s easy to imagine scenarios where one of these interventions accidentally made things even worse. As Richard Cook noted: all practitioner actions are gambles. Incidents always involve uncertainty in the moment, and it’s easy to overlook this when we look back with perfect knowledge of how the events unfolded.

A change intended to improve reliability

As part of a push to improve reliability across the organization, we’ve been working to improve our cluster-wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability. This is another data point for my conjecture on why reliable systems fail.
