Thoughts on the Bluesky public incident write-up

Back on April 4, the social media site Bluesky suffered a pretty big outage. I was delighted to discover that one of their engineers, Jim Calabro, published a public writeup about it: April 2026 Outage Post-Mortem.

Calabro’s post goes into a lot of technical details about the failure mode. I’m using this post as a learning exercise for myself. I find that if I have to explain something, then I’ll understand it better. After reading his post and writing this one, I learned things about ephemeral ports, goroutine groups, the TCP state machine, the interaction between blocking system calls and the creation of threads in the Go runtime, and the range of loopback addresses on Linux.

Interpreting the error message

The first thing that struck me in Calabro’s write-up was his discussion of a particular error message he saw in the logs:

dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use

Now, if I were the one who saw the error message “bind: address already in use”, I would have assumed that a process was trying to listen on a port that another process was already listening on. This sort of thing is server-side behavior, where a server listens on a port (e.g., web servers listen on port 80 and port 443). In the connect attempt associated with the log, the server is listening on port 11211 (the standard port used by memcached). As it says on the Linux bind man page:

 EADDRINUSE
The given address is already in use.

But that wasn’t the problem in this case! It wasn’t an issue with a server trying and failing to listen on port 11211. Instead, the problem is that the client, which is trying to make a connection to the memcached service, is failing to associate a socket with a port. The system call that’s failing is not listen but (as indicated in the error message) bind. That bind man page actually has two different entries for the address already in use error. Here’s the second one:

EADDRINUSE
(Internet domain sockets) The port number was specified as
zero in the socket address structure, but, upon attempting
to bind to an ephemeral port, it was determined that all
port numbers in the ephemeral port range are currently in
use. See the discussion of
/proc/sys/net/ipv4/ip_local_port_range in ip(7).

I assume that Go’s net.Dial function ultimately calls this private dial function, which will call bind if the caller explicitly specifies the local address. In the log message above, the local address was 127.32.0.1:0.

This code was failing because there were no available ephemeral ports left!

I bring this up because Calabro simply mentions as an aside how he (correctly!) interpreted the error message. He just shows the error, and then writes (emphasis mine):

The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we’re exhausting ports, that’s a huge problem.

That’s expertise in action!

Saturation, part 1: ephemeral ports

The failure mode that Bluesky encountered is a classic example of saturation, where the system runs out of a critical resource. Calabro’s write-up covers two different time periods: a paging alert on Saturday, April 4, and then the Bluesky outage that happened two days later, on Monday, April 6. There were different flavors of saturation on the different days; here we’ll talk about the first one.

On Saturday, the limited resource in question was the number of available ephemeral ports. From a programming perspective, when we make calls to servers, we don’t think about the fact that our side of a TCP connection gets assigned a port, because this TCP detail is effectively abstracted away from the developer.

I’m running on macOS, but if I launch an Ubuntu Docker container, I can see that the ephemeral port range goes from 32768 to 60999, for a count of 28,232 available ephemeral ports:

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 60999

The irony here is that the connections that exhausted the ephemeral ports were to a process that’s running on the same host: memcached listening on 127.0.0.1:11211.

Calabro goes into considerable detail about how the service they refer to as the data plane ran out of ephemeral ports. I’ll describe my understanding based on his write-up. But, as always, I recommend you read the original.

The data plane service talks to a database that is fronted by memcached. This incident only involved interactions between data plane and memcached, so I don’t show the database in the diagram below.

How the data plane service ran out of ephemeral ports

Bluesky recently brought up a new internal service. One of the things this service does is make the GetPostRecord RPC call against the data plane service. The problem isn’t with the rate of traffic. In fact, the volume of traffic that this internal service sends to data plane is low, less than 3 RPS.

No, the problem here is the size of the GetPostRecord payload. It sends a batch of URIs in each call, and sometimes those batches are very large, on the order of 15-20 thousand URIs.

The data plane looks up each URI in memcache first before hitting the database. The data plane is written in Go, and for each request, it starts a new goroutine, and each of those goroutines creates a new TCP connection to memcache. All of those goroutines concurrently making those TCP connections depleted the set of available ephemeral ports.

One thing I learned from this write-up is that Go has a notion of goroutine groups: you can explicitly set a limit on the number of goroutines that are active within a given group. Tragically, this was the one data plane endpoint that was missing an explicit limit.

The connection pool

In the write-up, Calabro notes that the memcached client uses a connection pool, with a maximum idle size of 1000 connections. I was initially confused by this, because I’m used to connection pools where the pool defines the maximum number of simultaneous active connections, and if no unused connections are available, then the client blocks waiting for a connection to be available.

I looked into this, and assuming that this app is using the gomemcache library, that’s not how its connection pool works. Instead, the gomemcache code first looks to see if there’s an available connection. If not, it creates a new connection. So, the connection pool here doesn’t bound connections, but rather is an optimization to reuse an existing connection if one is available.

Instead, what you specify with gomemcache is the maximum number of idle connections, which is the maximum number of connections that the pool will hold onto after use. As mentioned above, Bluesky had this configured as 1,000. This means that if there are 15,000 new connections requested concurrently, at best 1,000 connections will be reused from the pool, requiring 14,000 new connections to be established.

Bitten by time lags – TIME_WAIT

Time lags are an underrated factor in incidents, and time lag plays a role here. In this case, the time lag is due to a state in the lifetime of a TCP socket called TIME_WAIT. This state renders a port unusable for a fixed period of time after a connection associated with the port has been closed.

Personally, I first encountered TIME_WAIT back when I was working on a web app on my laptop. Sometimes I’d kill the process and restart it, and the restart would fail with the error that the port it was trying to listen on was already in use. It turns out that the operating system does not immediately release the ports associated with a socket after it’s closed. Instead, the connection transitions to the TIME_WAIT state.

Here’s an explanation for why TIME_WAIT exists, based largely on the excellent article: TIME_WAIT and its design implications for protocols and scalable client server systems from ServerFramework.com.

The dropped ACK problem: sending an error when nothing is wrong

Closing a TCP connection requires each side to send a FIN, and each side to ACK the FIN it receives. As each side sends or receives one of these packets, it transitions through the TCP state machine. Here’s what the exchange looks like. I’ve annotated the TCP states on the server side and the client side.

What state should the client be in after receiving the FIN?

It looks like the client should also be in the CLOSED state after it receives the FIN. However, that creates a problem if the ACK it sends never makes it, because the server will eventually retry sending the FIN.

Here the client has received a packet associated with a TCP connection that has transitioned to the CLOSED state. The client will treat this as an error, and will send an RST packet (if you’ve ever seen the message: connection reset by peer, you’ve been on the receiving end of an RST packet).

To prevent this, after sending an ACK in the FIN_WAIT_2 state, the client transitions into the TIME_WAIT state. From RFC 9293:

When a connection is closed actively, it MUST linger in the TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime)

The RFC doesn’t define what the maximum segment lifetime is. On Linux, the kernel waits in the TIME_WAIT state for about 60 seconds.

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
* state, about 60 seconds */

This means that the state of the TCP connection will be in the TIME_WAIT state for about a minute before transitioning to CLOSED:

The out of order problem: packet associated with wrong connection

TIME_WAIT also deals with a problem related to packets being received out of order.

Note that a TCP connection’s identity is determined by the four-tuple: (source IP, source port, destination IP, destination port). Here’s an example of such a four-tuple: (127.32.0.1, 32768, 127.0.0.1, 11211).

Because TCP packets can arrive out of order, there might still be packets in-flight associated with that connection. If a new TCP connection with the same four-tuple is opened, the receiver will incorrectly associate the packet with the new connection, even though it was part of the old one, as depicted below (here I’m simplifying the connect and close to a single packet rather than using three packets).

The blue “send” packet is incorrectly associated with the green TCP connection.

TIME_WAIT also prevents this: the client remains in the TIME_WAIT state long enough to guarantee that any leftover packet from the old connection is received before a new connection can be opened on the same port.

Eating up the ephemeral port space

Because you have to wait about a minute before you can reuse an ephemeral port, TIME_WAIT reduces the number of available ephemeral ports.

Returning to the Bluesky scenario, imagine that the memcached connection pool is fully populated (there are 1000 idle connections ready to be used), and the rest of the ephemeral ports are free. I’ll depict the space of 28,232 ephemeral ports as a rectangle, with the green rectangle indicating the connection pool.

Next, a wave of 15K connections are created. This takes all 1000 of the idle connections, and has to make 14K new connections.

The maximum idle connections is set to 1000, so 1000 of the active connections get returned to the pool. The rest of the connections are closed, and eventually enter the TIME_WAIT state:

Now, another wave of connection requests comes in. Because the ephemeral ports are in use by TCP connections in the TIME_WAIT state, they’re unavailable:

Once again, 1000 connections get returned to the pool, and the rest enter TIME_WAIT.

You can see how the ephemeral ports could be consumed if large numbers of connection requests came in one after another before the TIME_WAIT timer elapsed.

Saturation, part 2: memory

While Bluesky observed the problem with ephemeral port exhaustion on Saturday, it wasn’t until Monday that they suffered an outage.

From the write-up, it’s not clear to me what exactly changed on Monday. Perhaps it was just an organic increase in traffic that exacerbated the problem? Whatever it was, the ephemeral port exhaustion contributed to a cascading failure.

According to the write-up, the failure cascade went something like this:

  1. The ephemeral port exhaustion led to error messages when attempting to call memcached.
  2. Every memcached error resulted in a log line being written synchronously to disk.
  3. A large number of goroutines blocked in synchronous system calls led to the Go runtime spawning many OS-level threads (I learned that OS-level threads are called M in Go parlance).
  4. This large number of OS-level threads put memory pressure on the app.
  5. As a result, the data plane experienced stop-the-world GC pauses as well as OOM kills.

Note that because TIME_WAIT is an OS-level state, a data plane process that was OOM killed and restarted would still face limits on the ephemeral port space!

The workaround: leveraging multiple loopbacks

I was impressed by their improvised solution to deal with the problem. I’ve been talking about how an ephemeral port can be consumed, but it’s not actually the port itself that gets consumed. When calling the bind function, you provide not just a port, but the local IP address you want to bind to. It’s the (IP, port) pair that is limited, not the port.

So, if you want to create a TCP connection to a local process (like, say, memcache), and the pair (127.0.0.1, 32768) is already in use, then if there are other loopback IP addresses available, you can use those too!

On Linux, by default, all 127.*.*.* IP addresses are loopback addresses!


# ip route show table local
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
...

(Note that this is different from macOS, which only routes 127.0.0.1 via loopback by default).

This means that you potentially have access to a much larger space of ephemeral ports!

Applying terminology from resilience engineering, ephemeral ports are a resource, and you have to do work to mobilize these additional resources.

For Bluesky, the work of marshaling resources came in the form of modifying the code that made the TCP connections. They modified it to randomly select a loopback IP address. Here’s the code from the blog post:

// Use a custom dialer that picks a random loopback IP for each connection.
// This avoids ephemeral port exhaustion on a single IP when a container
// restarts (TIME_WAIT sockets from the old process block the fixed IP).
memcachedClient.DialContext = func(ctx context.Context, network, address string) (net.Conn, error) {
	ip := net.IPv4(127, byte(1+rand.IntN(254)), byte(rand.IntN(256)), byte(1+rand.IntN(254)))
	d := net.Dialer{LocalAddr: &net.TCPAddr{IP: ip}}
	return d.DialContext(ctx, network, address)
}

Calabro describes the above change as:

The band-aid fix was insane but did the job. 

I wouldn’t describe this as insane, though. This is exactly the kind of improvisational work that you frequently have to do in order to get a system back to healthy during the incident.

Diagnostic challenges

Calabro briefly discusses how difficult it was to diagnose the issue, emphasis mine:

It was all buried in there, but it was hard to know where to look when so much was falling over all at once. You need to have the mental discipline and high granularity in your metrics to be able to cut through the noise to find the real root cause. It’s hard work!

I wish there had been more in this write-up about the process the engineers went through to actually figure out what was going on during the incident, because descriptions of diagnostic work are one of my favorite parts of incident write-ups. We all can stand to do better at improving our diagnostic skills, and one way I try to improve is to read about how someone diagnosed an issue during an incident.

As Calabro mentions, during an incident, there are frequently many things that are failing, and it can be extremely hard to tease out the signals that will help you understand how the system first got into this state.

One particular challenge is noticing an error signal that happens to be unrelated to the ongoing incident, as happened during this incident (emphasis mine):

EDIT: Also, the status page said this was an issue with a 3rd party provider. It was clearly not, apologies for that miscommunication! At the time I posted that status page update, I was looking at some traceroutes that indicated some pretty substantial packet loss from a cloud provider to our data center, but those were not the root cause of the issue.

The Messy 9

I want to end this post by bringing up the Messy 9, a set of patterns proposed by the resilience engineering researcher David Woods. These are:

  1. congestion
  2. cascades
  3. conflicts
  4. saturation
  5. lag
  6. friction
  7. tempos
  8. surprises
  9. tangles

I’ve explicitly discussed cascades, saturation, and lag in this post. I suspect that, if we had more detail about this incident, we’d identify even more of these patterns here. Keep on the look-out for these the next time you read an incident write-up or attend an incident review meeting!

Quick thoughts on GitHub CTO’s post on availability

GitHub’s been taking it on the chin on the availability front lately. Yesterday, their CTO, Vlad Fedorov, wrote a post on their blog about their recent incidents: Addressing GitHub’s recent availability issues. This post shares some additional details about three recent incidents. I’ll list them in order that they are mentioned in the post:

  1. Feb. 9, 2026 – involved an overloaded database cluster
  2. Feb. 2, 2026 – involved security policies unintentionally blocking access to VM metadata
  3. Mar. 5, 2026 – involved writes failing on a Redis cluster

First observation: I really appreciate it when a company addresses availability concerns by providing more public details about recent incidents. I always think more of companies that are willing to provide these sorts of details, and I hope GitHub provides even more details about their outages in the future.

Saturation, again and again and again

The first incident is a classic example of saturation. In this case, it was an important database cluster that got overloaded. Because databases are much harder to scale up than stateless services, your best bet when dealing with overload is to figure out how to reduce the load so the database can become healthy again. On the other hand, reducing load means denying requests: a “healthy” database that is taking zero traffic has 0% availability! So it’s a balancing act, and the responders are constrained by the infrastructure that currently exists for selectively limiting traffic. Once the overload happens, you can only twist the knobs that you already have available.

Fedorov notes they’re now prioritizing implementing mechanisms to protect against these sorts of scenarios where load increases unexpectedly.

Protecting downstream components during spikes to prevent cascading failures while prioritizing critical traffic loads.

Taking it to the limit, and then over it

Fedorov also provided details on how they ended up seeing so much more traffic than usual. They released a new model (I think it’s an AI model) on a Saturday, when traffic is lower. And then, on Monday, multiple different factors contributed to an increase in traffic that pushed them over the limit. The blog post mentions these four contributors:

  • new model release
  • they had reduced a user settings cache TTL from 12 hours to 2 hours, increasing write load
  • they hit their regular peak load on Monday
  • many of their users updated to the new version of their client apps, and this update activity increased read load

They had reduced the TTL so that people would get the new model more quickly, but reducing the TTL means more cache evictions, which means more database load.

This compounding effect of multiple factors is pernicious, because it can be hard to reason about why your system hit a tipping point. From the write-up:

While the TTL change was quickly identified as a culprit, it took much longer to understand why the read load kept increasing, which prolonged the incident.

Understanding the role of multiple, independent contributing factors is hard enough in a post-incident analysis; identifying this in the heat of an incident can be damn near impossible.

The thing about tipping points is that you don’t notice until you tip

This failure mode was a case where the danger was growing over time, but there were no visible symptoms until they hit the limit.

 The architecture was originally selected for simplicity at a time when there were very few models and very few governance controls and policies related to those models. But over time, something that was a few bytes per user grew into kilobytes. We didn’t catch how dangerous that was because the load was visible only during new model or policy rollouts and was masked by the TTL. 

The resilience engineering folks would call this an example of a brittle collapse, where a system falls over when it hits the limit. We do our best to monitor for trouble and anticipate trouble ahead, but we’re always going to hit scenarios like this where signals of a problem are being masked, until the perfect storm hits. At that point, we just have to be good at responding. And, hopefully, good at learning as well.

Failovers are a different mode of operation

Their February 2nd incident involved a failover where they had some sort of infrastructure issue in one(?) region. GitHub has mechanisms for automatically shifting traffic to healthy regions, and that mechanism worked here, but there was another issue that they hit:

However, in this case, there was a cascading set of events triggered by a telemetry gap that caused existing security policies to be applied to key internal storage accounts affecting all regions. This blocked access to VM metadata on VM creates and halted hosted runner lifecycle operations.

It was the combination of the traffic failover and a telemetry gap that ultimately led to the outage. (Did the automatic traffic shift end up making things worse? I can’t tell from the write-up). The traffic redirection didn’t create the incident, but it enabled it to happen. Whenever our system runs in an alternate mode, there’s an increased risk that we’ll hit some weird edge case that we haven’t seen before because it doesn’t regularly run in that mode. Automated reliability mechanisms often put our systems in these alternate modes. This means that they can enable novel failure modes.

In fact, the March 5th incident followed a similar pattern; this time it was a Redis cluster primary failover that enabled the incident.

The failover performed as expected, but a latent configuration issue meant the failover left the cluster in a state with no writable primary.

Reliability vs security, the eternal struggle

The Feb 2nd incident also illustrates the fundamental tradeoff between reliability and security. Reliability’s job is to ensure service access to the users who are supposed to have it. Security’s job is to deny service access to the users that aren’t supposed to have it. These two forces are in tension, as we see in this incident, where a security mechanism denied access.

It’s not just about automation, it’s about more options for responders

In the Feb 9th incident, Fedorov notes how the responders lacked certain functionality that would have helped them mitigate (emphasis mine):

Further, due to the interaction between different services after the database cluster became overwhelmed, we needed to block the extra load further up the stack, and we didn’t have sufficiently granular switches to identify which traffic we needed to block at that level.

He also notes how they had to manually recover from the March 5th incident:

With writes failing and failover not available as a mitigation, we had to correct the state manually to mitigate.

I hope they don’t put all of their eggs in the “automation” basket in their remediations. For the first incident in particular, automated load shedding is tricky to get right: it’s hard to reason about, and you won’t have experience with the behavior of this new automation until either you have the incident, or until the automation actually creates an incident (e.g., opens a circuit breaker when it shouldn’t). Making it easier for the responders to manually control load shedding during an incident is important as well.

More generally, reliability work isn’t just about putting in automated mechanisms to handle known failure modes. It’s also about setting up the incident responders for success by providing them with as many resources as possible before the next incident happens. In this context, resources means the ability to manually control different aspects of the infrastructure, whether that’s selective traffic blocking, manually updating database state, or many of the other potential remediations that a responder might have to do. The more flexibility they have, the more room to maneuver (to use David Woods’s phrase), the easier it will be for them to improvise a solution, and the faster the next surprising incident will be mitigated.

Grow fast and overload things

The general vibe I see online is that the AI companies have not been doing particularly well in the reliability department. Both OpenAI and Anthropic publish reliability statistics on their status pages. Now, I’m not a fan of using the nines as a meaningful indicator of reliability, but since I don’t have access to any other signals about reliability for these two companies, they’ll have to do for the purposes of this blog post.

Here’s a screenshot of OpenAI’s status page:

Here’s a screenshot of Anthropic’s status page:

And these numbers… well, they’re not great. With the exception of Sora, none of the services at either company makes it to 99.9% reliability (three nines). Surprisingly, ChatGPT, at 98.86% uptime, does not even make it to two nines.

I’ve seen speculation that the reason that reliability isn’t great is that this is a high development velocity phenomenon. Here’s Boris Cherny (the guy at Anthropic who wrote Claude Code) pushing back on that hypothesis.

A few days later, during a ChatGPT incident, I saw this post from Nik Pash at OpenAI:

This isn’t move fast and break things, but rather grow fast and overload things. These companies are in the business of providing LLMs, which are a new capability. Users are leveraging LLMs in new and innovative ways. The resilience engineering researcher David Woods uses the term florescence to describe this kind of rapid and widespread uptake.

As a consequence of this florescence, the load on the providers increases unexpectedly and dramatically: they weren’t able to predict the load and have struggled to keep up with it when it happens. These LLM providers are running directly into the problem of saturation (plug: check out my recent post on saturation for the Resilience in Software Foundation).

Now, I expect that these companies will get better at recovering from these unexpected increases in load as they gain experience with the problem. Because of capacity constraints with those pricey GPUs, they can’t always scale their way out of these problems, but they can redistribute resources, and they can get better at load shedding and other sorts of graceful degradation to limit the damage of overload. And I bet that’s where they’re both investing in reliability today. At least, I hope so. Because this problem isn’t going to go away. If anything, I suspect their loads will become even more unpredictable as people continue to innovate with LLMs. Because AIs don’t seem to do any better at predicting the future than humans.

Quick takes on Feb 20 Cloudflare outage

Cloudflare just posted a public write-up of an incident that they experienced on Feb. 20, 2026. While it was large enough for them to write it up like this, it looks like the impact is smaller than the previous Cloudflare incidents I’ve written about here. Given that Cloudflare continues to produce the most detailed public incident write-ups in the industry, I still find them insightful. After all, the insight you get from an incident write-up is not related to the size of the impact! Here are some quick observations from this one.

System intended to improve reliability contributed to incident

The specific piece of configuration that broke was a modification attempting to automate the customer action of removing prefixes from Cloudflare’s BYOIP service, a regular customer request that is done manually today. Removing this manual process was part of our Code Orange: Fail Small work to push all changes toward safe, automated, health-mediated deployment.

Cloudflare has been doing work to improve reliability. In this case, they were working to automate a potentially dangerous manual operation to reduce the risk of making changes. Unfortunately, they got bitten by a previously undiscovered bug in the automation.

How do you pass the flag?

When I first read this write-up, I thought the issue was that they had done a query which was supposed to have a scope, but it was missing a scope, and so returned everything. But that’s not actually what happened.

Accidentally missing a scope for a query, resulting system behavior is "match everything", with disastrous consequences. another entry in a never-ending series. (See also: missing WHERE clause in a SQL query) blog.cloudflare.com/cloudflare-o…

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2026-02-22T01:57:40.994Z

(I’ve seen the accidentally unscoped query failure mode multiple times in my career, but that’s not actually what happened here)

Instead, what happened here was that the client meant to set the pending_delete flag when making a query against an API.

Based on my reading, the server expected something like this:


GET /v1/prefixes?pending_delete=true

Instead, the client did this:

GET /v1/prefixes?pending_delete

The server code looked like:

if v := req.URL.Query().Get("pending_delete"); v != "" {
	// server saw v == "", so this block wasn't executed
	...
	return
}
// this was executed instead!

It sounds like there was a misunderstanding about how to pass the flag, based on this language in the write-up:

One of the issues in this incident is that the pending_delete flag was interpreted as a string, making it difficult for both client and server to rationalize the value of the flag.

This is a vicious logic bug, because what happened was that instead of returning the entries to be deleted, the server returned all of them.

Cleanup, but still in use

Since the list of related objects of BYOIP prefixes can be large, this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them. Unfortunately, this regular cleanup sub-task queried the API with a bug.

This particular failure involved an automated cleanup task, to replace the manual work that a Cloudflare operator previously had to perform to do the dangerous step of removing published IP prefixes. In this case, due to a logic error, active prefixes were deleted.

Here, there was a business requirement to do the cleanup: it was to fulfill a customer’s request to remove a prefix. More generally, cleanup itself is always an inherently dangerous process. It’s one of the reasons that code bases can end up such crufty places over time: we might be pretty sure that a particular bit of code, config, or data is no longer in use. But are we 100% sure? Sure enough to take the risk of deleting it? The incentives generally push people towards a Chesterton’s Fence-y approach of “eh, safer to just leave it there”. The problem is that not cleaning up is also risky.

Reliability work in-flight

As a part of Code Orange: Fail Small, we are building a system where operational state snapshots can be safely rolled out through health-mediated deployments. In the event something does roll out that causes unexpected behavior, it can be very quickly rolled back to a known-good state. However, that system is not in Production today.

Recovery took longer than they would have liked here: full resolution of all of the IP prefixes took about six hours. Cloudflare already had work in progress to remediate problems like this more quickly! But it wasn’t ready yet. Argh!

Alas, this is unavoidable. Even when we are explicitly aware of risks, and we are working actively to address those risks, the work always takes time, and there’s nothing we can do but accept the fact that the risk will be present until our solution is ready.

People adapt to bring the system back to healthy

Affected BYOIP prefixes were not all impacted in the same way, necessitating more intensive data recovery steps… a global configuration update had to be initiated to reapply the service bindings for [a subset of customers that also had service bindings removed] to every single machine on Cloudflare’s edge.

The failure modes were different for different customers. In some cases, customers were able to take action themselves to remediate the issue through the Cloudflare dashboard. There were also more complex cases where Cloudflare engineers had to take action to restore service.

The write-up focuses primarily on the details of the failure mode. It sounds like the engineers had to do some significant work in the moment (intensive data recovery steps) to recover the tougher cases. This is where resilience really comes into play. The write-up hints at the nature of this work (reapply service bindings… to every single machine on Cloudflare’s edge). Was there pre-existing tooling to do this? Or did they have to improvise a solution? This is the most interesting part to me, and I’d love to know more about this work.

Lots of AI SRE, no AI incident management

With the value of AI coding tools now firmly established in the software industry, the next frontier is AI SRE tools. There are a number of AI SRE vendors. In some cases, vendors are adding AI SRE functionality to extend their existing product lineup: a quick online search reveals examples such as PagerDuty’s SRE Agents, Datadog’s Bits AI SRE, incident.io’s AI SRE, Microsoft’s Azure SRE Agent, and Rootly’s AI SRE. There are also a number of pure-play AI SRE startups: the ones I’ve heard of are Cleric, Resolve.ai, Anyshift.io, and RunWhen. My sense of the industry is that AI SRE is currently in the evaluation phase, compared to the coding tools, which are in the adoption phase.

What I want to write about today is not so much what these AI tools do contribute to resolving incidents, but rather what they don’t contribute. These tools are focused on diagnostic and mitigation work. The idea is to try to automate as much as possible the work of figuring out what the current problem is, and then resolving it. I think most of the focus is, rightly, on the diagnostic side at this stage, although I’m sure automated resolution is also something being pursued. But what none of these tools try to do, as far as I can tell, is incident management.

The work of incident response always involves a group of engineers: some of them are officially on-call, and others are just jumping in to help. Incident management is the coordination work that helps this ad-hoc team of responders work together effectively to get the diagnostic and remediation work done. Because of this, we often say that incident response is a team sport. Incidents involve some sort of problem with the system as a whole, and because everybody in the organization only has partial knowledge of the whole system, we typically need to pool that knowledge together to make sense of what’s actually happening right now in the system. For example, if a database is currently being overloaded, the folks who own the database could tell you that there’s been a change in query pattern, but they wouldn’t be able to tell you why that change happened. For that, you’d need to talk to the team that owns the system that makes those queries.

Fixation: the single-agent problem

Down the rabbit hole. Source: Sincerely Media

Another reason why we need multiple people responding to incidents is that humans are prone to a problem known as fixation. You might know it by the more colloquial term tunnel vision. A person will look at a problem from a particular perspective, and that can be problematic if the person addressing the problem has a perspective that is not well-matched to solving that problem. You can even see fixation behavior in the current crop of LLM coding tools: they will sometimes keep going down an unproductive path in order to implement a feature or try to resolve an error. While I expect that future coding agents will suffer less from fixation, given that genuinely intelligent humans frequently suffer from this problem, I don’t think that we’ll ever see an individual coding agent get to the point where it completely avoids fixation traps.

One solution to the problem of fixation is to intentionally inject a diversity of perspectives by having multiple individuals attack the problem. In the case of AI coding tools, we deal with the problem of fixation by having a human supervise the work of the coding agent. The human spots when the agent falls down a fixation rabbit hole, and prompts the agent to pursue a different strategy in order to get it back on track. Another way to leverage multiple individuals is to strategically have them pursue different strategies. For example, in the early oughts, there was a lot of empirical software engineering research into an approach called perspective-based reading for reviewing software artifacts like requirements or design documents. The idea is that you would have multiple reviewers, and you would explicitly assign each reviewer a particular perspective. For example, let’s say you wanted to get a requirements document reviewed. You could have one reviewer read it from the perspective of a user, another from the perspective of a designer, and a third from the perspective of a tester. The idea here is that reading from a different perspective would help identify different kinds of defects in the artifact.

Getting back to incidents, the problem of fixation arises when a responder latches on to one particular hypothesis about what’s wrong with the system, and continues following that particular line of investigation, even though it doesn’t bear fruit. As discussed above, having responders with a diverse set of perspectives provides a defense against fixation. This may take the form of pursuing multiple lines of investigation, or even just somebody in the response asking a question like, “How do we know the problem isn’t Y rather than X?”

I’m convinced that an individual AI SRE agent will never be able to escape the problem of fixation, and so incident response will necessarily involve multiple agents. Yes, there will be some incidents where a single AI agent is sufficient. But incident response is a 100% game: you need to recover from all of them. That means that eventually you’ll need to deploy a team of agents, whether they’re humans, AIs, or a mix. And that means incident response will require coordination: in particular, maintaining common ground.

Maintaining common ground is active work

During an incident, many different things are happening at once. There are multiple signals that you need to keep track of, like “what’s the current customer impact?”, “is the problem getting better, worse, or staying the same?”, “what are the current hypotheses?”, “which graphs support or contradict those hypotheses?” The responders will be doing diagnostic work, and they’ll be performing interventions to the system, sometimes to try to mitigate (e.g., “roll back that feature flag that aligns in time”), and other times to support the diagnostic work (e.g., “we need to make a change to figure out if hypothesis X is actually correct.”)

The incident manager helps to maintain common ground: they make sure that everybody is on the same page, by doing things like helping bring people up to speed on what’s currently going on, and ensuring people know which lines of investigation are currently being pursued and who (if anyone) is currently pursuing them.

If a responder is just joining an incident, an AI SRE agent is extremely useful as a summary machine. You can ask it the question, “what’s going on?”, and it can give you a concise summary of the state of play. But this is a passive use case: you prompt it, and it gives a response. And because the state of the world is changing rapidly during the incident, the accuracy of that answer will decay rapidly with time. Keeping the current state of things up to date in the minds of the responders is an active struggle against entropy.

An effective AI incident manager would have to be able to identify what type of coordination help people need, and then provide that assistance. For example, the agent would have to be able to identify when the responders (be they human or agent) were struggling and then proactively take action to assist. It would need a model of the mental models of the responders to know when to act and what action to take in order to re-establish common ground.

Perhaps there is work in the AI SRE space to automate this sort of coordination work. But if there is, I haven’t heard of it yet. The focus today is on creating individual responder agents. I think these agents will be an effective addition to an incident response team. I’d love it if somebody built an effective incident management AI bot. But it’s a big leap from AI SRE agent to AI incident management agent. And it’s not clear to me how well the coordination problem is understood by vendors today.

Telling the wrong story

In last Sunday’s New York Times Book Review, there was an essay by Jennifer Szalai titled Hannah Arendt Is Not Your Icon. I was vaguely aware of Arendt as a public intellectual of the mid-twentieth century, someone who was both philosopher and journalist. The only thing I really knew about her was that she had witnessed the trial of the Nazi official Adolf Eichmann and written a book on it, Eichmann in Jerusalem, subtitled a report on the banality of evil. Eichmann, it turned out, was not a fire-breathing monster, but a bloodless bureaucrat. He was dispassionately doing logistics work; it just so happened that his work was orchestrating the extermination of millions.

Until now, when I’d heard any reference to Arendt’s banality of evil, it had been as a notable discovery that Arendt had made as witness to the trial. And so I was surprised to read in Szalai’s essay how controversial Arendt’s ideas were when she originally published them. As Szalai noted:

The Anti-Defamation League urged rabbis to denounce her from the pulpit. “Self-Hating Jewess Writes Pro-Eichmann Book” read a headline in the Intermountain Jewish News. In France, Le Nouvel Observateur published excerpts from the book and subsequently printed letters from outraged readers in a column asking, “Hannah Arendt: Est-elle nazie?”

Hannah Arendt, it turns out, had told the wrong story.

We all carry in our minds models of how the world works. We use these mental models to make sense of events that happen in the world. One of the tools we have for making sense of the world is storytelling; it’s through stories that we put events into a context that we can understand.

When we hear an effective story, we will make updates to our mental models based on its contents. But something different happens when we hear a story that is too much at odds with our worldview: we reject the story, declaring it to be obviously false. In Arendt’s case, her portrayal of Eichmann was too much of a contradiction against prevailing beliefs about the type of people who could carry out something like the Holocaust.

You can see a similar phenomenon playing out with Michael Lewis’s book Going Infinite, about the convicted crypto fraudster Sam Bankman-Fried. The reception to Lewis’s book has generally been negative, and he has been criticized for being too close to Bankman-Fried to write a clear-eyed book about him. But I think something else is at play here. I think Lewis told the wrong story.

It’s useful to compare Lewis’s book with two other recent ones about Silicon Valley executives: John Carreyrou’s Bad Blood and Sarah Wynn-Williams’s Careless People. Both books focus on the immorality of Silicon Valley executives (Elizabeth Holmes of Theranos in the first book; Mark Zuckerberg, Sheryl Sandberg, and Joel Kaplan of Facebook in the second). These are tales of ambition, hubris, and utter indifference to the human suffering left in their wake. Now, you could tell a similar story about Bankman-Fried. In fact, this is what Zeke Faux did in his book Number Go Up. But that’s not the story that Lewis told. Instead, Lewis told a very different kind of story. His book is more of a character study of a person with an extremely idiosyncratic view of risk. The story Lewis told about Bankman-Fried wasn’t the story that people wanted to hear. They wanted another Bad Blood, and that’s not the book he ended up writing. As a consequence, he told the wrong story.

Telling the wrong story is a particular risk when it comes to explaining a public large-scale incident. We’re inclined to believe that a big incident can only happen because of a big screw-up: that somebody must have done something wrong for that incident to happen. If, on the other hand, you tell a story about how the incident happened despite nobody doing anything wrong, then you are in essence telling an unbelievable story. And, by definition, people don’t believe unbelievable stories.

One example of such an incident story is the book Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq by Scott Snook. Here are some quotes from the Princeton University Press site for that book (emphasis mine).

On April 14, 1994, two U.S. Air Force F-15 fighters accidentally shot down two U.S. Army Black Hawk Helicopters over Northern Iraq, killing all twenty-six peacekeepers onboard. In response to this disaster the complete array of military and civilian investigative and judicial procedures ran their course. After almost two years of investigation with virtually unlimited resources, no culprit emerged, no bad guy showed himself, no smoking gun was found. This book attempts to make sense of this tragedy—a tragedy that on its surface makes no sense at all.

His conclusion is disturbing. This accident happened because, or perhaps in spite of everyone behaving just the way we would expect them to behave, just the way theory would predict. The shootdown was a normal accident in a highly reliable organization.

Snook also told the wrong story, one that subverts our usual sensemaking processes rather than supporting them: the accident makes no sense at all.

This is why I think it’s almost impossible to do an effective incident investigation for a public large-scale incident. The risk of telling the wrong story is simply too high.

Verizon outage report predictions

Yesterday, Verizon experienced a major outage. The company hasn’t released any details about how the outage happened yet, so there are no quick takes to be had. And I have no personal experience in the telecom industry, and I’m not a network engineer, so I can’t even make any as-an-expert commentary, because I’m not an expert. But I still thought it would be fun to make predictions about what the public write-up will reveal. I can promise that all of these predictions are free of hindsight bias!

Maintenance

My prediction: post-incident investigation of today’s Verizon outage will reveal planned maintenance as one of the contributing factors. Note: I have no actual knowledge of what happened today. This prediction is just to keep me intellectually honest.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2026-01-14T21:26:53.396Z

On Bluesky, I predicted this incident involved planned maintenance, because the last four major telecom outages I read about all involved planned maintenance. The one foremost on my mind was the Optus emergency services outage that happened back in September in Australia, where the engineers were doing software upgrades on firewalls:

Work was being done to install a firewall upgrade at the Regency Park exchange in SA.
There is nothing unusual about such upgrades in a network and this was part of a planned
program, spread over six months, to upgrade eighteen firewalls. At the time this specific
project started, fifteen of the eighteen upgrades had been successfully completed. – Independent Report – The Triple Zero Outage at Optus: 18 September 2025.

The one before that was the Rogers internet outage that happened in Canada back in July 2022.

In the weeks leading to the day of the outage on 8 July 2022, Rogers was executing on a seven-phase process to upgrade its IP core network. The outage occurred during the sixth phase of this upgrade process. – Assessment of Rogers Networks for Resiliency and Reliability Following the 8 July 2022 Outage – Executive Summary

There was also a major AT&T outage in 2024. From the FCC report:

On Thursday, February 22, 2024, at 2:42 AM, an AT&T Mobility employee placed a new network element into its production network during a routine night maintenance window in order to expand network functionality and capacity. The network element was misconfigured. – February 22, 2024 AT&T Mobility Network Outage REPORT AND FINDINGS

Verizon also suffered a network outage back on September 30, 2024. Although the FCC acknowledged the outage, I couldn’t find any information from either Verizon or the FCC about the incident. The only information I was able to find about that outage comes from, of all places, a Reddit post. And it also mentions… planned maintenance!

So, we’re four for four on planned maintenance being in the mix.

I’m very happy that I did not pursue a career in network engineering: the blast radius of networking changes can be very large, by the very nature of networks. It’s the ultimate example of “nobody notices your work when you do it well; they only become aware of your existence when something goes wrong.” And, boy, can stuff go wrong!

To me, networking is one of those “I can’t believe we don’t have even more outages” domains. Because, while I don’t work in this domain, I’m pretty confident that planned maintenance happens all of the time.

Saturation

The Rogers and AT&T outages involved saturation. From the Rogers executive summary (emphasis added), which I quoted in my original blog post:

Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration. When the core network routers crashed, user traffic could no longer be routed to the appropriate destination. Consequently, services such as mobile, home phone, Internet, business wireline connectivity, and 9-1-1 calling ceased functioning.

From the FCC report on the AT&T outage:

Restoring service to commercial and residential users took several more hours as AT&T Mobility continued to observe congestion as high volumes of AT&T Mobility user devices attempted to register on the AT&T Mobility network. This forced some devices to revert back to SOS mode. For the next several hours, AT&T Mobility engineers engaged in additional actions, such as turning off access to congested systems and performing reboots to mitigate registration delays.

Saturation is such a common failure pattern in large-scale complex systems. We see it again and again, so often that I’m more surprised when it doesn’t show up. It might be that saturation contributed to a failure cascade, or that saturation made it more difficult to recover, but I’m predicting it’s in there somewhere.

“Somebody screwed up”

Here’s my pinned Bluesky post:

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-07-19T19:17:47.843Z

I’m going to predict that this incident will be attributed to engineers who didn’t comply with the documented procedure for making the change, the classic “root cause: human error” kind of stuff.

I was very critical of the Optus outage independent report for language like this:

These mistakes can only be explained by a lack of care about a critical service and a lack of disciplined adherence to procedure. Processes and controls were in place, but the correct process was not followed and actions to implement the controls were not done or not done properly.

The FCC report on the AT&T outage also makes reference to not following procedure (emphasis mine):

The Bureau finds that the extensive scope and duration of this outage was the result of
several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility’s internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied.

The Rogers independent report, to its credit, does not blame the operators for the outage. So I’m generalizing from only two data points for this prediction. I will be very happy if I’m wrong.

“Poor risk management”

This one isn’t a prediction, just an observation of a common element of two of the reports: criticizing the risk assessment of the change that triggered the incident. Here’s the Optus report (emphasis in the original):

Risk was classified as ‘no impact’, meaning that there was to be no impact on network traffic, and the firewall upgrade was classified as urgent. This was the fifth mistake.

Similarly, the Rogers outage independent report blames the engineers for misclassifying the risk of the change:

Rogers classified the overall process – of which the policy filter configuration is only one of many parts – as “high risk”. However, as some earlier parts of the process were completed successfully, the risk level was reduced to “low”. This is an oversight in risk management as it took no consideration of the high-risk associated with BGP policy changes that had been implemented at the edge and affected the core.

Rogers had assessed the risk for the initial change of this seven-phased process as “High”. Subsequent changes in the series were listed as “Medium.” [redacted] was “Low” risk based on the Rogers algorithm that weighs prior success into the risk assessment value. Thus, the risk value for [redacted] was reduced to “Low” based on successful completion of prior changes.

The risk assessment rated as “Low” is not aligned with industry best practices for routing protocol configuration changes, especially when it is related to BGP routes distribution into the OSPF protocol in the IP core network. Such a configuration change should be considered as high risk and tested in the laboratory before deployment in the production network.

Unfortunately, it’s a lot easier to state “you clearly misjudged the risk!” than to ask “how did it make sense in the moment to assess the risk as low?”, and, hence, we learn nothing about how those judgments came to be.


I’m anxiously waiting to hear any more details about what happened. However, given that neither Verizon nor the FCC released any public information from the last outage, I’m not getting my hopes up.

The dangers of SSL certificates

Yesterday, the Bazel team at Google did not have a very Merry Boxing Day. An SSL certificate expired for https://bcr.bazel.build and https://releases.bazel.build, as shown in this screenshot from the GitHub issue.

This expired certificate apparently broke the build workflows of Bazel users, who were faced with the following error message:

ERROR: Error computing the main repository mapping: Error accessing registry https://bcr.bazel.build/: Failed to fetch registry file https://bcr.bazel.build/modules/platforms/0.0.7/MODULE.bazel: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed

After mitigation, Xùdōng Yáng provided a brief summary of the incident on the GitHub ticket.

Say the words “expired SSL certificate” to any senior software engineer and watch the expression on their face. Everybody in this industry has been bitten by expired certs, including people who work at orgs that use automated certificate renewal. In fact, this very case is an example of an automated certificate renewal system that failed! From the screenshot above:

it was an auto-renewal being bricked due to some new subdomain additions, and the renewal failures didn’t send notifications for whatever reason.

The reality is that SSL certificates are a fundamentally dangerous technology, and the Bazel case is a great example of why. With SSL certificates, you usually don’t have the opportunity to build up operational experience working with them, unless something goes wrong. And things don’t go wrong that often with certificates, especially if you’re using automated cert renewal! That means when something does go wrong, you’re effectively starting from scratch to figure out how to fix it, which is not a good place to be. Once again, from that summary:

And then it took some Bazel team members who were very unfamiliar with this whole area to scramble to read documentation and secure permissions…

Now, I don’t know the specifics of the Bazel team composition: it may very well be that they have local SSL certificate expertise on the team, but those members were out-of-office because of the holiday. But even if that’s the case, with an automated set-it-and-forget-it solution, the knowledge isn’t going to spread across the team, because why would it? It just works on its own.

That is, until it stops working. And that’s the other dangerous thing about SSL certificates: the failure mode is the opposite of graceful degradation. It’s not like there’s an increasing percentage of requests that fail as you get closer to the deadline. Instead, in one minute, everything’s working just fine, and in the next minute, every HTTPS request fails. There’s no natural signal back to the operators that the SSL certificate is getting close to expiry. To make things worse, there’s no staging of the change that triggers the expiration, because the change is time, and time marches on for everyone. You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.
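One way to turn expiry into a gradual signal is to monitor it yourself. Here’s a minimal sketch of the arithmetic, using Python’s standard library; the certificate date is made up, and in practice you’d pull the notAfter field from ssl.getpeercert() against the live endpoint:

```python
from datetime import datetime, timezone

# Sketch of an expiry monitor: compute days remaining from a cert's
# notAfter timestamp (the format ssl.getpeercert() uses) so operators
# get a warning well before the hard deadline. Dates here are made up.
def days_until_expiry(not_after: str, now: datetime) -> int:
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expiry.replace(tzinfo=timezone.utc) - now).days

now = datetime(2025, 12, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Dec 26 12:00:00 2025 GMT", now)
print(remaining)  # 25
if remaining < 30:
    print("WARN: certificate expires soon; renew now")
```

Running a check like this on a schedule, and alerting on the result, restores the gradual early-warning signal that certificate expiry otherwise lacks.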

In other words, SSL certs are a technology with an expected failure mode (expiration) that absolutely maximizes blast radius (a hard failure for 100% of users), without any natural feedback to operators that the system is at imminent risk of critical failure. And with automated cert renewal, you are increasing the likelihood that the responders will not have experience with renewing certificates.

Is it any wonder that these keep biting us?

Saturation: Waymo edition

If you’ve been to San Francisco recently, you will almost certainly have noticed the Waymo robotaxis: these are driverless cars that you can hail with an app the way that you can with Uber. This past Sunday, San Francisco experienced a pretty significant power outage. One unexpected consequence of this power outage was that the Waymo robotaxis got stuck.

Today, Waymo put up a blog post about what happened, called Autonomously navigating the real world: lessons from the PG&E outage. Waymos are supposed to treat intersections with traffic lights out as four-way stops, the same way that humans do. So, what happened here? From the post (emphasis added):

While the Waymo Driver is designed to handle dark traffic signals as four-way stops, it may occasionally request a confirmation check to ensure it makes the safest choice. While we successfully traversed more than 7,000 dark signals on Saturday, the outage created a concentrated spike in these requests. This created a backlog that, in some cases, led to response delays contributing to congestion on already-overwhelmed streets.

The post doesn’t go into detail about what a confirmation check is. My interpretation based on the context is that it’s a put-a-human-in-the-loop thing, where a remote human teleoperator checks to see if it’s safe to proceed. It sounds like the workload on the human operators was just too high to process all of these confirmation checks in a timely manner. You can’t just ask your cloud provider for more human operators the way you can request more compute resources.

The failure mode that Waymo encountered is a classic example of saturation, which is a topic I’ve written about multiple times in this blog. Saturation happens when the system is not able to keep up with the load that is placed upon it. Because all systems have finite resources, saturation is an ever-present risk. And because saturation only happens under elevated load, it’s easy to miss this risk. There are many different things in your system that can run out of resources, and it can be hard to imagine the scenarios that can lead to exhaustion for each of them.
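The dynamic is easy to see in a toy queue model (my own sketch, with invented numbers, not Waymo’s): as long as requests arrive faster than the operators can service them, the backlog grows without bound, no matter how small the gap.

```python
# Toy backlog model: requests arrive at arrival_rate per minute, and a
# fixed pool of operators can service service_rate per minute.
def backlog_after(arrival_rate: float, service_rate: float, minutes: int) -> float:
    backlog = 0.0
    for _ in range(minutes):
        backlog = max(0.0, backlog + arrival_rate - service_rate)
    return backlog

# Normal load: capacity comfortably exceeds demand, so no backlog forms.
print(backlog_after(5, 8, 60))   # 0.0
# Outage spike: demand exceeds capacity, so the backlog grows linearly,
# and every request waits longer and longer.
print(backlog_after(20, 8, 60))  # 720.0
```

The scary part is the asymmetry: at 7 arrivals per minute against a capacity of 8, everything looks healthy, but a spike past 8 flips the system into a regime where the only way out is to shed load or add capacity.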

Here’s another quote from the post. Once again, emphasis mine.

We established these confirmation protocols out of an abundance of caution during our early deployment, and we are now refining them to match our current scale. While this strategy was effective during smaller outages, we are now implementing fleet-wide updates that provide the Driver with specific power outage context, allowing it to navigate more decisively.

This confirmation-check behavior was explicitly implemented in order to increase safety! It’s yet another reminder of how work to increase safety can lead to novel, unanticipated failure modes. Strange things are going to happen, especially at scale.

Another way to rate incidents

Every organization I’m aware of that does incident management has some sort of severity rating system. The highest severity is often referred to as either a SEV1 or a SEV0 depending on the organization. (As is our wont, we software types love arguing about whether indexing should begin at zero or at one).

Severity can be a useful shorthand for communicating to an organization during incident response, although it’s a stickier concept than most people realize (for example, see Em Ruppe’s SRECon ’24 talk What Is Incident Severity, but a Lie Agreed Upon? and Dan Slimmon’s blog post Incident SEV scales are a waste of time). However, after the incident has been resolved, severity serves a different purpose: the higher the severity, the more attention the incident will get in the post-incident activities. I was reminded of this by John Allspaw’s Fix-mas Pep Talk, which is part of Uptime Labs’s Fix-mas Countdown, a sort of Advent of Incidents. In the short video, John argues for the value in spending time analyzing lower severity incidents, instead of only analyzing the higher severity ones.

Even if you think John’s idea a good one (and I do!), lower severity incidents happen more often than higher severity ones, and you probably don’t have the resources to analyze every single lower severity incident that comes along. And that got me thinking: what if, in addition to rating an incident by severity, we also gave each incident a separate rating on its learning potential? This would be a judgment on how much insight we think we would get if we did a post-incident analysis, which will help us decide whether we should spend the time actually doing that investigation.

Now, there’s a paradox here, because we have to make this call before we’ve done an actual post-incident investigation, which means we don’t yet know what we’ll learn! And, so often, what appears on the surface to be an uninteresting incident is actually much more complex once we start delving into the details.

However, we all have a finite number of cycles. And so, like it or not, we always have to make a judgment about which incidents we’re going to spend our engineering resources on analyzing. The reason I like the idea of having a learning potential assessment is that it forces us to put initial investment into looking for those interesting threads that we could pull on. And it also makes explicit that severity and learning potential are two different concerns. And, as software engineers, we know that separation of concerns is a good idea!
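To make the idea concrete, here’s a minimal sketch of what triaging on two independent axes could look like (the incident IDs, scales, and field names are my own invention):

```python
# Sketch: rate incidents on two independent axes. Severity drives the
# response and communication; learning potential drives which
# post-incident analyses to invest in. All values are invented.
incidents = [
    {"id": "INC-101", "severity": 1, "learning_potential": 2},  # big but routine
    {"id": "INC-102", "severity": 4, "learning_potential": 5},  # minor but surprising
    {"id": "INC-103", "severity": 2, "learning_potential": 4},
]

# Choose what to analyze by learning potential, not by severity.
to_analyze = sorted(incidents, key=lambda i: i["learning_potential"], reverse=True)[:2]
print([i["id"] for i in to_analyze])  # ['INC-102', 'INC-103']
```

Note that the low-severity INC-102 makes the cut while the SEV1 does not: that’s the separation of concerns at work.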