Cloudflare and the infinite sadness of migrations

(With apologies to The Smashing Pumpkins)

A few weeks ago, Cloudflare experienced a major outage of their popular 1.1.1.1 public DNS resolver.

On July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.

Cloudflare (@cloudflare.social) 2025-07-16T03:45:10.209Z

Technically, the DNS resolver itself was working just fine: it was (as far as I’m aware) up and running the whole time. The problem was that nobody on the Internet could actually reach it. The Cloudflare public write-up is quite detailed, and I’m not going to summarize it here. I do want to bring up one aspect of their incident, because it’s something I worry about a lot from a reliability perspective: migrations.

Cloudflare’s migration

When this incident struck, Cloudflare supported two different ways of managing what they call service topologies. There was a newer system that supported progressive rollout, and an older system where the changes occurred globally. The Cloudflare incident involved the legacy system, which makes global changes, which is why the blast radius of this incident was so large.

Source: https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/

Cloudflare engineers were clearly aware that these sorts of global changes are dangerous. After all, I’m sure that’s one of the reasons why they built their new system in the first place. But migrating all of the way to the new thing takes time.

Migrations and why I worry about them

If you’ve ever worked at any sort of company that isn’t a startup, you’ve had to deal with a migration. Sometimes a migration impacts only a single team that owns the system in question, but often migrations are changes that are large in scope (typically touching many teams) which, while providing new capabilities to the organization as a whole, don’t provide much short-term benefit to the teams who have to make a change to accommodate the migration.

A migration is a kind of change that, almost by definition, the system wasn’t originally designed to accommodate. We build our systems to support making certain types of future changes, and migrations are exactly not these kinds of changes. Each migration is typically a one-off type of change. While you’ll see many migrations if you work at a more mature tech company, each one will be different enough that you won’t be able to leverage common tooling from one migration to help make the next one easier.

All of this adds up to reliability risk. While a migration-related change wasn’t a factor in the Cloudflare incident, I believe that such changes are inherently risky, because you’re making a one-off change to the way that your system works. Developers generally have a sense that these sorts of changes are risky. As a consequence, for an individual on a team who has to do work to support somebody else’s migration, all of the incentives push them towards dragging their feet: making the migration-related change takes time away from their normal work, and increases the risk they break something. On the other hand, completing the migration generally doesn’t provide them short-term benefit. The costs typically outweigh the benefits. And so all of the forces push towards migrations taking a long time.

But a delay in implementing a migration is also a reliability risk, since migrations are often used to improve the reliability of the system. The Cloudflare incident is a perfect example of this: the newer system was safer than the old one, because it supported staged rollout. And while they ran the new system, they had to run the old one as well.

Why run one system when you can run two?

The scariest type of migration to me is the big bang migration, where you cut over all at once from the old system to the new one. Sometimes you have no choice, but it’s an approach that I personally would avoid whenever possible. The alternative is to do incremental migration, migrating parts of the system over time. To do incremental migration, you need to run the old system and the new system concurrently, until you’ve completely finished the migration and can shut the old system down. When I worked at Netflix, people used the term Roman riding to refer to running the old and new system in parallel, in reference to a style of horseback riding.

What actual Roman riding looks like

The problem with Roman riding is that it’s risky as well. While incremental is safer than big bang, running two systems concurrently increases the complexity of the system. There are many, many opportunities for incidents while you’re in the midst of a migration running the two systems in parallel.
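To make the Roman-riding phase concrete, here's a minimal sketch (in Python, with hypothetical names, not based on any particular company's tooling) of the kind of routing shim teams end up maintaining during an incremental migration: a percentage-based cutover that sends some traffic to the new system and falls back to the legacy one. Note that the shim itself is another moving part that has to be operated and reasoned about during incidents.

```python
import random

class MigrationRouter:
    """Hypothetical shim that routes requests during an incremental migration.

    While the migration is in flight, both the legacy and the new system are
    live, and this extra layer is itself a new source of failure modes.
    """

    def __init__(self, legacy_client, new_client, cutover_fraction=0.0):
        self.legacy_client = legacy_client
        self.new_client = new_client
        self.cutover_fraction = cutover_fraction  # 0.0 = all legacy, 1.0 = all new

    def handle(self, request):
        # Route a fraction of traffic to the new system.
        if random.random() < self.cutover_fraction:
            try:
                return self.new_client.handle(request)
            except Exception:
                # Fall back to the legacy system if the new one misbehaves:
                # more safety, but also more code paths to reason about.
                return self.legacy_client.handle(request)
        return self.legacy_client.handle(request)
```

Every knob here (the cutover fraction, the fallback behavior) is something a responder has to understand mid-incident, which is exactly the complexity cost of running two systems at once.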

What is to be done?

I wish I had a simple answer here. But my unsatisfying one is that engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores. I frequently joke that platform engineering should really be called migration engineering, because any org large enough to do platform engineering is going to be spending a lot of its cycles doing migrations.

Migrations are also unglamorous work: nobody’s clamoring for the title of migration engineer. People want to work on greenfield projects, not deal with the toil of a one-off effort to move the legacy thing onto the new thing. There’s also not a ton written on doing migrations. A notable exception is (fellow TLA+ enthusiast) Marianne Bellotti’s book Kill It With Fire, which sits on my bookshelf, and which I really should re-read.

I’ll end this post with some text from the “Remediation and follow-up steps” of the Cloudflare writeup:

We are implementing the following plan as a result of this incident:

Staging Addressing Deployments: Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

Deprecating Legacy Systems: We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.

I’m sure they’ll prioritize this particular migration because of the attention this incident has garnered. But I also bet there are a whole lot more in-flight migrations at Cloudflare, as well as at other companies, that increase complexity through maintaining two systems and delaying moving to the safer thing. What are they actually going to do in order to complete those other migrations more quickly? If it were easy, it would already be done.

Re-reading Technopoly

Technopoly by Neil Postman, published in 1993

Can language models be too big? asked the researchers Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell in their famous Stochastic Parrots paper about LLMs, back in 2021. Technopoly is Neil Postman’s answer to that question, despite it being written back in the early nineties.

Postman is best known for his 1985 book Amusing Ourselves to Death, about the impact of television on society. Postman passed away in 2003, one year before Facebook was released, and two years before YouTube. This was probably for the best, as social media and video sharing services like Instagram and TikTok would have horrified him, being the natural evolution of the trends he was writing about in the 1980s.

The rise of LLMs inspired me to recently re-read Technopoly. In what Postman calls technopoly, technological progress becomes the singular value that society pursues. A technopoly treats access to information as an intrinsic good: more is always better. As a consequence, it values removing barriers to the collection and transmission of information; Postman uses the example of the development of the telegraph as a technology that eliminated distance as an information constraint.

The collection and transmission of information is central to Postman’s view of technology: the book focuses entirely on such technologies, such as writing, the stethoscope, the telescope, and the computer. He would have been very comfortable with our convention of referring to software-based companies as tech companies. Consider Google’s stated mission: to organize the world’s information and make it universally accessible and useful. This statement makes for an excellent summary of the value system of technopoly. In a technopoly, the solutions to our problems can always be found by collecting and distributing more information.

More broadly, Postman notes that the worldview of technopoly is captured in Frederick Taylor’s principles of scientific management:

  • the primary goal of work is efficiency
  • technical calculation is always to be preferred over human judgment, which is not trustworthy
  • subjectivity is the enemy of clarity of thought
  • what cannot be measured can be safely ignored, because it either does not exist, or it has no value

I was familiar with Taylor’s notion of scientific management before, but it was almost physically painful for me to see its values laid out explicitly like this, because it describes the wall that I so frequently crash into when I try to advocate for a resilience engineering perspective on how to think about incidents and reliability. Apparently, I am an apostate in the Church of Technopoly.

Postman was concerned about the harms that can result from treating more information as an unconditional good. He worried about information for its own sake, divorced of human purpose and stripped of its constraints, context, and history. Facebook ran headlong into the dangers of unconstrained information transmission when its platform was leveraged in Myanmar to promote violence. In her memoir Careless People, former Facebook executive Sarah Wynn-Williams documents how Facebook as an organization was fundamentally unable to deal with the negative consequences of the platform that they had constructed. Wynn-Williams focuses on the moral failures of the executive leadership of Facebook, hence the name of the book. But Postman would also indict technopoly itself, the value system that Facebook was built on, with its claims that disseminating information is always good. In a technopoly, reducing obstacles to information access is always a good thing.

Technopoly as a book is weakest in its critique of social science. Postman identifies social scientists as a class of priests in a technopoly, the experts who worship technopoly’s gods of efficiency, precision, and objectivity. His general view of social science research results is that they are all either obviously true or absurdly false, where the false ones are believed because they come from science. I think Postman falls into the same trap as the late computer scientist Edsger Dijkstra in discounting the value of social science, both in Duncan Watts’s sense of Everything is Obvious: Once You Know the Answer and in the sense that good social science protects us from bad social science. I say this as someone who draws from social science research every day when I examine an incident. Given Postman’s role as a cultural critic, I suspect that there’s some “hey, you’re on my turf!” going on here.

Postman was concerned that technopoly is utterly uninterested in human purpose or a coherent worldview. And he’s right that social science is silent on both matters. But his identification of social scientists as technopoly’s priests hasn’t borne out. Social science certainly has its problems, with the replication crisis in psychology being a glaring example. But that’s a crisis that undermines faith in psychology research, whereas Postman was worried about people putting too much trust in the outcomes of psychology research. I’ll note that the first author of the Stochastic Parrots paper, Emily Bender, is a linguistics professor. In today’s technopoly, there are social scientists who are pushing back on the idea that more information is always better.

Overall, the book stands up well, and is even more relevant today than when it was originally published, thirty-odd years ago. While Postman did not foresee the development of LLMs, he recognized that maximizing the amount of accessible information will not be the benefit to mankind that its proponents claim. That we so rarely hear this position advocated is a testament to his claim that we are, indeed, living in a technopoly.

Component defects: RCA vs RE

Let’s play another round where we contrast the root-cause-analysis (RCA) perspective with the resilience engineering (RE) perspective. Today’s edition is about the distribution of potentially incident-causing defects across the different components in the system. Here, I’m using RCA nomenclature, since the kinds of defects that an RCA advocate would refer to as a “cause” in the wake of an incident would be called a “contributor” by the RE folks.

Here’s a stylized view of the world from the RCA perspective:

RCA view of distribution of potential incident-causing defects in the system

Note that there are a few particularly problematic components: we should certainly focus on figuring out which components are worth spending our reliability efforts on improving!

Now let’s look at the RE perspective:

RE view of distribution of potential incident-contributing defects in the system

It’s a sea of red! The whole system is absolutely shot through with defects that could contribute to an incident!

Under the RE view, the individual defects aren’t sufficient to cause an incident. Instead, it’s an interaction of these defects with other things, including other defects. Because incidents arise due to interactions, RE types will stress the importance of understanding interactions across components over the details of the specific component that happened to contain the defect that contributed to the outage. After all, according to RE folks, those defects are absolutely everywhere. Focusing on one particular component won’t yield significant improvements under this model.

If you want to appreciate the RE perspective, you need to develop an understanding of how it can be that the system is up right now despite the fact that it is absolutely shot through with all of these potentially incident-causing defects, as the RCA folks would call them. As an RE type myself, I believe that your system is up right now, and that it already contains the defect that will be implicated in the next incident. After that incident happens, the tricky part isn’t identifying the defect, it’s appreciating how the defect alone wasn’t enough to bring it down.

“What went well” is more than just a pat on the back

When writing up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it: how the responders brought the overloaded system back to a healthy state.

Which brings me to the topic of this post: the “what went well” section of an incident write-up. Generally, public incident write-ups don’t have such sections. This is almost certainly for rational political reasons: it would be, well, gauche to recount to your angry customers what a great job you did handling the incident. However, internal write-ups often have such sections, and that’s my focus here.

In my experience, “What went well” is typically the shortest section in the entire incident report, with a few brief bullet points that point out some positive aspects of the response (e.g., people responded quickly). It’s a sort of way-to-go!, a way to express some positive feedback to the responders on a job well done. This is understandable, as people believe that if we focus more on what went wrong than what went well, then we are more likely to improve the system, because we are focusing on repairing problems. This is why “what went wrong” and “what can we do to fix it” takes the lion’s share of the attention.

But the problem with this perspective is that it misunderstands the skills that are brought to bear during incident response, and how learning from a previously well-handled incident can actually help other responders do better in future incidents. Effective incident response happens because the responders are skilled. But every incident response team is an ad-hoc one, and just because you happened to have people with the right set of skills responding last time doesn’t mean you’ll have people with the right skills next time. This means that if you gloss over what went well, your next incident might be even worse than the last one, because you’ve deprived those future responders of the opportunity to learn from observing the skilled responders last time.

To make this more concrete, let’s look back at the GCP incident report. In this scenario, the engineers had put in a red-button as a safety precaution and exercised it to remediate the incident.

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path… Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. 

However, that’s not the part that interests me so much. Instead, it’s the part about how the infrastructure became overloaded as a consequence of the remediation, and how the responders recovered from overload.

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure…. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.

This was not a failure scenario that they had explicitly designed for in advance of deploying the change: there was no red-button they could simply exercise to roll back the system to a non-overloaded state. Instead, they were forced to improvise a solution based on the controls that were available to them. In this case, they were able to reduce the load by turning down the rate of task creation, as well as by re-routing traffic away from the overloaded database.

And this sort of work is the really interesting bit of an incident: how skilled responders are able to take advantage of generic functionality that is available in order to remediate an unexpected failure mode. This is one of the topics that the field of resilience engineering focuses on: how incident responders are able to leverage generic capabilities during a crunch. If I were an engineer at Google in this org, I would be very interested to learn what knobs are available and how to twist them. Describing this in detail in an incident write-up will increase my chances of being able to leverage this knowledge later. Heck, even just leaving bread crumbs in the doc will help, because I’ll remember the incident, look up the write-up, and follow the links.

Another enormously useful “what went well” aspect that often gets short shrift is a description of the diagnostic work: how the responders figured out what was going on. This never shows up in public incident write-ups, because the information is too proprietary, so I don’t blame Google for not writing about how the responders determined the source of the overload. But all too often these details are left out of the internal write-ups as well. This sort of diagnostic work is a crucial set of skills for incident response, and having the opportunity to read about how experts applied their skills to solve this problem helps transfer these skills across the organization.

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong. While every incident is different, the generic skills are common, and so getting better at response will get you more mileage than preventing repeats of previous incidents. You’re going to keep having incidents over and over. The best way to get better at incident handling is to handle more incidents yourself. The second best way is to watch experts handle incidents. The better you do at telling the stories of how your incidents were handled, the more people will learn about how to handle incidents.

Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that my questions shouldn’t be seen as a critique of the write-up, as the answers generally aren’t publicly shareable. They’re more “I wish I could be a fly on the wall inside of Google” questions.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This is the land mine risk: the code you are rolling out contains a land mine which is not tripped during the rollout.
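Here’s a deliberately contrived sketch of the land mine pattern (the names and structure are made up, not how Service Control actually works): the defective code path is only reachable when a later input arrives, so every region can pass its staged rollout with the defect still dormant, and the eventual trigger hits everywhere at once.

```python
def apply_policy(policy: dict) -> str:
    """Hypothetical policy-handling code containing a dormant defect."""
    if "extra_quota_check" in policy:
        # New code path added in the latest release. It is never exercised
        # until someone inserts a policy that uses the new field...
        limit = policy["extra_quota_check"]["limit"]  # blows up if the field is missing
        return f"enforcing limit {limit}"
    return "default policy handling"

# Region-by-region rollout: every region looks healthy, because the traffic
# during the bake period never includes the triggering input.
for region in ["us-east1", "europe-west1", "asia-south1"]:
    apply_policy({"name": "routine-policy"})
    print(f"{region}: rollout looks healthy")

# Later, a single policy change that exercises the new path replicates
# everywhere and trips the land mine in every region at once.
try:
    apply_policy({"name": "new-style-policy", "extra_quota_check": {}})
except KeyError as err:
    print(f"land mine tripped everywhere at once: missing field {err}")
```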

How did the decisions make sense at the time?

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-07-19T19:17:47.843Z

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.
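For reference, here’s what randomized (“jittered”) exponential backoff usually looks like in client code. This is a generic sketch, not Google’s implementation; the jitter is what keeps a herd of restarting tasks from retrying in lockstep and re-overloading the backend.

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.1, max_delay=30.0):
    """Retry an operation with randomized exponential backoff ("full jitter")."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # The delay cap grows exponentially with each attempt, and the
            # actual sleep is drawn at random from [0, cap] so that retries
            # from many clients spread out instead of arriving together.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```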

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.
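Since I can only guess at Google’s internals, here’s a sketch of the distinction as I usually encounter it: a feature flag gates whether the new code path is used at all (and can be ramped gradually), while a red button is a coarse, operator-controlled kill switch for an entire serving path. All of the names below are hypothetical.

```python
# Hypothetical sketch: a feature flag versus a red button guarding the same path.
FLAGS = {"new_policy_checks_enabled": False}          # ramped gradually, e.g. per region/project
RED_BUTTONS = {"disable_policy_serving_path": False}  # operator-controlled kill switch

def legacy_policy_path(request):
    return f"legacy handling of {request}"

def new_policy_path(request):
    return f"new quota checks for {request}"

def serve_policy(request):
    if RED_BUTTONS["disable_policy_serving_path"]:
        return legacy_policy_path(request)   # big red button: bypass the new path entirely
    if FLAGS["new_policy_checks_enabled"]:
        return new_policy_path(request)      # gradually-ramped new behavior
    return legacy_policy_path(request)

print(serve_policy("projects/example"))      # legacy handling while the flag is off
```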

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change that happened since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Code changes are scary, config changes are scarier, and data changes are the scariest of them all.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-14T19:32:32.669Z

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

From an operational perspective, I remain terrified of databases

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-13T17:21:16.810Z

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal operation, and it’s generally very difficult to test how it will behave when it’s in recovery mode.

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always have of corrective actions are:

  • What are the tradeoffs involved in implementing these corrective actions?
  • How might they enable new failure modes or make future incidents more difficult to deal with?

AI at Amazon: a case study of brittleness

A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, a term they use for a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

  1. Decompensation: exhausting capacity to adapt as challenges cascade
  2. Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
  3. Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

It would take weeks to get access to any internal data for analysis or experiments
Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands that are placed upon it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.

As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.

Working at cross-purposes

Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.

This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.

The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.

My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.

Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s an alignment risk: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.

In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.

Getting stuck in outdated behaviors

Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.

That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.

So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.

This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.

I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.

In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.


I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.

Pattern machines that we don’t understand

How do experts make decisions? One theory is that they generate a set of options, estimate the cost and benefits of each option, and then choose the optimal one. The psychology researcher Gary Klein developed a very different theory of expert decision-making, based on his studies of expert decision-making in domains such as firefighting, nuclear power plant operations, aviation, anesthesiology, nursing, and the military. Under Klein’s theory of naturalistic decision-making, experts use a pattern-matching approach to make decisions.

Even before Klein’s work, humans were already known to be quite good at pattern recognition. We’re so good at spotting faces that we have a tendency to see things as faces that aren’t actually faces, a phenomenon known as pareidolia.

(Wout Mager/Flickr/CC BY-NC-SA 2.0)

As far as I’m aware, Klein used the humans-as-black-boxes research approach of observing and talking to the domain experts: while he was metaphorically trying to peer inside their heads, he wasn’t doing any direct measurement or modeling of their brains. But if you are inclined to take a neurophysiological view of human cognition, you can see how the architecture of the brain provides a mechanism for doing pattern recognition. We know that the brain is organized as an enormous network of neurons, which communicate with each other through electrical impulses.

The psychology researcher Frank Rosenblatt is generally credited with being the first researcher to do computer simulations of a model of neural networks, in order to study how the brain works. He called his model a perceptron. In his paper The Perceptron: a probabilistic model for information storage and organization in the brain, he noted pattern recognition as one of the capabilities of the perceptron.

While perceptrons may have started out as a model for psychology research, they became one of a competing set of strategies for building artificial intelligence systems. The perceptron approach to AI was dealt a significant blow by the AI researchers Marvin Minsky and Seymour Papert in 1969 with the publication of their book Perceptrons. Minsky and Papert demonstrated that there were certain cognitive tasks that perceptrons were not capable of performing.

However, Minsky and Papert’s critique applied only to single-layer perceptron networks. It turns out that if you create a network out of multiple layers, and you add non-linear processing elements to the layers, then these limits to the capabilities of a perceptron no longer apply. When I took a graduate-level artificial neural networks course back in the mid 2000s, the networks we worked with had on the order of three layers. Modern LLMs have a lot more layers than that: the deep in deep learning refers to the large number of layers. For example, the largest GPT-3 model (from OpenAI) has 96 layers, the larger DeepSeek-LLM model (from DeepSeek) has 95 layers, and the largest Llama 3.1 model (from Meta) has 126 layers.
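To make the layers-plus-non-linearity point concrete: a single-layer perceptron cannot compute XOR (one of the limitations Minsky and Papert highlighted), but a two-layer network with a non-linear activation can. Here’s a tiny hand-weighted sketch; the weights are chosen by hand rather than learned, just to show the structure.

```python
def step(x):
    """Non-linear activation: the classic threshold unit."""
    return 1 if x > 0 else 0

def xor_two_layer(x1, x2):
    """XOR via a hidden layer: something no single-layer perceptron can compute."""
    h_or = step(x1 + x2 - 0.5)        # hidden unit computing OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit computing AND
    return step(h_or - h_and - 0.5)   # output: OR and not AND, i.e. XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_two_layer(a, b))
```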

Here’s a ridiculously oversimplified conceptual block diagram of a modern LLM.

There’s an initial stage which takes text and turns it into a sequence of vectors. Then that sequence of vectors gets passed through the layers in the middle. Finally, you get your answer out at the end. (Note: I’m deliberately omitting discussion of what actually happens in the stages depicted by the oval and the diamond above, because I want to focus on the layers in the middle for this post. I’m not going to talk at all about concepts like tokens, embedding, attention blocks, and so on. If you’re interested in these sorts of details, I highly recommend the video But what is a GPT? Visual intro to Transformers by Grant Sanderson).

We can imagine the LLM as a system that recognizes patterns at different levels of abstraction. The first and last layers deal directly with representations of words, so they have to operate at the word level of abstraction; let’s think of that as the lowest level. As we go deeper into the network, we can imagine each layer dealing with patterns at a higher level of abstraction, which we could call concepts. Since the last layer deals with words again, layers towards the end would be back at a lower level of abstraction.


But, really, this talk of encoding patterns at increasing and decreasing levels of abstraction is all pure speculation on my part: there’s no empirical basis to it. In reality, we have no idea what sorts of patterns are encoded in the middle layers. Do they correspond to what we humans think of as concepts? We simply have no idea how to interpret the meaning of the vectors that are generated by the intermediate layers. Are the middle layers “higher level” than the outer layers in the sense that we understand that term? Who knows? We just know that we get good results.


The things we call models have different kinds of applications. We tend to think first of scientific models, which are models that give scientists insight into how the world works. Scientific models are a type of model, but not the only one. There are also engineering models, whose purpose is to accomplish some sort of task. A good example of an engineering model is a weather prediction model that tells us what the weather will be like this week. Another good example of an engineering model is SPICE, which electrical engineers use to simulate electronic circuits.

Perceptrons started out as a scientific model of the brain, but their real success has been as an engineering model. Modern LLMs contain within them feedforward neural networks, which are the intellectual descendants of Rosenblatt’s perceptrons. Some people even refer to these as multilayer perceptrons. But LLMs are not an engineering model that was designed to achieve a specific task, the way that weather models or circuit models are. Instead, these are models that were designed to predict the next word in a sentence, and it just so happens that if you build and train your model the right way, you can use it to perform cognitive tasks that it was not explicitly designed to do! Or, as Sean Goedecke put it in a recent blog post (emphasis mine):

Transformers work because (as it turns out) the structure of human language contains a functional model of the world. If you train a system to predict the next word in a sentence, you therefore get a system that “understands” how the world works at a surprisingly high level. All kinds of exciting capabilities fall out of that – long-term planning, human-like conversation, tool use, programming, and so on.

This is a deeply weird and surprising outcome of building a text prediction system. We’ve built text prediction systems before. Claude Shannon was writing about probability-based models of natural language back in the 1940s in his famous paper that gave birth to the field of information theory. But it’s not obvious that once these models got big enough, we’d get results like we’re getting today, where you could ask the model questions and get answers. At least, it’s not obvious to me.
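To see how modest those earlier text prediction systems were, here’s a toy Shannon-style bigram predictor: count which word follows which in a corpus, then sample the next word from those counts. It’s the same next-word-prediction task that LLMs perform, at a scale where nothing resembling understanding falls out.

```python
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Bigram model: for each word, count which words follow it.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Sample the next word in proportion to how often it followed `word`."""
    counts = following[word]
    if not counts:
        return None
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

print(predict_next("the"))  # e.g. "cat", "mat", "dog", or "rug"
```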

In 2020, the linguistics researchers Emily Bender and Alexander Koller published a paper titled Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. This is sometimes known as the octopus paper, because it contains a thought experiment about a hyper-intelligent octopus eavesdropping on a conversation between two English speakers by tapping into an undersea telecommunications cable, and how the octopus could never learn the meaning of English phrases through mere exposure. This seems to contradict Goedecke’s observation. They also note how research has demonstrated that humans are not capable of learning a new language through mere exposure to it (e.g., through TV or radio). But I think the primary thing this illustrates is how fundamentally different LLMs are from human brains, and how little we can learn about LLMs by making comparisons to humans. The architecture of an LLM is radically different from the architecture of a human brain, and the learning processes are also radically different. I don’t think a human could learn the structure of a new language by being exposed to a massive corpus and then trying to predict the next word. Our intuitions, which work well when dealing with humans, simply break down when we try to apply them to LLMs.


The late philosopher of mind Daniel Dennett proposed the concept of the intentional stance, as a perspective we take for predicting the behavior of things that we consider to be rational agents. To illustrate it, let’s contrast it with two other stances he mentions, the physical stance and the design stance. Consider the following three different scenarios, where you’re asked to make a prediction.

Scenario 1: Imagine that a child has rolled a ball up a long ramp which is at a 30 degree incline. I tell you that the ball is currently rolling up the ramp at 10 metres / second and ask you to predict what its speed will be one minute from now.

A ball that has been rolled up a ramp

Scenario 2: Imagine a car is driving up a hill at a 10 degree incline. I tell you that the car is currently moving at a speed of 60 km/h, and that the driver has cruise control enabled, also set at 60 km/h. I ask you to predict the speed of the car one minute from now.

A car with cruise control enabled, driving uphill

Scenario 3: Imagine another car on a flat road that is going at 50 km/h and is about to enter an intersection, and the traffic light has just turned yellow. Another bit of information I give you: the driver is heading to an important job interview and is running late. Again, I ask you to predict the speed of the car one minute from now.

In the first scenario (ball rolling up a ramp), we can predict the ball’s future speed by treating it as a physics problem. This is what Dennett calls the physical stance.

In the second scenario (car with cruise control enabled), we view the car as an artifact that was designed to maintain its speed when cruise control is enabled. We can easily predict that its future speed will be 60 km/h. This is what Dennett calls the design stance. Here, we are using our knowledge that the car has been designed to behave in certain ways in order to predict how it will behave.

In the third scenario (driver running late who encounters a yellow light), we think about the intentions of the driver: they don’t want to be late for their interview, so we predict that they will accelerate through the intersection, and that their future speed will be somewhere around 60 km/h. This is what Dennett calls the intentional stance. Here, we are using our knowledge of the desires and beliefs of the driver to predict what actions they will take.

Now, because LLMs have been designed to replicate human language, our instinct is to apply the intentional stance to predict their behavior. It’s a kind of pareidolia: we’re seeing intentionality in a system that mimics human language output. Dennett was horrified by this.

But the design stance doesn’t really help us either, with LLMs. Yes, the design stance enables us to predict that an LLM-based chatbot will generate plausible-sounding answers to our questions, because that is what it was designed to do. But, beyond that, we can’t really reason about its behavior.

Generally, operational surprises are useful in teaching us how our system works by letting us observe circumstances in which it is pushed beyond its limits. For example, we might learn about a hidden limit somewhere in the system that we didn’t know about before. This is one of the advantages of doing incident reviews, and it’s also one of the reasons that psychologists study optical illusions. As Herb Simon put it in The Sciences of the Artificial, Only when [a bridge] has been overloaded do we learn the physical properties of the materials from which it is built.

However, when an LLM fails from our point of view by producing a plausible but incorrect answer to a question, this failure mode doesn’t give us any additional insight into how the LLM actually works. Because, in a real sense, that LLM is still successfully performing the task that it was designed to do: generate plausible-sounding answers. We aren’t capable of designing LLMs that only produce correct answers; we can only design ones that produce plausible answers. And so we learn nothing from what we consider LLM failures, because the LLMs aren’t actually failing. They are doing exactly what they are designed to do.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability has significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for the system you thought of whose reliability increased over time, consider what its complexity trend looks like over that same period. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. As someone who grew up in the 80s and 90s, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one that most people have in mind when they think about increasing complexity in their system, is synonymous with the idea of tech debt. With tech debt the increase in complexity makes the system less reliable, because the risk of making a breaking change in the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely that you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Dijkstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability, but those features come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It? and think about the work that we do to make our software systems more reliable, functionality like retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and auxiliary systems we have to support our reliability work like an observability stack. All of this stuff adds complexity.
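As a small illustration of how each of those mechanisms earns its complexity: even a bare-bones circuit breaker (sketched below in Python, not any particular library’s API) introduces state, thresholds, and timing behavior that operators now have to understand on top of the call it protects.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: more reliability, and also more moving parts."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failure_count = 0  # success resets the failure count
        return result
```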

Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion of code would make the codebase simpler. There’s a reason that programming books tend to avoid error handling in their examples: it does increase complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.

Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems which can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.

The same incident never happens twice, but the patterns recur over and over

“No man ever steps in the same river twice. For it’s not the same river and he’s not the same man” – attributed to Heraclitus

After an incident happens, many people within the organization are worried about the same incident happening again. In one sense, the same incident can never really happen again, because the organization has changed since the incident happened. Incident responders will almost certainly be more effective at dealing with a failure mode they’ve encountered recently than one they’re hitting for the first time.

In fairness, if the database falls over again, saying, “well, actually, it’s not the same incident as last time because we now have experience with the database falling over so we were able to recover more quickly” isn’t very reassuring to the organization. People are worried that there’s an imminent risk that remains unaddressed, and saying “it’s not the same incident as last time” doesn’t alleviate the concern that the risk has not been dealt with.

But I think that people tend to look at the wrong level of abstraction when they talk about addressing risks that were revealed by the last incident. They suffer from what I’ll call no-more-snow-goon-ism:

Calvin is focused on ensuring the last incident doesn’t happen again

Saturation is an example of a higher-level pattern that I never hear people talk about when focusing on eliminating incident recurrence. I will assert that saturation is an extremely common pattern in incidents: I’ve brought it up when writing about public incident writeups at Canva, Slack, OpenAI, Cloudflare, Uber, and Rogers. The reason you won’t hear people discuss saturation is because they are generally too focused on the specific saturation details of the last incident. But because there are so many resources you can run out of, there are many different possible saturation failure modes. You can exhaust CPU, memory, disk, threadpools, bandwidth, you can hit rate limits, you can even breach limits that you didn’t know existed and that aren’t exposed as metrics. It’s amazing how much different stuff there is that you can run out of.

My personal favorite pattern is unexpected behavior of a subsystem whose primary purpose was to improve reliability, and it’s one of the reasons I’m so bear-ish about the emphasis on corrective actions in incident reviews, but there are many other patterns you can identify. If you hit an expired certificate, you may think of “expired certificate” as the problem, but time-based behavior change is a more general pattern for that failure mode. And, of course, there’s the ever-present production pressure.

If you focus too narrowly on preventing the specific details of the last incident, you’ll fail to identify the more general patterns that will enable your future incidents. Under this narrow lens, all of your incidents will look like either recurrences of previous incidents (“the database fell over again!”) or will look like a completely novel and unrelated failure mode (“we hit an invisible rate limit with a vendor service!”). Without seeing the higher level patterns, you won’t understand how those very different looking incidents are actually more similar than you think.

LLMs are weird, man

The late science fiction author Arthur C. Clarke had a great line: “Any sufficiently advanced technology is indistinguishable from magic.” (This line inspired the related observation: any sufficiently advanced technology is indistinguishable from a rigged demo.) Clarke was referring to scenarios where members of one civilization encounter technology developed by a different civilization. The Star Trek: The Next Generation episode titled Who Watches The Watchers is an example of this phenomenon in action. The Federation is surreptitiously observing the Mintakans, a pre-industrial alien society, when Federation scientists accidentally reveal themselves to the Mintakans. When the Mintakans witness Federation technology in action, they conclude that Captain Picard is a god.

LLMs are the first time I’ve encountered a technology developed by my own society where I felt like it was magic. Not magical in the “can do amazing things” sense, but magical in the “I have no idea how it even works” sense. Now, there’s plenty of technology that I interact with on a day-to-day basis that I don’t really understand in any meaningful sense. And I don’t just mean sophisticated technologies like, say, cellular phones. Heck, I’d be hard-pressed to explain to you precisely how a zipper works. But existing technology feels understandable in principle: if I were willing to put in the effort, I could learn how it works.

But LLMs are different, in the sense that nobody understands how they work, not even the engineers who designed them. Consider the human brain as an analogy for a moment: at some level, we understand how the human brain works, how it’s a collection of interconnected neurons arranged in various structures. We have pretty good models of how individual neurons behave. But if I asked you “how is the concept of the number two encoded in a human brain?”, nobody today could give a satisfactory answer. It has to be represented in there somehow, but we don’t quite know how.

Similarly, at the implementation level, we do understand how LLMs work: how words are encoded as vectors, how they are trained on data to do token prediction, and so on. But these LLMs perform cognitive tasks, and we don’t really understand how they do that via token prediction. Consider this blog post from Anthropic from last month: Tracing the thoughts of a large language model. It talks about two research papers published by Anthropic where they are trying to understand how Claude (which they built!) performs certain cognitive tasks. They are essentially trying to reverse-engineer a system that they themselves built! Or, to use the analogy they use explicitly in the post, they are doing AI biology: they are approaching the problem of how Claude performs certain tasks the way that a biologist would approach the problem of how a particular organism performs a certain function.

Now, engineering researchers routinely study the properties of new technologies that humans have developed. For example, engineering researchers had to study the properties of solid-state devices like transistors; they didn’t know what those properties were just because they created them. But that’s different from the sort of reverse-engineering research that the Anthropic engineers are doing. We’ve built something to perform a very broad set of tasks, and it works (for various values of “works”), but we don’t quite know how. I can tell you exactly how a computer encodes the number two in either integer form (using two’s complement encoding) or in floating point form (using IEEE 754 encoding). But, just as I could not tell you how the human brain encodes the number two as a concept, I could not tell you how Claude encodes the number two as a concept. I don’t even know if “concept” is a meaningful, well, concept, for LLMs.
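For contrast, here’s the kind of answer I can give for conventional encodings: a few lines of Python (my own aside, obviously not part of the Anthropic work) that print exactly how the number two is laid out as a 32-bit two’s complement integer and as an IEEE 754 single-precision float.

    # How a conventional computer encodes the number two (no mystery here).
    import struct

    n = 2

    # 32-bit two's complement integer: 0x00000002
    print(f"int32:   {n & 0xFFFFFFFF:032b}")

    # IEEE 754 single precision: sign 0, biased exponent 128, mantissa 0,
    # which is the bit pattern 0x40000000
    bits = struct.unpack(">I", struct.pack(">f", float(n)))[0]
    print(f"float32: {bits:032b}")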


There are two researchers who have won both the Turing Award and the Nobel Prize. The most recent winner is Geoffrey Hinton, who did foundational work in artificial neural networks, which eventually led to today’s LLMs. The other dual winner was also an AI researcher: Herbert Simon. Simon wrote a book called The Sciences of the Artificial, about how we should study artificial phenomena.

And LLMs are certainly artificial. We can argue philosophically about whether concepts in mathematics (e.g., the differential calculus) or theoretical computer science (e.g., the lambda calculus) are invented or discovered. But LLMs are clearly a human artifact; I don’t think anybody would argue that we “discovered” them. LLMs are a kind of black-box model of human natural language: we examine just the output of humans in the form of written language, and try to build a statistical model of it. Model here is a funny word. We generally think of models as simplified views of reality that we can reason about: that’s certainly how scientists use models. But an LLM isn’t that kind of model. In fact, its behavior is so complex that we have to build models of the model in order to do the work of trying to understand it. Or, as the authors of one of the Anthropic papers put it in On the Biology of a Large Language Model: “Our methods study the model indirectly using a more interpretable ‘replacement model,’ which incompletely and imperfectly captures the original.”

As far as I’m aware, we’ve never had to do this sort of thing before. We’ve never engineered systems in such a way that we don’t fundamentally understand how they work. Yes, our engineered world contains many complex systems where nobody really understands how the entire system works; I write about that frequently in this blog. But I claim that this sort of non-understanding of LLMs on our part is different in kind from our non-understanding of complex systems.

Unfortunately, the economics of AI obscures the weirdness of the technology. There’s a huge amount of AI hype going on as VCs pour money into AI-based companies, and there’s discussion of using AI to replace humans for certain types of cognitive work. These trends, along with the large power consumption required by these AI models, have, unsurprisingly, triggered a backlash. I’m looking forward to the end of the AI hype cycle, when we all stop talking about AI so damned much, when it finally settles into whatever the equilibrium ends up being.

But I think it’s a mistake to write off this technology as just a statistical model of text. I think the word “just” is doing too much heavy lifting in that sentence. Our intuitions break down when we encounter systems beyond the scales of everyday human life, and LLMs are an example of that. It’s like saying “humans are just a soup of organic chemistry” (cf. Terry Bisson’s short story They’re Made out of Meat). Intuitively, it doesn’t seem possible that evolution by natural selection would lead to conscious beings. But, somehow, we humans are an emergent property of long chains of amino acids recombining, randomly changing, reproducing, and being filtered out by nature. The timescale of evolution is so unimaginably long that our intuition of what evolution can do breaks down: we probably wouldn’t believe that such a thing was even possible if the evidence in support of it wasn’t so damn overwhelming. It’s worth noting here that one of the alternative approaches to AI was inspired by evolution by natural selection: genetic algorithms. However, this approach has proven much less effective than artificial neural networks. We’ve been playing with artificial neural networks on computers since the 1950s, and once we scaled those networks up with large enough training sets and parameter counts, and hit upon effective architectures, we achieved qualitatively different results.

Here’s another example of how our intuitions break down at scales outside of our immediate experience, this one borrowed from the philosophers Paul and Patricia Churchland in their criticism of John Searle’s Chinese Room argument. The Churchlands ask us to imagine a critic of James Clerk Maxwell’s electromagnetic theory who takes a magnet, shakes it back and forth, sees no light emerge from the shaken magnet, and concludes that Maxwell’s theory is incorrect. Understanding the nature of light is particularly challenging for us humans because it behaves at scales outside of the typical human ones: our intuitions are a hindrance rather than a help.

Just look at this post by Simon Willison about Claude’s system prompt. Ten years ago, if you had told me that a software company would configure the behavior of their system with a natural language prompt, I would have laughed at you and told you, “that’s not how computers work.” We don’t configure conventional software by guiding it with English sentences and hoping that pushes it in a direction that results in more desirable outcomes. This is much closer to Isaac Asimov’s Three Laws of Robotics than to setting fields in a YAML file. According to my own intuitions, telling a computer in English how to behave shouldn’t work at all. And yet, here we are. It’s like the old joke about the dancing bear: it’s not that it dances well, but that it dances at all. I am astonished by this technology.
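To make the contrast concrete, here’s a sketch of the two styles of “configuration” (entirely my own illustration: the config fields, prompt text, and model id are placeholders, and the commented-out client call is only a sketch of how such SDKs are typically used, not a recipe to copy).

    # The way I'm used to configuring software: structured fields with
    # well-defined semantics. (All names here are hypothetical.)
    legacy_config = {
        "max_retries": 3,
        "timeout_seconds": 30,
        "feature_flags": {"new_checkout": False},
    }

    # The new way: English sentences, plus hope that they nudge behavior
    # in the desired direction.
    system_prompt = (
        "You are a helpful assistant. Be concise. If you are unsure, "
        "say so instead of guessing."
    )

    # With an LLM client SDK, the prompt is just another request parameter.
    # Sketch only; model id and parameters are placeholders, so check the
    # SDK documentation for the real call:
    # client.messages.create(
    #     model="claude-placeholder",
    #     max_tokens=512,
    #     system=system_prompt,
    #     messages=[{"role": "user", "content": "Summarize this incident."}],
    # )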

So, while I’m skeptical of the AI hype, I’m also skeptical of the critics who dismiss this technology too quickly. I think we just don’t understand this new technology well enough to know what it is actually capable of. We don’t know whether changes in LLM architecture will lead to only incremental improvements or could give us another order of magnitude. And we certainly don’t know what’s going to happen when people attempt to leverage the capabilities of this new technology.

The only thing I’m comfortable predicting is that we’re going to be surprised.


Postscript: I don’t use LLMs to generate the text in my blog posts, because I use these posts specifically to clarify my own thinking. I’d be willing to use one as a copy-editor, but so far I’ve been unimpressed with WordPress’s “AI assistant: show issues & suggestions” feature. Hopefully that gets better over time.

I do find LLMs to often give me better results than search engines like Google or DuckDuckGo, but it’s still hit or miss.

While doing some of the research for this post, I found Claude was great at identifying the episode of Star Trek I was thinking of.

But it initially failed to identify either Herb Simon or Geoffrey Hinton as dual Nobel/Turing winners.

If I explicitly prompted Claude about the winners, it returned details about them.

Claude was also no help in identifying the “shaking the magnet” critique of Searle’s Chinese Room. I originally thought that it came from the late philosopher Daniel Dennett (who was horrified at how LLMs can fool people into believing they are human). It turns out the critique came from the Churchlands, but Claude couldn’t figure that out; I ultimately found it through a DuckDuckGo search.