The trap of tech that’s great in the small but not in the large

There are software technologies that work really well in-the-small, but they don’t scale up well. The challenge here is that the problem size grows incrementally and migrating off of them requires significant effort, so locally it makes sense to keep using them, but then you reach a point where you’re well into the size where they are a liability rather than an asset. Here are some examples.

Shell scripts

Shell scripts are fantastic in the small: throughout my career, I’ve written hundreds and hundreds of bash scripts that are twenty lines or less, typically closer to ten, and frequently less than five lines. But, as soon as I need to write an if statement, that’s a sign to me that I should probably write it in something like Python instead. Fortunately, I’ve rarely encountered large shell scripts in the wild these days, with DevStack being a notable exception.

Makefiles

I love using makefiles as simple task runners. In fact, I regularly use just, which is like an even simpler version of make, and has similar syntax. And I’ve seen makefiles used to good effect for building simple Go programs.

But there’s a reason technologies like Maven, Gradle, and Bazel emerged, and it’s because large-scale makefiles are an absolute nightmare. Someone even wrote a paper called Recursive Make Considered Harmful.

YAML

I’m not a YAML hater; I actually like it for configuration files that are reasonably sized, where “reasonably sized” means something like “30 lines or fewer”. I appreciate its support for things like comments and not having to quote strings.

However, given how much of software operations runs on YAML these days, I’ve been burned too many times by having to edit very large YAML files. What’s human-readable in the small isn’t human-readable in the large.

Spreadsheets

The business world runs on spreadsheets: they are the biggest end-user programming success story in human history. Unfortunately, spreadsheets sometimes evolve into being de facto databases, which is terrifying. The leap required to move from using a spreadsheet as your system of record to using a real database is huge, which explains why spreadsheets so often end up stuck in that role.

Markdown

I’m a big fan of Markdown, but I’ve never tried to write an entire book with it. I’m going to outsource this example to Hillel Wayne: see his post Why I prefer rST to markdown: I will never stop dying on this hill.

Formal specs as sets of behaviors

Amazon’s recent announcement of their spec-driven AI tool, Kiro, inspired me to write a blog post on a completely unrelated topic: formal specifications. In particular, I wanted to write about how a formal specification is different from a traditional program. It took a while for this idea to really click in my own head, and I wanted to motivate some intuition here.

In particular, there have been a number of formal specification tools that have been developed in recent years which use programming-language-like notation, such as FizzBee, P, PlusCal, and Quint. I think these notations are more approachable for programmers than the more set-theoretic notation of TLA+. But I think the existence of programming-language-like formal specification languages makes it even more important to drive home the difference between a program and a formal spec.

The summary of this post is: a program is a list of instructions, a formal specification is a set of behaviors. But that’s not very informative on its own. Let’s get into it.

What kind of software do we want to specify?

Generally speaking, we can divide the world of software into two types of programs. There is one type where you give the program a single input, it produces a single output, and then it stops. The other type runs for an extended period of time and interacts with the world by receiving inputs and generating outputs over time. In a paper published in the mid-1980s, the computer scientists David Harel (developer of statecharts) and Amir Pnueli (the first person to apply temporal logic to software specifications) made a distinction between programs they called transformational (the first kind) and programs they called reactive (the second kind).

Source: On the Development of Reactive Systems by Harel and Pnueli

A compiler is an example of a transformational program, and you can think of many command-line tools as falling into this category. An example of the second type is the flight control software in an airplane, which runs continuously, taking in inputs and generating outputs over time. In my world, services are a great example of reactive systems: they’re long-running programs that receive requests as inputs and generate responses as outputs. The specifications that I’m talking about here apply to the more general reactive case.
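To make the distinction concrete, here’s a minimal sketch of my own (not from the Harel and Pnueli paper) contrasting the two styles in Python:

# Transformational: one input in, one output out, then the program stops.
def word_count(text: str) -> int:
  return len(text.split())

# Reactive: runs indefinitely, consuming inputs and producing outputs over time.
def serve(requests):
  for request in requests:          # inputs arrive over time
    yield f"handled {request}"      # outputs are generated over time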

A motivating example: a counter

Let’s consider the humble counter as an example of a system whose behavior we want to specify. I’ll describe what operations I want my counter to support using Python syntax:

class Counter:
  def inc(self) -> None:
    ...
  def get(self) -> int:
    ...
  def reset(self) -> None:
    ...

My example will be sequential to keep things simple, but all of the concepts apply to specifying concurrent and distributed systems as well. Note that implementing a distributed counter is a common system design interview problem.
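For concreteness, here’s one possible implementation that matches those signatures (a minimal sketch of my own; any implementation with the same observable behavior would do just as well):

class Counter:
  def __init__(self) -> None:
    self._count = 0        # the counter starts at zero

  def inc(self) -> None:
    self._count += 1       # increment by one

  def get(self) -> int:
    return self._count     # report the current count

  def reset(self) -> None:
    self._count = 0        # back to zero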

Behaviors

I implemented this counter and interacted with it in the Python REPL; here’s what that looked like:

>>> c = Counter()
>>> c.inc()
>>> c.inc()
>>> c.inc()
>>> c.get()
3
>>> c.reset()
>>> c.inc()
>>> c.get()
1

People sometimes refer to the sort of thing above by various names: a session, an execution, an execution history, an execution trace. The formal methods people refer to this sort of thing as a behavior, and that’s the term that we’ll use in the rest of this post. Specifications are all about behaviors.

Sometimes I’m going to draw behaviors in this post. I’m going to denote a behavior as a squiggle.

To tie this back to the discussion about reactive systems, you can think of method invocation as inputs, and return values as outputs. The above example is a correct behavior for our counter. But a behavior doesn’t have to be correct: a behavior is just an arbitrary sequence of inputs and outputs. Here’s an example of an incorrect behavior for our counter.

>>> c = Counter()
>>> c.inc()
>>> c.get()
4

We expected the get method to return 1, but instead it returned 4. If we saw that behavior, we’d say “there’s a bug somewhere!”

Specifications and behaviors

What we want out of a formal specification is a device that can answer the question: “here’s a behavior: is it correct or not?”. That’s what a formal spec is for a reactive system. A formal specification is an entity such that, given a behavior, we can determine whether the behavior satisfies the spec. Correct = satisfies the specification.

Once again, a spec is a thing that will tell us whether or not a given behavior is correct.

A spec as a set of behaviors

I depicted a spec in the diagram above as, literally, a black box. Let’s open that box. We can think of a specification simply as a set that contains all of the correct behaviors. Now, the “correct?” processor above is just a set membership check: all it does is check whether the behavior is an element of the set spec.
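In code terms, it’s something like this deliberately tiny sketch, where I’m representing a behavior as a tuple of (operation, result) events and standing in a small finite set for the real (infinite) one:

# Pretend, for a moment, that the spec is a small, finite set of behaviors.
spec = {
  (("get", 0),),
  (("inc", None), ("get", 1)),
}

def is_correct(behavior) -> bool:
  return behavior in spec   # correctness is just a set membership check

is_correct((("inc", None), ("get", 1)))   # True
is_correct((("inc", None), ("get", 4)))   # False: not in the spec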

What could be simpler?

Note that this isn’t a simplification: this is what a formal specification is in a system like TLA+. It’s just a set of behaviors: nothing more, nothing less.

Describing a set of behaviors

You’re undoubtedly familiar with sets. For example, here’s a set of the first three positive natural numbers: {1, 2, 3}. Here, we described the set by explicitly enumerating each of the elements.

While the idea of a spec being a set of behaviors is simple, actually describing that set is trickier. That’s because we can’t explicitly enumerate the elements of the set like we did above. For one thing, each behavior is, in general, of infinite length. Taking the example of our counter, one valid behavior is to just keep calling any operation over and over again, ad infinitum.

>>> c = Counter()
>>> c.get()
0
>>> c.get()
0
>>> c.get()
0
... (forever)

A behavior of infinite length

This is a correct behavior for our counter, but we can’t write it out explicitly, because it goes on forever.

The other problem is that the specs that we care about typically contain an infinite number of behaviors. If we take the case of a counter, for any finite correct behavior, we can always generate a new correct behavior by adding another inc, get, or reset call.

So, even if we restrict ourselves to behaviors of finite length, as long as we don’t bound the total length of a behavior (i.e., if our behaviors are finite but unbounded, like natural numbers), we cannot define a spec by explicitly enumerating all of the behaviors in the specification.

And this is where formal specification languages come in: they allow us to define infinite sets of behaviors without having to explicitly enumerate every correct behavior.

Describing infinite sets by generating them

Mathematicians deal with infinite sets all of the time. For example, we can use set-builder notation to describe the infinitely large set of all even natural numbers without explicitly enumerating each one:

{2k | k ∈ ℕ}

The example above references another infinite set, the set of natural numbers (ℕ). How do we generate that infinite set without reference to another one?

One way is to define the set of natural numbers by describing how to generate it. To do this, we specify:

  1. an initial natural number (either 0 or 1, depending on who you ask)
  2. a successor function for how to generate a new natural number from an existing one

This allows us to describe the set of natural numbers without having to enumerate each one explicitly. Instead, we describe how to generate them. If you remember your proofs by induction from back in math class, this is like defining a set by induction.
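As a sketch of that generative definition in Python (taking 0 as the initial element and +1 as the successor function):

def naturals():
  n = 0          # 1. the initial natural number
  while True:
    yield n
    n = n + 1    # 2. the successor function: a new natural from an existing one

# This describes every natural number without enumerating them up front.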

Specifications as generating a set of behaviors

A formal specification language is just a notation for describing a set of behaviors by generating them. In TLA+, this is extremely explicit. All TLA+ specs have two parts:

  • Init – which describes all valid initial states
  • Next – which describes how to extend an existing valid behavior to one or more new valid behavior(s)

Here’s a visual representation of generating correct behaviors for the counter.

Generating all correct behaviors for our counter

Note how, in the case of the counter, there’s only one valid initial state in a behavior: all of the correct behaviors start the same way. After that, when generating a new behavior based on a previous one, whether one behavior or multiple behaviors can be generated depends on the history. If the last event was a method invocation, then there’s only one valid way to extend that behavior: the expected response to that invocation. If the last event was a method return, then you can extend the behavior in three different ways, based on the three different methods you can call on the counter.

The (Init, Next) pair describes all of the possible correct behaviors of the counter by generating them.
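Here’s a rough sketch of that generative structure in Python rather than TLA+. I’m representing a behavior as a tuple of (operation, result) events, collapsing each invocation and its return into a single event to keep things short; this is my own illustration of the idea, not an actual TLA+ spec:

def init():
  # There is exactly one valid way for a behavior to start: a fresh counter.
  return [()]

def next_behaviors(behavior):
  # All the valid ways to extend a correct behavior by one more operation.
  count = 0
  for op, _ in behavior:    # replay the history to find the current count
    if op == "inc":
      count += 1
    elif op == "reset":
      count = 0
  return [
    behavior + (("inc", None),),
    behavior + (("reset", None),),
    behavior + (("get", count),),   # get must report the current count
  ]

# Starting from init() and repeatedly applying next_behaviors generates,
# in the limit, the entire (infinite) set of correct behaviors: the spec.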

Nondeterminism

One area where formal methods can get confusing for newcomers is that the notation for writing the behavior generator can look like a programming language, particularly when it comes to nondeterminism.

When you’re writing a formal specification, you want to express “here are all of the different ways that you can validly extend this behavior”, hence you get that branching behavior in the diagram in the previous section: you’re generating all of the possible correct behaviors. In a formal specification, when we talk about “nondeterminism”, we mean “there are multiple ways a correct behavior can be extended”, and that includes all of the different potential inputs that we might receive from outside. In formal specifications, nondeterminism is about extending a correct behavior along multiple paths.

On the other hand, in a computer program, when we talk about code being nondeterministic, we mean “we don’t know which path the code is going to take”. In the programming world, we typically use nondeterminism to refer to things like random number generation or race conditions. One notable area where they’re different is that formal specifications treat inputs as a source of nondeterminism, whereas programmers don’t include inputs when they talk about nondeterminism. If you said “user input is one of the sources of nondeterminism”, a formal modeler would nod their head, and a programmer would look at you strangely.

Properties of a spec: sets of behaviors

I’ve been using the expressions correct behavior and behavior satisfies the specification interchangeably. However, in practice, we build formal specifications to help us reason about the correctness of the system we’re trying to build. Just because we’ve written a formal specification doesn’t mean that the specification is actually correct! That means that we can’t treat the formal specification that we build as the correct description of the system in general.

The most frequent tactic people use to reason about their formal specifications is to define correctness properties and use a model-checking tool to check whether their specification conforms to the property or not.

Here’s an example of a property for our counter: the get operation always returns a non-negative value. Let’s give it a name: the no-negative-gets property. If our specification has this property, we don’t know for certain it’s correct. But if it doesn’t have this property, we know for sure something is wrong!

Like a formal specification, a property is nothing more than a set of behaviors! Here’s an example of a behavior that satisfies the no-negative-gets property:

>>> c = Counter()
>>> c.get()
0
>>> c.inc()
>>> c.get()
1

And here’s another one:

>>> c = Counter()
>>> c.get()
5
>>> c.inc()
>>> c.get()
3

Note that the second behavior probably looks wrong to you. We haven’t actually written out a specification for our counter in this post, but if we had, the behavior above would certainly violate it: that’s not how counters work. On the other hand, it still satisfies the no-negative-gets property. In practice, the set of behaviors defined by a property will include behaviors that aren’t in the specification, as depicted below.

A spec that satisfies a property.

When we check that a spec satisfies a property, we’re checking that Spec is a subset of Property. We just don’t care about the behaviors that are in the Property set but not in the Spec set. What we do care about are behaviors that are in Spec but not in Property: those tell us that our specification can generate behaviors that do not possess the property that we care about.
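In set terms, a model checker is doing something like the following sketch (again pretending, for illustration, that both sets are finite and explicit, which they aren’t in practice):

def check(spec: set, prop: set):
  counterexamples = spec - prop    # behaviors the spec generates but the property forbids
  if counterexamples:
    return ("property violated", counterexamples)
  return ("property holds", None)  # i.e., spec is a subset of prop

# Behaviors that are in prop but not in spec never even enter into the check.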

A spec that does not satisfy a property

Consider the property: get always returns a positive number. We can call it all-positive-gets. Note that zero is not considered a positive number. Assuming our counter specification starts at zero, here’s a behavior that violates the all-positive-gets property:

>>> c = Counter()
>>> c.get()
0

Thinking in sets

When writing formal specifications, I found that thinking in terms of sets of behaviors was a subtle but significant mind-shift from thinking in terms of writing traditional programs. Where it helped me most is in making sense of the errors I get when debugging my TLA+ specifications using the TLC model checker. After all, it’s when things break that you really need to understand what’s going on under the hood. And I promise you, when you write formal specs, things are going to break. That’s why we write them: to find where the breaks are.

Cloudflare and the infinite sadness of migrations

(With apologies to The Smashing Pumpkins)

A few weeks ago, Cloudflare experienced a major outage of their popular 1.1.1.1 public DNS resolver.

On July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.

Cloudflare (@cloudflare.social) 2025-07-16T03:45:10.209Z

Technically, the DNS resolver itself was working just fine: it was (as far as I’m aware) up and running the whole time. The problem was that nobody on the Internet could actually reach it. The Cloudflare public write-up is quite detailed, and I’m not going to summarize it here. I do want to bring up one aspect of their incident, because it’s something I worry about a lot from a reliability perspective: migrations.

Cloudflare’s migration

When this incident struck, Cloudflare supported two different ways of managing what they call service topologies. There was a newer system that supported progressive rollout, and an older system where the changes occurred globally. The Cloudflare incident involved the legacy system, which makes global changes, which is why the blast radius of this incident was so large.

Source: https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/

Cloudflare engineers were clearly aware that these sorts of global changes are dangerous. After all, I’m sure that’s one of the reasons why they built their new system in the first place. But migrating all of the way to the new thing takes time.

Migrations and why I worry about them

If you’ve ever worked at any sort of company that isn’t a startup, you’ve had to deal with a migration. Sometimes a migration impacts only a single team that owns the system in question, but often migrations are changes that are large in scope (typically touching many teams) which, while providing new capabilities to the organization as a whole, don’t provide much short-term benefit to the teams who have to make a change to accommodate the migration.

A migration is a kind of change that, almost by definition, the system wasn’t originally designed to accommodate. We build our systems to support making certain types of future changes, and migrations are exactly not these kinds of changes. Each migration is typically a one-off type of change. While you’ll see many migrations if you work at a more mature tech company, each one will be different enough that you won’t be able to leverage common tooling from one migration to help make the next one easier.

All of this adds up to reliability risk. While a migration-related change wasn’t a factor in the Cloudflare incident, I believe that such changes are inherently risky, because you’re making a one-off change to the way that your system works. Developers generally have a sense that these sorts of changes are risky. As a consequence, for an individual on a team who has to do work to support somebody else’s migration, all of the incentives push them towards dragging their feet: making the migration-related change takes time away from their normal work, and increases the risk they break something. On the other hand, completing the migration generally doesn’t provide them short-term benefit. The costs typically outweigh the benefits. And so all of the forces push towards migrations taking a long time.

But a delay in implementing a migration is also a reliability risk, since migrations are often used to improve the reliability of the system. The Cloudflare incident is a perfect example of this: the newer system was safer than the old one, because it supported staged rollout. And while they ran the new system, they had to run the old one as well.

Why run one system when you can run two?

The scariest type of migration to me is the big bang migration, where you cut over all at once from the old system to the new one. Sometimes you have no choice, but it’s an approach that I personally would avoid whenever possible. The alternative is to do incremental migration, migrating parts of the system over time. To do incremental migration, you need to run the old system and the new system concurrently, until you’ve completely finished the migration and can shut the old system down. When I worked at Netflix, people used the term Roman riding to refer to running the old and new system in parallel, in reference to a style of horseback riding.

What actual Roman riding looks like

The problem with Roman riding is that it’s risky as well. While incremental is safer than big bang, running two systems concurrently increases the complexity of the system. There are many, many opportunities for incidents while you’re in the midst of a migration running the two systems in parallel.

What is to be done?

I wish I had a simple answer here. But my unsatisfying one is that engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores. I frequently joke that platform engineering should really be called migration engineering, because any org large enough to do platform engineering is going to be spending a lot of its cycles doing migrations.

Migrations are also unglamorous work: nobody’s clamoring for the title of migration engineer. People want to work on greenfield projects, not deal with the toil of a one-off effort to move the legacy thing onto the new thing. There’s also not a ton written on doing migrations. A notable exception is (fellow TLA+ enthusiast) Marianne Bellotti’s book Kill It With Fire, which sits on my bookshelf, and which I really should re-read.

I’ll end this post with some text from the “Remediation and follow-up steps” of the Cloudflare writeup:

We are implementing the following plan as a result of this incident:

Staging Addressing Deployments: Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

Deprecating Legacy Systems: We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.

I’m sure they’ll prioritize this particular migration because of the attention this incident has garnered. But I also bet there are a whole lot more in-flight migrations at Cloudflare, as well as at other companies, that increase complexity by maintaining two systems and that delay the move to the safer thing. What are they actually going to do in order to complete those other migrations more quickly? If it was easy, it would already be done.

Re-reading Technopoly

Technopoly by Neil Postman, published in 1993

Can language models be too big? asked the researchers Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell in their famous Stochastic Parrots paper about LLMs, back in 2021. Technopoly is Neil Postman’s answer to that question, despite being written back in the early nineties.

Postman is best known for his 1985 book Amusing Ourselves to Death, about the impact of television on society. Postman passed away in 2003, one year before Facebook was released, and two years before YouTube. This was probably for the best, as social media and video sharing services like Instagram and TikTok would have horrified him, being the natural evolution of the trends he was writing about in the 1980s.

The rise of LLMs inspired me to recently re-read Technopoly. In what Postman calls technopoly, technological progress becomes the singular value that society pursues. A technopoly treats access to information as an intrinsic good: more is always better. As a consequence, it values removing barriers to the collection and transmission of information; Postman uses the example of the development of the telegraph as a technology that eliminated distance as an information constraint.

The collection and transmission of information is central to Postman’s view of technology: the book focuses entirely on such technologies, including writing, the stethoscope, the telescope, and the computer. He would have been very comfortable with our convention of referring to software-based companies as tech companies. Consider Google’s stated mission: to organize the world’s information and make it universally accessible and useful. This statement makes for an excellent summary of the value system of technopoly. In a technopoly, the solutions to our problems can always be found by collecting and distributing more information.

More broadly, Postman notes that the worldview of technopoly is captured in Frederick Taylor’s principles of scientific management:

  • the primary goal of work is efficiency
  • technical calculation is always to be preferred over human judgment, which is not trustworthy
  • subjectivity is the enemy of clarity of thought
  • what cannot be measured can be safely ignored, because it either does not exist, or it has no value

I was familiar with Taylor’s notion of scientific management before, but it was almost physically painful for me to see its values laid out explicitly like this, because it describes the wall that I so frequently crash into when I try to advocate for a resilience engineering perspective on how to think about incidents and reliability. Apparently, I am an apostate in the Church of Technopoly.

Postman was concerned about the harms that can result from treating more information as an unconditional good. He worried about information for its own sake, divorced of human purpose and stripped of its constraints, context, and history. Facebook ran headlong into the dangers of unconstrained information transmission when its platform was leveraged in Myanmar to promote violence. In her memoir Careless People, former Facebook executive Sarah Wynn-Williams documents how Facebook as an organization was fundamentally unable to deal with the negative consequences of the platform that they had constructed. Wynn-Williams focuses on the moral failures of the executive leadership of Facebook, hence the name of the book. But Postman would also indict technopoly itself, the value system that Facebook was built on, with its claims that disseminating information is always good. In a technopoly, reducing obstacles to information access is always a good thing.

Technopoly as a book is weakest in its critique of social science. Postman identifies social scientists as a class of priests in a technopoly, the experts who worship technopoly’s gods of efficiency, precision, and objectivity. His general view of social science research results is that they are all either obviously true or absurdly false, where the false ones are believed because they come from science. I think Postman falls into the same trap as the late computer scientist Edsger Dijkstra in discounting the value of social science, both in Duncan Watts’s sense of Everything is Obvious: Once You Know the Answer and in the value of good social science protecting us from bad social science. I say this as someone who draws from social science research every day when I examine an incident. Given Postman’s role as a cultural critic, I suspect that there’s some “hey, you’re on my turf!” going on here.

Postman was concerned that technopoly is utterly uninterested in human purpose or a coherent worldview. And he’s right that social science is silent on both matters. But his identification of social scientists as technopoly’s priests hasn’t borne out. Social science certainly has its problems, with the replication crisis in psychology being a glaring example. But that’s a crisis that undermines faith in psychology research, whereas Postman was worried about people putting too much trust in the outcomes of psychology research. I’ll note that the first author of the Stochastic Parrots paper, Emily Bender, is a linguistics professor. In today’s technopoly, there are social scientists who are pushing back on the idea that more information is always better.

Overall, the book stands up well, and is even more relevant today than when it was originally published, thirty-odd years ago. While Postman did not foresee the development of LLMs, he recognized that maximizing the amount of accessible information will not be the benefit to mankind that its proponents claim. That we so rarely hear this position advocated is a testament to his claim that we are, indeed, living in a technopoly.

Component defects: RCA vs RE

Let’s play another round where we contrast the root-cause-analysis (RCA) perspective with the resilience engineering (RE) perspective. Today’s edition is about the distribution of potentially incident-causing defects across the different components in the system. Here, I’m using RCA nomenclature, since the kinds of defects that an RCA advocate would refer to as a “cause” in the wake of an incident would be called a “contributor” by the RE folks.

Here’s a stylized view of the world from the RCA perspective:

RCA view of distribution of potential incident-causing defects in the system

Note that there are a few particularly problematic components: clearly, we should focus on figuring out which of the components are worth spending our reliability efforts on improving!

Now let’s look at the RE perspective:

RE view of distribution of potential incident-contributing defects in the system

It’s a sea of red! The whole system is absolutely shot through with defects that could contribute to an incident!

Under the RE view, the individual defects aren’t sufficient to cause an incident. Instead, it’s an interaction of these defects with other things, including other defects. Because incidents arise due to interactions, RE types will stress the importance of understanding interactions across components over the details of the specific component that happened to contain the defect that contributed to the outage. After all, according to RE folks, those defects are absolutely everywhere. Focusing on one particular component won’t yield significant improvements under this model.

If you want to appreciate the RE perspective, you need to develop an understanding of how it can be that the system is up right now despite the fact that it is absolutely shot through with all of these potentially incident-causing defects, as the RCA folks would call them. As an RE type myself, I believe that your system is up right now, and that it already contains the defect that will be implicated in the next incident. After that incident happens, the tricky part isn’t identifying the defect, it’s appreciating how the defect alone wasn’t enough to bring the system down.

“What went well” is more than just a pat on the back

When I wrote up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it: how the responders brought the overloaded system back to a healthy state.

Which brings me to the topic of this post: the “what went well” section of an incident write-up. Generally, public incident write-ups don’t have such sections. This is almost certainly for rational political reasons: it would be, well, gauche to recount to your angry customers what a great job you did handling the incident. However, internal write-ups often have such sections, and that’s my focus here.

In my experience, “What went well” is typically the shortest section in the entire incident report, with a few brief bullet points that point out some positive aspects of the response (e.g., people responded quickly). It’s a sort of way-to-go!, a way to express some positive feedback to the responders on a job well done. This is understandable, as people believe that if we focus more on what went wrong than what went well, then we are more likely to improve the system, because we are focusing on repairing problems. This is why “what went wrong” and “what can we do to fix it” takes the lion’s share of the attention.

But the problem with this perspective is that it misunderstands the skills that are brought to bear during incident response, and how learning from a previously well-handled incident can actually help other responders do better in future incidents. Effective incident response happens because the responders are skilled. But every incident response team is an ad-hoc one, and just because you happened to have people with the right set of skills responding last time doesn’t mean you’ll have people with the right set of skills next time. This means that if you gloss over what went well, your next incident might be even worse than the last one, because you’ve deprived those future responders of the opportunity to learn from observing the skilled responders last time.

To make this more concrete, let’s look back at the GCP incident report. In this scenario, the engineers had put in a red-button as a safety precaution and exercised it to remediate the incident.

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path… Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. 

However, that’s not the part that interests me so much. Instead, it’s the part about how the infrastructure became overloaded as a consequence of the remediation, and how the responders recovered from overload.

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure…. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.

This was not a failure scenario that they had explicitly designed for in advance of deploying the change: there was no red-button they could simply exercise to roll back the system to a non-overloaded state. Instead, they were forced to improvise a solution based on the controls that were available to them. In this case, they were able to reduce the load by turning down the rate of task creation, as well as by re-routing traffic away from the overloaded database.

And this sort of work is the really interesting bit of an incident: how skilled responders are able to take advantage of generic functionality that is available in order to remediate an unexpected failure mode. This is one of the topics that the field of resilience engineering focuses on: how incident responders are able to leverage generic capabilities during a crunch. If I were an engineer at Google in this org, I would be very interested to learn what knobs are available and how to twist them. Describing this in detail in an incident write-up will increase my chances of being able to leverage this knowledge later. Heck, even just leaving bread crumbs in the doc will help, because I’ll remember the incident, look up the write-up, and follow the links.

Another enormously useful “what went well” aspect that often gets short shrift is a description of the diagnostic work: how the responders figured out what was going on. This never shows up in public incident write-ups, because the information is too proprietary, so I don’t blame Google for not writing about how the responders determined the source of the overload. But all too often these details are left out of the internal write-ups as well. This sort of diagnostic work draws on a crucial set of skills for incident response, and having the opportunity to read about how experts applied their skills to solve this problem helps transfer those skills across the organization.

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong. While every incident is different, the generic skills are common, and so getting better at response will get you more mileage than preventing repeats of previous incidents. You’re going to keep having incidents over and over. The best way to get better at incident handling is to handle more incidents yourself. The second best way is to watch experts handle incidents. The better you do at telling the stories of how your incidents were handled, the more people will learn about how to handle incidents.

Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that my questions shouldn’t be seen as a critique of the write-up, as the answers to them generally aren’t publicly shareable. They’re more in the “I wish I could be a fly on the wall inside of Google” category.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This is the land mine risk: the code you are rolling out contains a land mine which is not tripped during the rollout.

How did the decisions make sense at the time?

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-07-19T19:17:47.843Z

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case, and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change that happened since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Code changes are scary, config changes are scarier, and data changes are the scariest of them all.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-14T19:32:32.669Z

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that requirement.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

From an operational perspective, I remain terrified of databases

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-13T17:21:16.810Z

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal mode, and it’s generally very difficult to test for how it will behave when it’s in recovery mode.

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always have of corrective actions are:

  • What are the tradeoffs involved in implementing these corrective actions?
  • How might they enable new failure modes or make future incidents more difficult to deal with?

AI at Amazon: a case study of brittleness

A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, a term they use for a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

  1. Decompensation: exhausting capacity to adapt as challenges cascade
  2. Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
  3. Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

It would take weeks to get access to any internal data for analysis or experiments
Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands that are placed upon it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.

As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.

Working at cross-purposes

Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.

This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.

The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.

My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.

Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s an alignment risk: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.

In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.

Getting stuck in outdated behaviors

Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.

That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.

So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.

This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.

I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.

In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.


I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.

Pattern machines that we don’t understand

How do experts make decisions? One theory is that they generate a set of options, estimate the cost and benefits of each option, and then choose the optimal one. The psychology researcher Gary Klein developed a very different theory of expert decision-making, based on his studies of expert decision-making in domains such as firefighting, nuclear power plant operations, aviation, anesthesiology, nursing, and the military. Under Klein’s theory of naturalistic decision-making, experts use a pattern-matching approach to make decisions.

Even before Klein’s work, humans were already known to be quite good at pattern recognition. We’re so good at spotting faces that we have a tendency to see faces in things that aren’t actually faces, a phenomenon known as pareidolia.

(Wout Mager/Flickr/CC BY-NC-SA 2.0)

As far as I’m aware, Klein used the humans-as-black-boxes research approach of observing and talking to the domain experts: while he was metaphorically trying to peer inside their heads, he wasn’t doing any direct measurement or modeling of their brains. But if you are inclined to take a neurophysiological view of human cognition, you can see how the architecture of the brain provides a mechanism for doing pattern recognition. We know that the brain is organized as an enormous network of neurons, which communicate with each other through electrical impulses.

The psychology researcher Frank Rosenblatt is generally credited with being the first researcher to do computer simulations of a model of neural networks, in order to study how the brain works. He called his model a perceptron. In his paper The Perceptron: a probabilistic model for information storage and organization in the brain, he noted pattern recognition as one of the capabilities of the perceptron.

While perceptrons may have started out as a model for psychology research, they became one of a competing set of strategies for building artificial intelligence systems. The perceptron approach to AI was dealt a significant blow by the AI researchers Marvin Minsky and Seymour Papert in 1969 with the publication of their book Perceptrons. Minsky and Papert demonstrated that there were certain cognitive tasks that perceptrons were not capable of performing.

However, Minsky and Papert’s critique applied only to single-layer perceptron networks. It turns out that if you create a network out of multiple layers, and you add non-linear processing elements to the layers, then these limits to the capabilities of a perceptron no longer apply. When I took a graduate-level artificial neural networks course back in the mid-2000s, the networks we worked with had on the order of three layers. Modern LLMs have a lot more layers than that: the deep in deep learning refers to the large number of layers. For example, the largest GPT-3 model (from OpenAI) has 96 layers, the larger DeepSeek-LLM model (from DeepSeek) has 95 layers, and the largest Llama 3.1 model (from Meta) has 126 layers.
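As a tiny illustration of why the multi-layer, non-linear caveat matters: a single-layer perceptron can’t compute XOR (the classic example associated with Minsky and Papert’s critique), but a two-layer network with a non-linear step function can. Here’s a hand-wired sketch in Python, with weights I picked by hand rather than learned:

def step(x: float) -> int:
  return 1 if x > 0 else 0       # the non-linear processing element

def xor(x1: int, x2: int) -> int:
  h1 = step(x1 + x2 - 0.5)       # hidden unit: fires if at least one input is 1 (OR)
  h2 = step(1.5 - x1 - x2)       # hidden unit: fires unless both inputs are 1 (NAND)
  return step(h1 + h2 - 1.5)     # output unit: fires only if both hidden units fire (AND)

[xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)]   # [0, 1, 1, 0]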

Here’s a ridiculously oversimplified conceptual block diagram of a modern LLM.

There’s an initial stage which takes text and turns it into a sequence of vectors. Then, that sequence of vectors gets passed through the layers in the middle. Finally, you get your answer out at the end. (Note: I’m deliberately omitting discussion of what actually happens in the stages depicted by the oval and the diamond above, because I want to focus on the layers in the middle for this post. I’m not going to talk at all about concepts like tokens, embedding, attention blocks, and so on. If you’re interested in those sorts of details, I highly recommend the video But what is a GPT? Visual intro to Transformers by Grant Sanderson).

We can imagine the LLM as a system that recognizes patterns at different levels of abstraction. The first and last layers deal directly with representations of words, so they have to operate at the word level of abstraction; let’s think of that as the lowest level. As we go deeper into the network, we can imagine each layer as dealing with patterns at a higher level of abstraction, which we could call concepts. Since the last layer deals with words again, layers towards the end would be back at a lower level of abstraction.


But, really, this talk of encoding patterns at increasing and decreasing levels of abstraction is pure speculation on my part; there’s no empirical basis for it. In reality, we have no idea what sorts of patterns are encoded in the middle layers. Do they correspond to what we humans think of as concepts? We simply have no idea how to interpret the meaning of the vectors that are generated by the intermediate layers. Are the middle layers “higher level” than the outer layers in the sense that we understand that term? Who knows? We just know that we get good results.


The things we call models have different kinds of applications. We tend to think first of scientific models, which are models that give scientists insight into how the world works. Scientific models are a type of model, but not the only one. There are also engineering models, whose purpose is to accomplish some sort of task. A good example of an engineering model is a weather prediction model that tells us what the weather will be like this week. Another good example of an engineering model is SPICE, which electrical engineers use to simulate electronic circuits.

Perceptrons started out as a scientific model of the brain, but their real success has been as an engineering model. Modern LLMs contain within them feedforward neural networks, which are the intellectual descendants of Rosenblatt’s perceptrons. Some people even refer to these as multilayer perceptrons. But LLMs are not engineering models designed to achieve a specific task the way weather models or circuit models are. Instead, they are models that were designed to predict the next word in a sentence, and it just so happens that if you build and train your model the right way, you can use it to perform cognitive tasks that it was not explicitly designed to do! Or, as Sean Goedecke put it in a recent blog post (emphasis mine):

Transformers work because (as it turns out) the structure of human language contains a functional model of the world. If you train a system to predict the next word in a sentence, you therefore get a system that “understands” how the world works at a surprisingly high level. All kinds of exciting capabilities fall out of that – long-term planning, human-like conversation, tool use, programming, and so on.

This is a deeply weird and surprising outcome of building a text prediction system. We’ve built text prediction systems before: Claude Shannon was writing about probability-based models of natural language back in the 1940s, in the famous paper that gave birth to the field of information theory. But it’s not obvious that once these models get big enough, you’d get results like we’re getting today, where you can ask the model questions and get answers. At least, it’s not obvious to me.
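For a sense of what those early probability-based models looked like, here’s a minimal sketch of a Shannon-style bigram model: count which word follows which in a corpus, then predict the next word by sampling from those counts. The corpus and code are my own toy illustration, not Shannon’s actual construction.

    import random
    from collections import Counter, defaultdict

    # Minimal sketch of a bigram next-word predictor: count which word
    # follows which in a corpus, then sample the next word from those
    # counts. The corpus here is a toy example, purely for illustration.

    corpus = "the cat sat on the mat and the cat slept on the mat".split()

    follows = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        follows[current][nxt] += 1

    def predict_next(word):
        counts = follows[word]
        words, weights = zip(*counts.items())
        return random.choices(words, weights=weights)[0]

    print(predict_next("the"))  # "cat" or "mat", weighted by observed frequency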

In 2020, the linguistics researchers Emily Bender and Alexander Koller published a paper titled Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. This is sometimes known as the octopus paper, because it contains a thought experiment about a hyper-intelligent octopus eavesdropping on a conversation between two English speakers by tapping into an undersea telecommunications cable, and how the octopus could never learn the meaning of English phrases through mere exposure. This seems to contradict Goedecke’s observation. They also note how research has demonstrated that humans are not capable of learning a new language through mere exposure to it (e.g., through TV or radio). But I think the primary thing this illustrates is how fundamentally different LLMs are from human brains, and how little we can learn about LLMs by making comparisons to humans. The architecture of an LLM is radically different from the architecture of a human brain, and the learning processes are also radically different. I don’t think a human could learn the structure of a new language by being exposed to a massive corpus and then trying to predict the next word. Our intuitions, which work well when dealing with humans, simply break down when we try to apply them to LLMs.


The late philosopher of mind Daniel Dennett proposed the concept of the intentional stance: a perspective we take when predicting the behavior of things we consider to be rational agents. To illustrate it, let’s contrast it with two other stances he describes, the physical stance and the design stance. Consider the following three scenarios, in each of which you’re asked to make a prediction.

Scenario 1: Imagine that a child has rolled a ball up a long ramp that sits at a 30-degree incline. I tell you that the ball is currently rolling up the ramp at 10 metres per second and ask you to predict what its speed will be one minute from now.

A ball that has been rolled up a ramp

Scenario 2: Imagine a car driving up a hill at a 10-degree incline. I tell you that the car is currently moving at 60 km/h, and that the driver has cruise control enabled, also set to 60 km/h. I ask you to predict the speed of the car one minute from now.

A car with cruise control enabled, driving uphill

Scenario 3: Imagine another car on a flat road, going at 50 km/h, that is about to enter an intersection just as the traffic light turns yellow. Another bit of information I give you: the driver is heading to an important job interview and is running late. Again, I ask you to predict the speed of the car one minute from now.

In the first scenario (ball rolling up a ramp), we can predict the ball’s future speed by treating it as a physics problem. This is what Dennett calls the physical stance.
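Under the physical stance, the prediction is just kinematics. Here’s a back-of-the-envelope sketch in Python (my own simplification: treating the ball as a point mass and ignoring rolling inertia and friction), which shows the ball stopping within a couple of seconds and then rolling back down:

    import math

    # Back-of-the-envelope physical-stance prediction for scenario 1,
    # treating the ball as a point mass and ignoring rolling inertia
    # and friction: it decelerates at g * sin(incline) while moving up.

    g = 9.8                      # m/s^2
    incline = math.radians(30)
    v0 = 10.0                    # m/s, initial speed up the ramp

    deceleration = g * math.sin(incline)  # about 4.9 m/s^2
    time_to_stop = v0 / deceleration      # about 2 seconds

    print(f"decelerates at {deceleration:.1f} m/s^2, stops after {time_to_stop:.1f} s,")
    print("then rolls back down the ramp, picking up speed again")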

In the second scenario (car with cruise control enabled), we view the car as an artifact that was designed to maintain its speed when cruise control is enabled. We can easily predict that its future speed will be 60 km/h. This is what Dennett calls the design stance. Here, we are using our knowledge that the car has been designed to behave in certain ways in order to predict how it will behave.

In the third scenario (driver running late who encounters a yellow light), we think about the intentions of the driver: they don’t want to be late for their interview, so we predict that they will accelerate through the intersection, and that their future speed will be somewhere around 60 km/h. This is what Dennett calls the intentional stance. Here, we are using our knowledge of the desires and beliefs of the driver to predict what actions they will take.

Now, because LLMs have been designed to replicate human language, our instinct is to apply the intentional stance to predict their behavior. It’s a kind of pareidolia: we’re seeing intentionality in a system that mimics human language output. Dennett was horrified by this.

But the design stance doesn’t really help us with LLMs either. Yes, the design stance enables us to predict that an LLM-based chatbot will generate plausible-sounding answers to our questions, because that is what it was designed to do. But beyond that, we can’t really reason about its behavior.

Generally, operational surprises are useful in teaching us how our system works by letting us observe circumstances in which it is pushed beyond its limits. For example, we might learn about a hidden limit somewhere in the system that we didn’t know about before. This is one of the advantages of doing incident reviews, and it’s also one of the reasons that psychologists study optical illusions. As Herb Simon put it in The Sciences of the Artificial: “Only when [a bridge] has been overloaded do we learn the physical properties of the materials from which it is built.”

However, when an LLM fails from our point of view by producing a plausible but incorrect answer to a question, this failure mode doesn’t give us any additional insight into how the LLM actually works. Because, in a real sense, that LLM is still successfully performing the task it was designed to do: generate plausible-sounding answers. We aren’t capable of designing LLMs that produce only correct answers; we can only design ones that produce plausible answers. And so we learn nothing from what we consider LLM failures, because the LLMs aren’t actually failing. They are doing exactly what they are designed to do.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability has significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for the system you thought of whose reliability increased over time, think about what its complexity trend looks like over that same period. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. For someone who grew up in the 80s and 90s, as I did, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one most people have in mind when they think about increasing complexity in their systems, is synonymous with the idea of tech debt. With tech debt, the increase in complexity makes the system less reliable, because the risk that any given change breaks the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Dijkstra claims simplicity is a prerequisite for reliability. According to him, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability but come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems, like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It?, and think about the work we do to make our software systems more reliable: retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and the auxiliary systems we run to support this reliability work, like an observability stack. All of this stuff adds complexity.
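To make that concrete, here’s a minimal sketch of my own (not taken from any of those books): a single remote call, first as the bare happy path, then wrapped with just two of those mechanisms, a timeout and retries with backoff. The endpoint URL is made up; the point is only how much code the reliability machinery adds.

    import random
    import time

    import requests  # real HTTP library; the endpoint below is made up

    # The "simple" version: one line of happy path, no reliability machinery.
    def fetch_profile(user_id):
        return requests.get(f"https://example.com/users/{user_id}").json()

    # The same call wrapped with just two mechanisms from the list above:
    # a timeout and retries with exponential backoff plus jitter. Every
    # added line exists to make the call more reliable, and every added
    # line is also more code to reason about.
    def fetch_profile_reliably(user_id, attempts=3, timeout_s=2.0):
        for attempt in range(attempts):
            try:
                resp = requests.get(
                    f"https://example.com/users/{user_id}", timeout=timeout_s
                )
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise  # out of retries; surface the error to the caller
                # Exponential backoff with jitter before the next attempt.
                time.sleep((2 ** attempt) + random.random())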

Imagine if I took a working codebase and proposed deleting all of the lines of code involved in error handling. I’m very confident that this deletion would make the codebase simpler. There’s a reason programming books tend to leave error handling out of their examples: it really does add complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.

Let’s look at the natural world, where biology provides endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems that can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother Nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through that simplifying work. But successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful, reliable systems will inevitably get more complex. Our job isn’t to reduce that complexity; it’s to get better at dealing with it.