The trap of tech that’s great in the small but not in the large

There are software technologies that work really well in-the-small, but they don’t scale up well. The challenge here is that the problem size grows incrementally and migrating off of them requires significant effort, so locally it makes sense to keep using them, but then you reach a point where you’re well into the size where they are a liability rather than an asset. Here are some examples.

Shell scripts

Shell scripts are fantastic in the small: throughout my career, I’ve written hundreds and hundreds of bash scripts that are twenty lines or less, typically closer to ten, and frequently less than five lines. But, as soon as I need to write an if statement, that’s a sign to me that I should probably write it in something like Python instead. Fortunately, I’ve rarely encountered large shell scripts in the wild these days, with DevStack being a notable exception.

Makefiles

I love using makefiles as simple task runners. In fact, I regularly use just, which is like an even simpler version of make, and has similar syntax. And I’ve seen makefiles used to good effect for building simple Go programs.

But there’s a reason technologies like Maven, Gradle, and Bazel emerged, and it’s because large-scale makefiles are an absolute nightmare. Someone even wrote a paper called Recursive Make Considered Harmful.

YAML

I’m not a YAML hater; I actually like it for configuration files that are reasonably sized, where “reasonably sized” means something like “30 lines or fewer”. I appreciate its support for things like comments and not having to quote strings.

However, given how much of software operations runs on YAML these days, I’ve been burned too many times by having to edit very large YAML files. What’s human-readable in the small isn’t human-readable in the large.

Spreadsheets

The business world runs on spreadsheets: they are the biggest end-user programming success story in human history. Unfortunately, spreadsheets sometimes evolve into being de facto databases, which is terrifying. The leap required to move from using a spreadsheet as your system of record to using a real database is huge, which explains why spreadsheets so often end up stuck in that role.

Markdown

I’m a big fan of Markdown, but I’ve never tried to write an entire book with it. I’m going to outsource this example to Hillel Wayne: see his post Why I prefer rST to markdown: I will never stop dying on this hill.

Formal specs as sets of behaviors

Amazon’s recent announcement of their spec-driven AI tool, Kiro, inspired me to write a blog post on a completely unrelated topic: formal specifications. In particular, I wanted to write about how a formal specification is different from a traditional program. It took a while for this idea to really click in my own head, and I wanted to motivate some intuition here.

In particular, there have been a number of formal specification tools that have been developed in recent years which use programming-language-like notation, such as FizzBee, P, PlusCal, and Quint. I think these notations are more approachable for programmers than the more set-theoretic notation of TLA+. But I think the existence of programming-language-like formal specification languages makes it even more important to drive home the difference between a program and a formal spec.

The summary of this post is: a program is a list of instructions, a formal specification is a set of behaviors. But that’s not very informative on its own. Let’s get into it.

What kind of software do we want to specify?

Generally speaking, we can divide the world of software into two types of programs. There is one type where you give the program a single input, it produces a single output, and then it stops. The other type runs for an extended period of time and interacts with the world by receiving inputs and generating outputs over time. In a paper published in the mid-1980s, the computer scientists David Harel (developer of statecharts) and Amir Pnueli (the first person to apply temporal logic to software specifications) made a distinction between programs they called transformational (the first kind) and programs they called reactive (the second kind).

Source: On the Development of Reactive Systems by Harel and Pnueli

A compiler is an example of a transformational program, and you can think of many command-line tools as falling into this category. An example of the second type is the flight control software in an airplane, which runs continuously, taking in inputs and generating outputs over time. In my world, services are a great example of reactive systems: they’re long-running programs that receive requests as inputs and generate responses as outputs. The specifications that I’m talking about here apply to the more general reactive case.
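To make the distinction concrete, here’s a minimal sketch of my own (not from the Harel and Pnueli paper) contrasting the two styles in Python:

# Transformational: one input in, one output out, then the program stops.
def word_count(text: str) -> int:
  return len(text.split())

# Reactive: runs indefinitely, consuming inputs and producing outputs over time.
def serve(requests):
  for request in requests:          # inputs arrive over time
    yield f"handled {request}"      # outputs are generated over time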

A motivating example: a counter

Let’s consider the humble counter as an example of a system whose behavior we want to specify. I’ll describe what operations I want my counter to support using Python syntax:

class Counter:
  def inc(self) -> None:
    ...
  def get(self) -> int:
    ...
  def reset(self) -> None:
    ...

My example will be sequential to keep things simple, but all of the concepts apply to specifying concurrent and distributed systems as well. Note that implementing a distributed counter is a common system design interview problem.
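For concreteness, here’s one possible implementation that matches those signatures (a minimal sketch of my own; any implementation with the same observable behavior would do just as well):

class Counter:
  def __init__(self) -> None:
    self._count = 0        # the counter starts at zero

  def inc(self) -> None:
    self._count += 1       # increment by one

  def get(self) -> int:
    return self._count     # report the current count

  def reset(self) -> None:
    self._count = 0        # back to zero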

Behaviors

I implemented this counter and interacted with it in the Python REPL; here’s what that looked like:

>>> c = Counter()
>>> c.inc()
>>> c.inc()
>>> c.inc()
>>> c.get()
3
>>> c.reset()
>>> c.inc()
>>> c.get()
1

People sometimes refer to the sort of thing above by various names: a session, an execution, an execution history, an execution trace. The formal methods people refer to this sort of thing as a behavior, and that’s the term that we’ll use in the rest of this post. Specifications are all about behaviors.

Sometimes I’m going to draw behaviors in this post. I’m going to denote a behavior as a squiggle.

To tie this back to the discussion about reactive systems, you can think of method invocation as inputs, and return values as outputs. The above example is a correct behavior for our counter. But a behavior doesn’t have to be correct: a behavior is just an arbitrary sequence of inputs and outputs. Here’s an example of an incorrect behavior for our counter.

>>> c = Counter()
>>> c.inc()
>>> c.get()
4

We expected the get method to return 1, but instead it returned 4. If we saw that behavior, we’d say “there’s a bug somewhere!”

Specifications and behaviors

What we want out of a formal specification is a device that can answer the question: “here’s a behavior: is it correct or not?”. That’s what a formal spec is for a reactive system. A formal specification is an entity such that, given a behavior, we can determine whether the behavior satisfies the spec. Correct = satisfies the specification.

Once again, a spec is a thing that will tell us whether or not a given behavior is correct.

A spec as a set of behaviors

I depicted a spec in the diagram above as, literally, a black box. Let’s open that box. We can think of a specification simply as a set that contains all of the correct behaviors. Now, the “correct?” processor above is just a set membership check: all it does is check whether the behavior is an element of the set spec.
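In code terms, it’s something like this deliberately tiny sketch, where I’m representing a behavior as a tuple of (operation, result) events and standing in a small finite set for the real (infinite) one:

# Pretend, for a moment, that the spec is a small, finite set of behaviors.
spec = {
  (("get", 0),),
  (("inc", None), ("get", 1)),
}

def is_correct(behavior) -> bool:
  return behavior in spec   # correctness is just a set membership check

is_correct((("inc", None), ("get", 1)))   # True
is_correct((("inc", None), ("get", 4)))   # False: not in the spec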

What could be simpler?

Note that this isn’t a simplification: this is what a formal specification is in a system like TLA+. It’s just a set of behaviors: nothing more, nothing less.

Describing a set of behaviors

You’re undoubtedly familiar with sets. For example, here’s a set of the first three positive natural numbers: {1, 2, 3}. Here, we described the set by explicitly enumerating each of the elements.

While the idea of a spec being a set of behaviors is simple, actually describing that set is trickier. That’s because we can’t explicitly enumerate the elements of the set like we did above. For one thing, each behavior is, in general, of infinite length. Taking the example of our counter, one valid behavior is to just keep calling any operation over and over again, ad infinitum.

>>> c = Counter()
>>> c.get()
0
>>> c.get()
0
>>> c.get()
0
... (forever)

A behavior of infinite length

This is a correct behavior for our counter, but we can’t write it out explicitly, because it goes on forever.

The other problem is that the specs that we care about typically contain an infinite number of behaviors. If we take the case of a counter, for any finite correct behavior, we can always generate a new correct behavior by adding another inc, get, or reset call.

So, even if we restrict ourselves to behaviors of finite length, as long as we don’t bound the total length of a behavior (i.e., if our behaviors are finite but unbounded, like natural numbers), we cannot define a spec by explicitly enumerating all of the behaviors in the specification.

And this is where formal specification languages come in: they allow us to define infinite sets of behaviors without having to explicitly enumerate every correct behavior.

Describing infinite sets by generating them

Mathematicians deal with infinite sets all of the time. For example, we can use set-builder notation to describe the infinitely large set of all even natural numbers without explicitly enumerating each one:

{2k | k ∈ ℕ}

The example above references another infinite set, the set of natural numbers (ℕ). How do we generate that infinite set without reference to another one?

One way is to define the set of natural numbers by describing how to generate it. To do this, we specify:

  1. an initial natural number (either 0 or 1, depending on who you ask)
  2. a successor function for how to generate a new natural number from an existing one

This allows us to describe the set of natural numbers without having to enumerate each one explicitly. Instead, we describe how to generate them. If you remember your proofs by induction from back in math class, this is like defining a set by induction.
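As a sketch of that generative definition in Python (taking 0 as the initial element and +1 as the successor function):

def naturals():
  n = 0          # 1. the initial natural number
  while True:
    yield n
    n = n + 1    # 2. the successor function: a new natural from an existing one

# This describes every natural number without enumerating them up front.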

Specifications as generating a set of behaviors

A formal specification language is just a notation for describing a set of behaviors by generating them. In TLA+, this is extremely explicit. All TLA+ specs have two parts:

  • Init – which describes all valid initial states
  • Next – which describes how to extend an existing valid behavior to one or more new valid behavior(s)

Here’s a visual representation of generating correct behaviors for the counter.

Generating all correct behaviors for our counter

Note how, in the case of the counter, there’s only one valid initial state in a behavior: all of the correct behaviors start the same way. After that, when generating a new behavior based on a previous one, whether one behavior or multiple behaviors can be generated depends on the history. If the last event was a method invocation, then there’s only one valid way to extend that behavior: the expected response to that invocation. If the last event was a method return, then you can extend the behavior in three different ways, based on the three different methods you can call on the counter.

The (Init, Next) pair describes all of the possible correct behaviors of the counter by generating them.
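Here’s a rough sketch of that generative structure in Python rather than TLA+. I’m representing a behavior as a tuple of (operation, result) events, collapsing each invocation and its return into a single event to keep things short; this is my own illustration of the idea, not an actual TLA+ spec:

def init():
  # There is exactly one valid way for a behavior to start: a fresh counter.
  return [()]

def next_behaviors(behavior):
  # All the valid ways to extend a correct behavior by one more operation.
  count = 0
  for op, _ in behavior:    # replay the history to find the current count
    if op == "inc":
      count += 1
    elif op == "reset":
      count = 0
  return [
    behavior + (("inc", None),),
    behavior + (("reset", None),),
    behavior + (("get", count),),   # get must report the current count
  ]

# Starting from init() and repeatedly applying next_behaviors generates,
# in the limit, the entire (infinite) set of correct behaviors: the spec.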

Nondeterminism

One area where formal methods can get confusing for newcomers is that the notation for writing the behavior generator can look like a programming language, particularly when it comes to nondeterminism.

When you’re writing a formal specification, you want to express “here are all of the different ways that you can validly extend this behavior”, hence you get that branching behavior in the diagram in the previous section: you’re generating all of the possible correct behaviors. In a formal specification, when we talk about “nondeterminism”, we mean “there are multiple ways a correct behavior can be extended”, and that includes all of the different potential inputs that we might receive from outside. In formal specifications, nondeterminism is about extending a correct behavior along multiple paths.

On the other hand, in a computer program, when we talk about code being nondeterministic, we mean “we don’t know which path the code is going to take”. In the programming world, we typically use nondeterminism to refer to things like random number generation or race conditions. One notable area where they’re different is that formal specifications treat inputs as a source of nondeterminism, whereas programmers don’t include inputs when they talk about nondeterminism. If you said “user input is one of the sources of nondeterminism”, a formal modeler would nod their head, and a programmer would look at you strangely.

Properties of a spec: sets of behaviors

I’ve been using the expressions correct behavior and behavior satisfies the specification interchangeably. However, in practice, we build formal specifications to help us reason about the correctness of the system we’re trying to build. Just because we’ve written a formal specification doesn’t mean that the specification is actually correct! That means that we can’t treat the formal specification that we build as the correct description of the system in general.

The most frequent tactic people use to reason about their formal specifications is to define correctness properties and use a model-checking tool to check whether their specification conforms to the property or not.

Here’s an example of a property for our counter: the get operation always returns a non-negative value. Let’s give it a name: the no-negative-gets property. If our specification has this property, we don’t know for certain it’s correct. But if it doesn’t have this property, we know for sure something is wrong!

Like a formal specification, a property is nothing more than a set of behaviors! Here’s an example of a behavior that satisfies the no-negative-gets property:

>>> c = Counter()
>>> c.get()
0
>>> c.inc()
>>> c.get()
1

And here’s another one:

>>> c = Counter()
>>> c.get()
5
>>> c.inc()
>>> c.get()
3

Note that the second behavior probably looks wrong to you. We haven’t actually written out a specification for our counter in this post, but if we had, the behavior above would certainly violate it: that’s not how counters work. On the other hand, it still satisfies the no-negative-gets property. In practice, the set of behaviors defined by a property will include behaviors that aren’t in the specification, as depicted below.

A spec that satisfies a property.

When we check that a spec satisfies a property, we’re checking that Spec is a subset of Property. We just don’t care about the behaviors that are in the Property set but not in the Spec set. What we do care about are behaviors that are in Spec but not in Property: those tell us that our specification can generate behaviors that do not possess the property that we care about.
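In set terms, a model checker is doing something like the following sketch (again pretending, for illustration, that both sets are finite and explicit, which they aren’t in practice):

def check(spec: set, prop: set):
  counterexamples = spec - prop    # behaviors the spec generates but the property forbids
  if counterexamples:
    return ("property violated", counterexamples)
  return ("property holds", None)  # i.e., spec is a subset of prop

# Behaviors that are in prop but not in spec never even enter into the check.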

A spec that does not satisfy a property

Consider the property: get always returns a positive number. We can call it all-positive-gets. Note that zero is not considered a positive number. Assuming our counter specification starts at zero, here’s a behavior that violates the all-positive-gets property:

>>> c = Counter()
>>> c.get()
0

Thinking in sets

When writing formal specifications, I found that thinking in terms of sets of behaviors was a subtle but significant mind-shift from thinking in terms of writing traditional programs. Where it helped me most is in making sense of the errors I get when debugging my TLA+ specifications using the TLC model checker. After all, it’s when things break that you really need to understand what’s going on under the hood. And I promise you, when you write formal specs, things are going to break. That’s why we write them: to find where the breaks are.

Cloudflare and the infinite sadness of migrations

(With apologies to The Smashing Pumpkins)

A few weeks ago, Cloudflare experienced a major outage of their popular 1.1.1.1 public DNS resolver.

On July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.

Cloudflare (@cloudflare.social) 2025-07-16T03:45:10.209Z

Technically, the DNS resolver itself was working just fine: it was (as far as I’m aware) up and running the whole time. The problem was that nobody on the Internet could actually reach it. The Cloudflare public write-up is quite detailed, and I’m not going to summarize it here. I do want to bring up one aspect of their incident, because it’s something I worry about a lot from a reliability perspective: migrations.

Cloudflare’s migration

When this incident struck, Cloudflare supported two different ways of managing what they call service topologies. There was a newer system that supported progressive rollout, and an older system where the changes occurred globally. The Cloudflare incident involved the legacy system, which makes global changes, which is why the blast radius of this incident was so large.

Source: https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/

Cloudflare engineers were clearly aware that these sorts of global changes are dangerous. After all, I’m sure that’s one of the reasons why they built their new system in the first place. But migrating all of the way to the new thing takes time.

Migrations and why I worry about them

If you’ve ever worked at any sort of company that isn’t a startup, you’ve had to deal with a migration. Sometimes a migration impacts only a single team that owns the system in question, but often migrations are changes that are large in scope (typically touching many teams) which, while providing new capabilities to the organization as a whole, don’t provide much short-term benefit to the teams who have to make a change to accommodate the migration.

A migration is a kind of change that, almost by definition, the system wasn’t originally designed to accommodate. We build our systems to support making certain types of future changes, and migrations are exactly not these kinds of changes. Each migration is typically a one-off type of change. While you’ll see many migrations if you work at a more mature tech company, each one will be different enough that you won’t be able to leverage common tooling from one migration to help make the next one easier.

All of this adds up to reliability risk. While a migration-related change wasn’t a factor in the Cloudflare incident, I believe that such changes are inherently risky, because you’re making a one-off change to the way that your system works. Developers generally have a sense that these sorts of changes are risky. As a consequence, for an individual on a team who has to do work to support somebody else’s migration, all of the incentives push them towards dragging their feet: making the migration-related change takes time away from their normal work, and increases the risk they break something. On the other hand, completing the migration generally doesn’t provide them short-term benefit. The costs typically outweigh the benefits. And so all of the forces push towards migrations taking a long time.

But a delay in implementing a migration is also a reliability risk, since migrations are often used to improve the reliability of the system. The Cloudflare incident is a perfect example of this: the newer system was safer than the old one, because it supported staged rollout. And while they ran the new system, they had to run the old one as well.

Why run one system when you can run two?

The scariest type of migration to me is the big bang migration, where you cut over all at once from the old system to the new one. Sometimes you have no choice, but it’s an approach that I personally would avoid whenever possible. The alternative is to do incremental migration, migrating parts of the system over time. To do incremental migration, you need to run the old system and the new system concurrently, until you’ve completely finished the migration and can shut the old system down. When I worked at Netflix, people used the term Roman riding to refer to running the old and new system in parallel, in reference to a style of horseback riding.

What actual Roman riding looks like

The problem with Roman riding is that it’s risky as well. While incremental is safer than big bang, running two systems concurrently increases the complexity of the system. There are many, many opportunities for incidents while you’re in the midst of a migration running the two systems in parallel.

What is to be done?

I wish I had a simple answer here. But my unsatisfying one is that engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores. I frequently joke that platform engineering should really be called migration engineering, because any org large enough to do platform engineering is going to be spending a lot of its cycles doing migrations.

Migrations are also unglamorous work: nobody’s clamoring for the title of migration engineer. People want to work on greenfield projects, not deal with the toil of a one-off effort to move the legacy thing onto the new thing. There’s also not a ton written on doing migrations. A notable exception is (fellow TLA+ enthusiast) Marianne Bellotti’s book Kill It With Fire, which sits on my bookshelf, and which I really should re-read.

I’ll end this post with some text from the “Remediation and follow-up steps” of the Cloudflare writeup:

We are implementing the following plan as a result of this incident:

Staging Addressing Deployments: Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

Deprecating Legacy Systems: We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.

I’m sure they’ll prioritize this particular migration because of the attention this incident has garnered. But I also bet there are a whole lot more in-flight migrations at Cloudflare, as well as at other companies, that increase complexity by maintaining two systems and that delay the move to the safer thing. What are they actually going to do in order to complete those other migrations more quickly? If it was easy, it would already be done.

Re-reading Technopoly

Technopoly by Neil Postman, published in 1993

Can language models be too big? asked the researchers Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell in their famous Stochastic Parrots paper about LLMs, back in 2021. Technopoly is Neil Postman’s answer to that question, despite being written back in the early nineties.

Postman is best known for his 1985 book Amusing Ourselves to Death, about the impact of television on society. Postman passed away in 2003, one year before Facebook was released, and two years before YouTube. This was probably for the best, as social media and video sharing services like Instagram and TikTok would have horrified him, being the natural evolution of the trends he was writing about in the 1980s.

The rise of LLMs inspired me to recently re-read Technopoly. In what Postman calls technopoly, technological progress becomes the singular value that society pursues. A technopoly treats access to information as an intrinsic good: more is always better. As a consequence, it values removing barriers to the collection and transmission of information; Postman uses the example of the development of the telegraph as a technology that eliminated distance as an information constraint.

The collection and transmission of information is central to Postman’s view of technology: the book focuses entirely on such technologies, including writing, the stethoscope, the telescope, and the computer. He would have been very comfortable with our convention of referring to software-based companies as tech companies. Consider Google’s stated mission: to organize the world’s information and make it universally accessible and useful. This statement makes for an excellent summary of the value system of technopoly. In a technopoly, the solutions to our problems can always be found by collecting and distributing more information.

More broadly, Postman notes that the worldview of technopoly is captured in Frederick Taylor’s principles of scientific management:

  • the primary goal of work is efficiency
  • technical calculation is always to be preferred over human judgment, which is not trustworthy
  • subjectivity is the enemy of clarity of thought
  • what cannot be measured can be safely ignored, because it either does not exist, or it has no value

I was familiar with Taylor’s notion of scientific management before, but it was almost physically painful for me to see its values laid out explicitly like this, because it describes the wall that I so frequently crash into when I try to advocate for a resilience engineering perspective on how to think about incidents and reliability. Apparently, I am an apostate in the Church of Technopoly.

Postman was concerned about the harms that can result from treating more information as an unconditional good. He worried about information for its own sake, divorced of human purpose and stripped of its constraints, context, and history. Facebook ran headlong into the dangers of unconstrained information transmission when its platform was leveraged in Myanmar to promote violence. In her memoir Careless People, former Facebook executive Sarah Wynn-Williams documents how Facebook as an organization was fundamentally unable to deal with the negative consequences of the platform that they had constructed. Wynn-Williams focuses on the moral failures of the executive leadership of Facebook, hence the name of the book. But Postman would also indict technopoly itself, the value system that Facebook was built on, with its claims that disseminating information is always good. In a technopoly, reducing obstacles to information access is always a good thing.

Technopoly as a book is weakest in its critique of social science. Postman identifies social scientists as a class of priests in a technopoly, the experts who worship technopoly’s gods of efficiency, precision, and objectivity. His general view of social science research results is that they are all either obviously true or absurdly false, where the false ones are believed because they come from science. I think Postman falls into the same trap as the late computer scientist Edsger Dijkstra in discounting the value of social science, both in Duncan Watts’s sense of Everything is Obvious: Once You Know the Answer and in the value of good social science protecting us from bad social science. I say this as someone who draws from social science research every day when I examine an incident. Given Postman’s role as a cultural critic, I suspect that there’s some “hey, you’re on my turf!” going on here.

Postman was concerned that technopoly is utterly uninterested in human purpose or a coherent worldview. And he’s right that social science is silent on both matters. But his identification of social scientists as technopoly’s priests hasn’t borne out. Social science certainly has its problems, with the replication crisis in psychology being a glaring example. But that’s a crisis that undermines faith in psychology research, whereas Postman was worried about people putting too much trust in the outcomes of psychology research. I’ll note that the first author of the Stochastic Parrots paper, Emily Bender, is a linguistics professor. In today’s technopoly, there are social scientists who are pushing back on the idea that more information is always better.

Overall, the book stands up well, and is even more relevant today than when it was originally published, thirty-odd years ago. While Postman did not foresee the development of LLMs, he recognized that maximizing the amount of accessible information will not be the benefit to mankind that its proponents claim. That we so rarely hear this position advocated is a testament to his claim that we are, indeed, living in a technopoly.

Component defects: RCA vs RE

Let’s play another round where we contrast the root-cause-analysis (RCA) perspective with the resilience engineering (RE) perspective. Today’s edition is about the distribution of potentially incident-causing defects across the different components in the system. Here, I’m using RCA nomenclature, since the kinds of defects that an RCA advocate would refer to as a “cause” in the wake of an incident would be called a “contributor” by the RE folks.

Here’s a stylized view of the world from the RCA perspective:

RCA view of distribution of potential incident-causing defects in the system

Note that there are a few particularly problematic components: clearly, we should focus on figuring out which of the components are worth spending our reliability efforts on improving!

Now let’s look at the RE perspective:

RE view of distribution of potential incident-contributing defects in the system

It’s a sea of red! The whole system is absolutely shot through with defects that could contribute to an incident!

Under the RE view, the individual defects aren’t sufficient to cause an incident. Instead, it’s an interaction of these defects with other things, including other defects. Because incidents arise due to interactions, RE types will stress the importance of understanding interactions across components over the details of the specific component that happened to contain the defect that contributed to the outage. After all, according to RE folks, those defects are absolutely everywhere. Focusing on one particular component won’t yield significant improvements under this model.

If you want to appreciate the RE perspective, you need to develop an understanding of how it can be that the system is up right now despite the fact that it is absolutely shot through with all of these potentially incident-causing defects, as the RCA folks would call them. As an RE type myself, I believe that your system is up right now, and that it already contains the defect that will be implicated in the next incident. After that incident happens, the tricky part isn’t identifying the defect, it’s appreciating how the defect alone wasn’t enough to bring the system down.

“What went well” is more than just a pat on the back

When I wrote up my impressions of the GCP incident report, Cindy Sridharan’s tweet reminded me that I failed to comment on an important part of it: how the responders brought the overloaded system back to a healthy state.

Which brings me to the topic of this post: the “what went well” section of an incident write-up. Generally, public incident write-ups don’t have such sections. This is almost certainly for rational political reasons: it would be, well, gauche to recount to your angry customers what a great job you did handling the incident. However, internal write-ups often have such sections, and that’s my focus here.

In my experience, “What went well” is typically the shortest section in the entire incident report, with a few brief bullet points that point out some positive aspects of the response (e.g., people responded quickly). It’s a sort of way-to-go!, a way to express some positive feedback to the responders on a job well done. This is understandable, as people believe that if we focus more on what went wrong than what went well, then we are more likely to improve the system, because we are focusing on repairing problems. This is why “what went wrong” and “what can we do to fix it” takes the lion’s share of the attention.

But the problem with this perspective is that it misunderstands the skills that are brought to bear during incident response, and how learning from a previously well-handled incident can actually help other responders do better in future incidents. Effective incident response happens because the responders are skilled. But every incident response team is an ad-hoc one, and just because you happened to have people with the right set of skills responding last time doesn’t mean you’ll have people with the right set of skills next time. This means that if you gloss over what went well, your next incident might be even worse than the last one, because you’ve deprived those future responders of the opportunity to learn from observing the skilled responders last time.

To make this more concrete, let’s look back at the GCP incident report. In this scenario, the engineers had put in a red-button as a safety precaution and exercised it to remediate the incident.

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path… Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. 

However, that’s not the part that interests me so much. Instead, it’s the part about how the infrastructure became overloaded as a consequence of the remediation, and how the responders recovered from overload.

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure…. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.

This was not a failure scenario that they had explicitly designed for in advance of deploying the change: there was no red-button they could simply exercise to roll back the system to a non-overloaded state. Instead, they were forced to improvise a solution based on the controls that were available to them. In this case, they were able to reduce the load by turning down the rate of task creation, as well as by re-routing traffic away from the overloaded database.

And this sort of work is the really interesting bit of an incident: how skilled responders are able to take advantage of generic functionality that is available in order to remediate an unexpected failure mode. This is one of the topics that the field of resilience engineering focuses on: how incident responders are able to leverage generic capabilities during a crunch. If I were an engineer at Google in this org, I would be very interested to learn what knobs are available and how to twist them. Describing this in detail in an incident write-up will increase my chances of being able to leverage this knowledge later. Heck, even just leaving bread crumbs in the doc will help, because I’ll remember the incident, look up the write-up, and follow the links.

Another enormously useful “what went well” aspect that often gets short shrift is a description of the diagnostic work: how the responders figured out what was going on. This never shows up in public incident write-ups, because the information is too proprietary, so I don’t blame Google for not writing about how the responders determined the source of the overload. But all too often these details are left out of the internal write-ups as well. This sort of diagnostic work draws on a crucial set of skills for incident response, and having the opportunity to read about how experts applied their skills to solve this problem helps transfer those skills across the organization.

Here’s my claim: providing details on how things went well will reduce your future mitigation time even more than focusing on what went wrong. While every incident is different, the generic skills are common, and so getting better at response will get you more mileage than preventing repeats of previous incidents. You’re going to keep having incidents over and over. The best way to get better at incident handling is to handle more incidents yourself. The second best way is to watch experts handle incidents. The better you do at telling the stories of how your incidents were handled, the more people will learn about how to handle incidents.

Quick takes on the GCP public incident write-up

On Thursday (2025-06-12), Google Cloud Platform (GCP) had an incident that impacted dozens of their services, in all of their regions. They’ve already released an incident report (go read it!), and here are my thoughts and questions as I read it.

Note that my questions shouldn’t be seen as a critique of the write-up, as the answers to them generally aren’t publicly shareable. They’re more in the “I wish I could be a fly on the wall inside of Google” category.

Quick write-up

First, a meta-point: this is a very quick turnaround for a public incident write-up. As a consumer of these, I of course appreciate getting it faster, and I’m sure there was enormous pressure inside of the company to get a public write-up published as soon as possible. But I also think there are hard limits on how much you can actually learn about an incident when you’re on the clock like this. I assume that Google is continuing to investigate internally how the incident happened, and I hope that they publish another report several weeks from now with any additional details that they are able to share publicly.

Staging land mines across regions

Note that impact (June 12) happened two weeks after deployment (May 29).

This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code.

The system involved is called Service Control. Google stages their deploys of Service Control by region, which is a good thing: staging your changes is a way of reducing the blast radius if there’s a problem with the code. However, in this case, the problematic code path was not exercised during the regional rollout. Everything looked good in the first region, and so they deployed to the next region, and so on.

This is the land mine risk: the code you are rolling out contains a land mine which is not tripped during the rollout.

How did the decisions make sense at the time?

I have no information about how this incident came to be but I can confidently predict that people will blame it on greedy execs and sloppy devs, regardless of what the actual details are. And they will therefore learn nothing from the details.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-07-19T19:17:47.843Z

The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.

This is the typical “we didn’t do X in this case, and had we done X, this incident wouldn’t have happened, or wouldn’t have been as bad” sort of analysis that is very common in these write-ups. The problem with this is that it implies sloppiness on the part of the engineers, that important work was simply overlooked. We don’t have any sense of how the development decisions made sense at the time.

If this scenario was atypical (i.e., usually error handling and feature flags are added), what was different about this development case? We don’t have the context about what was going on during development, which means we (as external readers) can’t understand how this incident actually was enabled.

Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

How do they know it would have been caught in staging, if it didn’t manifest in production until two weeks after roll-out? Are they saying that adding a feature flag would have led to manual testing of the problematic code path in staging? Here I just don’t know enough about Google’s development processes to make sense of this observation.

Service Control did not have the appropriate randomized exponential backoff implemented to avoid [overloading the infrastructure].

As I discuss later, I’d wager it’s difficult to test for this in general, because the system generally doesn’t run in the mode that would exercise this. But I don’t have the context, so it’s just a guess. What’s the history behind Service Control’s backoff behavior? Without knowing its history, we can’t really understand how its backoff implementation came to be this way.

Red buttons and feature flags

As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. (emphasis added)

Because I’m unfamiliar with Google’s internals, I don’t understand how their “red button” system works. In my experience, the “red button” type functionality is built on top of feature flag functionality, but that does not seem to be the case at Google, since here there was no feature flag, but there was a big red button.

It’s also interesting to me that, while this feature wasn’t feature-flagged, it was big-red-buttoned. There’s a story here! But I don’t know what it is.

New feature: additional policy quota checks

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks… On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies.

I have so many questions. What were these additional quota policy checks? What was the motivation for adding these checks (i.e., what problem are the new checks addressing)? Is this customer-facing functionality (e.g., GCP Cloud Quotas), or is it internal-only? What was the purpose of the policy change that was inserted on June 12 (or was it submitted by a customer)? Did that policy change take advantage of the new Service Control features that were added on May 29? Was that the first policy change that happened since the new feature was deployed, or had there been others? How frequently do policy changes happen?

Global data changes

Code changes are scary, config changes are scarier, and data changes are the scariest of them all.

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-14T19:32:32.669Z

Given the global nature of quota management, this metadata was replicated globally within seconds.

While code and feature flag changes are staged across regions, apparently quota management metadata is designed to replicate globally.

Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues. (emphasis mine)

The implication I take from the text is that there was a business requirement for quota management data changes to happen globally rather than staged, and that they are now going to push back on that requirement.

What was the rationale for this business requirement? What are the tradeoffs involved in staging these changes versus having them happen globally? What new problems might arise when data changes are staged like this?

Are we going to be reading a GCP incident report in a few years that resulted from inconsistency of this data across regions due to this change?

Saturation!

From an operational perspective, I remain terrified of databases

Lorin Hochstein (@norootcause.surfingcomplexity.com) 2025-06-13T17:21:16.810Z

Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure.

Here we have a classic example of saturation, where a database got overloaded. Note that saturation wasn’t the trigger here, but it made recovery more difficult. Our system is in a different mode during incident recovery than it is during normal mode, and it’s generally very difficult to test for how it will behave when it’s in recovery mode.

Does this incident match my conjecture?

I have a long-standing conjecture that once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

I don’t have enough information in this write-up to be able to make a judgment in this case: it depends on whether or not the quota management system’s purpose is to improve reliability. I can imagine it going either way. If it’s a public-facing system to help customers limit their costs, then that’s more of a traditional feature. On the other hand, if it’s to limit the blast radius of individual user activity, then that feels like a reliability improvement system.

What are the tradeoffs of the corrective actions?

The write-up lists seven bullets of corrective actions. The questions I always have of corrective actions are:

  • What are the tradeoffs involved in implementing these corrective actions?
  • How might they enable new failure modes or make future incidents more difficult to deal with?

AI at Amazon: a case study of brittleness

A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail of the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers refer to as brittleness, a term they use for a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

  1. Decompensation: exhausting capacity to adapt as challenges cascade
  2. Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
  3. Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

It would take weeks to get access to any internal data for analysis or experiments
Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands that are placed upon it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again. In the meantime, the world isn’t standing still waiting for that decision to be made.

As described in the Bad Technical Process section of Eric’s post, Amazon was not able to keep up with the rate at which its competitors were making progress on developing AI technology, even though Amazon had both the talent and the compute resources necessary in order to make progress. The people inside the organization who needed the resources weren’t able to get them in a timely fashion. That slowed down AI development and, consequently, they got lapped by their competitors.

Working at cross-purposes

Alexa’s org structure was decentralized by design meaning there were multiple small teams working on sometimes identical problems across geographic locales.

This introduced an almost Darwinian flavor to org dynamics where teams scrambled to get their work done to avoid getting reorged and subsumed into a competing team.

The consequence was an organization plagued by antagonistic mid-managers that had little interest in collaborating for the greater good of Alexa and only wanted to preserve their own fiefdoms.

My group by design was intended to span projects, whereby we found teams that aligned with our research/product interests and urged them to collaborate on ambitious efforts. The resistance and lack of action we encountered was soul-crushing.

Where decompensation is a consequence of poor centralization, working at cross-purposes is a consequence of poor decentralization. In a decentralized organization, the individual units are able to work more quickly, but there’s an alignment risk: enabling everyone to row faster isn’t going to help if they’re rowing in different directions.

In the Fragmented Org Structures section of Eric’s writeup, he goes into vivid, almost painful detail about how Amazon’s decentralized org structure worked against them.

Getting stuck in outdated behaviors

Alexa was viciously customer-focused which I believe is admirable and a principle every company should practice. Within Alexa, this meant that every engineering and science effort had to be aligned to some downstream product.

That did introduce tension for our team because we were supposed to be taking experimental bets for the platform’s future. These bets couldn’t be baked into product without hacks or shortcuts in the typical quarter as was the expectation.

So we had to constantly justify our existence to senior leadership and massage our projects with metrics that could be seen as more customer-facing.

This introduced product/science conflict in every weekly meeting to track the project’s progress leading to manager churn every few months and an eventual sunsetting of the effort.

I’m generally not a fan of management books, but What got you here won’t get you there is a pretty good summary of the third failure pattern: when organizations continue to apply approaches that were well-suited to problems in the past but are ill-suited to problems in the present.

In the Product-Science Misalignment section of his post, Eric describes how Amazon’s traditional viciously customer-focused approach to development was a poor match for the research-style work that was required for developing AI. Rather than Amazon changing the way they worked in order to facilitate the activities of AI researchers, the researchers had to try to fit themselves into Amazon’s pre-existing product model. Ultimately, that effort failed.


I write mostly about software incidents on this blog, which are high-tempo affairs. But the failure of Amazon to compete effectively in the AI space, despite its head start with Alexa, its internal talent, and its massive set of compute resources, can also be viewed as a kind of incident. As demonstrated in this post, we can observe the same sorts of patterns in failures that occur in the span of months as we can in failures that occur in the span of minutes. How well Amazon is able to learn from this incident remains to be seen.

Pattern machines that we don’t understand

How do experts make decisions? One theory is that they generate a set of options, estimate the cost and benefits of each option, and then choose the optimal one. The psychology researcher Gary Klein developed a very different theory of expert decision-making, based on his studies of expert decision-making in domains such as firefighting, nuclear power plant operations, aviation, anesthesiology, nursing, and the military. Under Klein’s theory of naturalistic decision-making, experts use a pattern-matching approach to make decisions.

Even before Klein’s work, humans were already known to be quite good at pattern recognition. We’re so good at spotting faces that we have a tendency to see faces in things that aren’t actually faces, a phenomenon known as pareidolia.

(Wout Mager/Flickr/CC BY-NC-SA 2.0)

As far as I’m aware, Klein used the humans-as-black-boxes research approach of observing and talking to the domain experts: while he was metaphorically trying to peer inside their heads, he wasn’t doing any direct measurement or modeling of their brains. But if you are inclined to take a neurophysiological view of human cognition, you can see how the architecture of the brain provides a mechanism for doing pattern recognition. We know that the brain is organized as an enormous network of neurons, which communicate with each other through electrical impulses.

The psychology researcher Frank Rosenblatt is generally credited with being the first researcher to do computer simulations of a model of neural networks, in order to study how the brain works. He called his model a perceptron. In his paper The Perceptron: a probabilistic model for information storage and organization in the brain, he noted pattern recognition as one of the capabilities of the perceptron.

While perceptrons may have started out as a model for psychology research, they became one of a competing set of strategies for building artificial intelligence systems. The perceptron approach to AI was dealt a significant blow by the AI researchers Marvin Minsky and Seymour Papert in 1969 with the publication of their book Perceptrons. Minsky and Papert demonstrated that there were certain cognitive tasks that perceptrons were not capable of performing.

However, Minsky and Papert’s critique applied only to single-layer perceptron networks. It turns out that if you create a network out of multiple layers, and you add non-linear processing elements to the layers, then these limits to the capabilities of a perceptron no longer apply. When I took a graduate-level artificial neural networks course back in the mid-2000s, the networks we worked with had on the order of three layers. Modern LLMs have a lot more layers than that: the deep in deep learning refers to the large number of layers. For example, the largest GPT-3 model (from OpenAI) has 96 layers, the larger DeepSeek-LLM model (from DeepSeek) has 95 layers, and the largest Llama 3.1 model (from Meta) has 126 layers.
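As a tiny illustration of why the multi-layer, non-linear caveat matters: a single-layer perceptron can’t compute XOR (the classic example associated with Minsky and Papert’s critique), but a two-layer network with a non-linear step function can. Here’s a hand-wired sketch in Python, with weights I picked by hand rather than learned:

def step(x: float) -> int:
  return 1 if x > 0 else 0       # the non-linear processing element

def xor(x1: int, x2: int) -> int:
  h1 = step(x1 + x2 - 0.5)       # hidden unit: fires if at least one input is 1 (OR)
  h2 = step(1.5 - x1 - x2)       # hidden unit: fires unless both inputs are 1 (NAND)
  return step(h1 + h2 - 1.5)     # output unit: fires only if both hidden units fire (AND)

[xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)]   # [0, 1, 1, 0]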

Here’s a ridiculously oversimplified conceptual block diagram of a modern LLM.

There’s an initial stage which takes text and turns it into a sequence of vectors. Then, that sequence of vectors gets passed through the layers in the middle. Finally, you get your answer out at the end. (Note: I’m deliberately omitting discussion of what actually happens in the stages depicted by the oval and the diamond above, because I want to focus on the layers in the middle for this post. I’m not going to talk at all about concepts like tokens, embedding, attention blocks, and so on. If you’re interested in those sorts of details, I highly recommend the video But what is a GPT? Visual intro to Transformers by Grant Sanderson).

We can imagine the LLM as a system that recognizes patterns at different levels of abstraction. The first and last layers deal directly with representations of words, so they have to operate at the word level of abstraction; let’s think of that as the lowest level. As we go deeper into the network, we can imagine each layer as dealing with patterns at a higher level of abstraction, which we could call concepts. Since the last layer deals with words again, layers towards the end would be back at a lower level of abstraction.


But, really, this talk of encoding patterns at increasing and decreasing levels of abstraction is pure speculation on my part; there’s no empirical basis for it. In reality, we have no idea what sorts of patterns are encoded in the middle layers. Do they correspond to what we humans think of as concepts? We simply have no idea how to interpret the meaning of the vectors that are generated by the intermediate layers. Are the middle layers “higher level” than the outer layers in the sense that we understand that term? Who knows? We just know that we get good results.


The things we call models have different kinds of applications. We tend to think first of scientific models, which are models that give scientists insight into how the world works. Scientific models are a type of model, but not the only one. There are also engineering models, whose purpose is to accomplish some sort of task. A good example of an engineering model is a weather prediction model that tells us what the weather will be like this week. Another good example of an engineering model is SPICE, which electrical engineers use to simulate electronic circuits.

Perceptrons started out as a scientific model of the brain, but their real success has been as an engineering model. Modern LLMs contain within them feedforward neural networks, which are the intellectual descendants of Rosenblatt’s perceptrons. Some people even refer to these as multilayer perceptrons. But LLMs are not engineering models designed to achieve a specific task the way weather models or circuit models are. Instead, they are models that were designed to predict the next word in a sentence, and it just so happens that if you build and train your model the right way, you can use it to perform cognitive tasks that it was not explicitly designed to do! Or, as Sean Goedecke put it in a recent blog post (emphasis mine):

Transformers work because (as it turns out) the structure of human language contains a functional model of the world. If you train a system to predict the next word in a sentence, you therefore get a system that “understands” how the world works at a surprisingly high level. All kinds of exciting capabilities fall out of that – long-term planning, human-like conversation, tool use, programming, and so on.

This is a deeply weird and surprising outcome of building a text prediction system. We’ve built text prediction systems before: Claude Shannon was writing about probability-based models of natural language back in the 1940s, in the famous paper that gave birth to the field of information theory. But it’s not obvious that once these models get big enough, you’d get results like we’re getting today, where you can ask the model questions and get answers. At least, it’s not obvious to me.
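For a sense of what those early probability-based models looked like, here’s a minimal sketch of a Shannon-style bigram model: count which word follows which in a corpus, then predict the next word by sampling from those counts. The corpus and code are my own toy illustration, not Shannon’s actual construction.

    import random
    from collections import Counter, defaultdict

    # Minimal sketch of a bigram next-word predictor: count which word
    # follows which in a corpus, then sample the next word from those
    # counts. The corpus here is a toy example, purely for illustration.

    corpus = "the cat sat on the mat and the cat slept on the mat".split()

    follows = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        follows[current][nxt] += 1

    def predict_next(word):
        counts = follows[word]
        words, weights = zip(*counts.items())
        return random.choices(words, weights=weights)[0]

    print(predict_next("the"))  # "cat" or "mat", weighted by observed frequency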

In 2020, the linguistics researchers Emily Bender and Alexander Koller published a paper titled Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. This is sometimes known as the octopus paper, because it contains a thought experiment about a hyper-intelligent octopus eavesdropping on a conversation between two English speakers by tapping into an undersea telecommunications cable, and how the octopus could never learn the meaning of English phrases through mere exposure. This seems to contradict Goedecke’s observation. They also note how research has demonstrated that humans are not capable of learning a new language through mere exposure to it (e.g., through TV or radio). But I think the primary thing this illustrates is how fundamentally different LLMs are from human brains, and how little we can learn about LLMs by making comparisons to humans. The architecture of an LLM is radically different from the architecture of a human brain, and the learning processes are also radically different. I don’t think a human could learn the structure of a new language by being exposed to a massive corpus and then trying to predict the next word. Our intuitions, which work well when dealing with humans, simply break down when we try to apply them to LLMs.


The late philosopher of mind Daniel Dennett proposed the concept of the intentional stance: a perspective we take when predicting the behavior of things we consider to be rational agents. To illustrate it, let’s contrast it with two other stances he describes, the physical stance and the design stance. Consider the following three scenarios, in each of which you’re asked to make a prediction.

Scenario 1: Imagine that a child has rolled a ball up a long ramp that sits at a 30-degree incline. I tell you that the ball is currently rolling up the ramp at 10 metres per second and ask you to predict what its speed will be one minute from now.

A ball that has been rolled up a ramp

Scenario 2: Imagine a car driving up a hill at a 10-degree incline. I tell you that the car is currently moving at 60 km/h, and that the driver has cruise control enabled, also set to 60 km/h. I ask you to predict the speed of the car one minute from now.

A car with cruise control enabled, driving uphill

Scenario 3: Imagine another car on a flat road, going at 50 km/h, that is about to enter an intersection just as the traffic light turns yellow. Another bit of information I give you: the driver is heading to an important job interview and is running late. Again, I ask you to predict the speed of the car one minute from now.

In the first scenario (ball rolling up a ramp), we can predict the ball’s future speed by treating it as a physics problem. This is what Dennett calls the physical stance.
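Under the physical stance, the prediction is just kinematics. Here’s a back-of-the-envelope sketch in Python (my own simplification: treating the ball as a point mass and ignoring rolling inertia and friction), which shows the ball stopping within a couple of seconds and then rolling back down:

    import math

    # Back-of-the-envelope physical-stance prediction for scenario 1,
    # treating the ball as a point mass and ignoring rolling inertia
    # and friction: it decelerates at g * sin(incline) while moving up.

    g = 9.8                      # m/s^2
    incline = math.radians(30)
    v0 = 10.0                    # m/s, initial speed up the ramp

    deceleration = g * math.sin(incline)  # about 4.9 m/s^2
    time_to_stop = v0 / deceleration      # about 2 seconds

    print(f"decelerates at {deceleration:.1f} m/s^2, stops after {time_to_stop:.1f} s,")
    print("then rolls back down the ramp, picking up speed again")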

In the second scenario (car with cruise control enabled), we view the car as an artifact that was designed to maintain its speed when cruise control is enabled. We can easily predict that its future speed will be 60 km/h. This is what Dennett calls the design stance. Here, we are using our knowledge that the car has been designed to behave in certain ways in order to predict how it will behave.

In the third scenario (driver running late who encounters a yellow light), we think about the intentions of the driver: they don’t want to be late for their interview, so we predict that they will accelerate through the intersection, and that their future speed will be somewhere around 60 km/h. This is what Dennett calls the intentional stance. Here, we are using our knowledge of the desires and beliefs of the driver to predict what actions they will take.

Now, because LLMs have been designed to replicate human language, our instinct is to apply the intentional stance to predict their behavior. It’s a kind of pareidolia: we’re seeing intentionality in a system that mimics human language output. Dennett was horrified by this.

But the design stance doesn’t really help us with LLMs either. Yes, the design stance enables us to predict that an LLM-based chatbot will generate plausible-sounding answers to our questions, because that is what it was designed to do. But beyond that, we can’t really reason about its behavior.

Generally, operational surprises are useful in teaching us how our system works by letting us observe circumstances in which it is pushed beyond its limits. For example, we might learn about a hidden limit somewhere in the system that we didn’t know about before. This is one of the advantages of doing incident reviews, and it’s also one of the reasons that psychologists study optical illusions. As Herb Simon put it in The Sciences of the Artificial: “Only when [a bridge] has been overloaded do we learn the physical properties of the materials from which it is built.”

However, when an LLM fails from our point of view by producing a plausible but incorrect answer to a question, this failure mode doesn’t give us any additional insight into how the LLM actually works. Because, in a real sense, that LLM is still successfully performing the task it was designed to do: generate plausible-sounding answers. We aren’t capable of designing LLMs that produce only correct answers; we can only design ones that produce plausible answers. And so we learn nothing from what we consider LLM failures, because the LLMs aren’t actually failing. They are doing exactly what they are designed to do.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability has significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for the system you thought of whose reliability increased over time, think about what its complexity trend looks like over that same period. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. For someone who grew up in the 80s and 90s, as I did, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one most people have in mind when they think about increasing complexity in their systems, is synonymous with the idea of tech debt. With tech debt, the increase in complexity makes the system less reliable, because the risk that any given change breaks the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Dijkstra claims simplicity is a prerequisite for reliability. According to him, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability but come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems, like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It?, and think about the work we do to make our software systems more reliable: retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and the auxiliary systems we run to support this reliability work, like an observability stack. All of this stuff adds complexity.
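To make that concrete, here’s a minimal sketch of my own (not taken from any of those books): a single remote call, first as the bare happy path, then wrapped with just two of those mechanisms, a timeout and retries with backoff. The endpoint URL is made up; the point is only how much code the reliability machinery adds.

    import random
    import time

    import requests  # real HTTP library; the endpoint below is made up

    # The "simple" version: one line of happy path, no reliability machinery.
    def fetch_profile(user_id):
        return requests.get(f"https://example.com/users/{user_id}").json()

    # The same call wrapped with just two mechanisms from the list above:
    # a timeout and retries with exponential backoff plus jitter. Every
    # added line exists to make the call more reliable, and every added
    # line is also more code to reason about.
    def fetch_profile_reliably(user_id, attempts=3, timeout_s=2.0):
        for attempt in range(attempts):
            try:
                resp = requests.get(
                    f"https://example.com/users/{user_id}", timeout=timeout_s
                )
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise  # out of retries; surface the error to the caller
                # Exponential backoff with jitter before the next attempt.
                time.sleep((2 ** attempt) + random.random())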

Imagine if I took a working codebase and proposed deleting all of the lines of code involved in error handling. I’m very confident that this deletion would make the codebase simpler. There’s a reason programming books tend to leave error handling out of their examples: it really does add complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.

Let’s look at the natural world, where biology provides endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems that can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother Nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through that simplifying work. But successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful, reliable systems will inevitably get more complex. Our job isn’t to reduce that complexity; it’s to get better at dealing with it.