Brief thoughts on the recent Cloudflare outage

I was at QCon SF during the recent Cloudflare outage (I was hosting the Stories Behind the Incidents track), so I hadn’t had a real chance to sit down and do a proper read-through of their public writeup and capture my thoughts until now. As always, I recommend you read through the writeup first before you read my take.

All quotes are from the writeup unless indicated otherwise.

Hello saturation my old friend

The software had a limit on the size of the feature file, and the file's doubled size exceeded that limit. That caused the software to fail.

One thing I hope readers take away from this blog post is the complex systems failure mode pattern that resilience engineering researchers call saturation. Every complex system out there has limits, no matter how robust that system is. And the systems we deal with have many, many different kinds of limits, some of which you might only learn about once you’ve breached that limit. How well a system is able to perform as it approaches one of its limits is what resilience engineering is all about.

Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.

In this particular case, the limit was set explicitly.

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

As sparse as the panic message is, it does explicitly tell you that the problematic call site was an unwrap call. And this is one of the reasons I’m a fan of explicit limits over implicit limits: you tend to get better error messages than when breaching an implicit limit (e.g., of your language runtime, the operating system, the hardware).

A subsystem designed to protect surprisingly inflicts harm

Identify and mitigate automated traffic to protect your domain from bad bots. – Cloudflare Docs

The problematic behavior was in the Cloudflare Bot Management system. Specifically, it was in the bot scoring functionality, which estimates the likelihood that a request came from a bot rather than a human.

This is a system that is designed to help protect their customers from malicious bots, and yet it ended up hurting their customers in this case rather than helping them.

As I’ve mentioned previously, once your system achieves a certain level of reliability, it’s the protective subsystems that end up being the things that bite you! These subsystems are a net positive: they help much more than they hurt. But they also add complexity, and complexity introduces new, confusing failure modes into the system.

The Cloudflare case is a more interesting one than the typical instances of this behavior I’ve seen, because Cloudflare’s whole business model is to offer different kinds of protection, as products for their customers. It’s protection-as-a-service, not an internal system for self-protection. But even though their customers are purchasing this from a vendor rather than building it in-house, it’s still an auxiliary system intended to improve reliability and security.

Confusion in the moment

What impressed me the most about this writeup is that they documented some aspects of what it was like responding to this incident: what they were seeing, and how they tried to make sense of it.

In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:

Man, if I had a nickel for every time I saw someone Slack “Is it DDOS?” in response to a surprising surge of errors returned by the system, I could probably retire at this point.

The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.

We humans are excellent at recognizing patterns based on our experience, and that generally serves us well during incidents. Someone who is really good at operations can frequently diagnose the problem very quickly just by, say, the shape of a particular graph on a dashboard, or by seeing a specific symptom and recalling similar failures that happened recently.

However, sometimes we encounter a failure mode that we haven’t seen before, which means that we don’t recognize the signals. Or we might have seen a cluster of problems recently that followed a certain pattern, and assume that the latest one looks like the last one. And these are the hard ones.

This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack. 

This incident was one of those hard ones: the symptoms were confusing. The “problem went away, then came back, then went away again, then came back again” type of unstable incident behavior is generally much harder to diagnose than one where the symptoms are stable.

Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page.

Here they got bit by a co-incident, an unrelated failure of their status page that led them to believe (reasonably!) that the problem must have been external.

I’m still curious as to what happened with their status page. The error message they were getting mentions CloudFront, so I assume they were hosting their status page on AWS. But their writeup doesn’t go into any additional detail on what the status page failure mode was.

But the general takeaway here is that even the most experienced operators are going to take longer to deal with a complex, novel failure mode, precisely because it is complex and novel! As the resilience engineering folks say, prepare to be surprised! (Because I promise, it’s going to happen).

A plea: assume local rationality

The writeup included a screenshot of the code that had an unhandled error. Unfortunately, there’s nothing in the writeup that tells us what the programmer was thinking when they wrote that code.

In the absence of any additional information, a natural human reaction is to just assume that the programmer was sloppy. But if you want to understand how these sorts of incidents actually happen, you have to fight this reaction.

People always make decisions that make sense to them in the moment, based on what they know and what constraints they are operating under. After all, if that wasn’t true, then they wouldn’t have made that decision. The only way we can actually understand the conditions that enable incidents is to try as hard as we can to put ourselves into the shoes of the person who made that call, to understand what their frame of mind was in the moment.

If we don’t do that, we risk the problem of distancing through differencing. We say, “oh, those devs were bozos, I would never have made that kind of mistake”. This is a great way to limit how much you can learn from an incident.

Detailed public writeups as evidence of good engineering

The writeup produced by Cloudflare (signed by the CEO, no less!) was impressively detailed. It even includes a screenshot of a snippet of code that contributed to the incident! I can’t recall ever reading another public writeup with that level of detail.

Companies generally err on the side of saying less rather than more. After all, if you provide more detail, you open yourself up to criticism that the failure was due to poor engineering. The fewer details you provide, the fewer things people can call you out on. It’s not hard to find people online criticizing Cloudflare using the details they provided as the basis for their criticism.

Now, I think it would advance our industry if people held the opposite view: the more detail that is provided in an incident writeup, the higher esteem we should hold that organization in. I respect Cloudflare as an engineering organization a lot more precisely because they are willing to provide these sorts of details. I don’t want to hear what Cloudflare should have done from people who weren’t there; I want to see us hold other companies up to Cloudflare’s standard for describing the details of a failure mode and the inherently confusing nature of incident response.

My favorite developer productivity research method that nobody uses

You’ve undoubtedly heard of the psychological concept called flow state. This is the feeling you get when you’re in the zone, where you’re doing some sort of task, and you’re just really into it, and you’re focused, and it’s challenging but not frustratingly so. It’s a great feeling. You might experience this with a work task, or a recreational one, like when playing a sport or a video game. The pioneering researcher on the phenomenon of flow was the Hungarian-American psychologist Mihaly Csikszentmihalyi, and he wrote a popular book on the subject back in 1990 with the title Flow: The Psychology of Optimal Experience, which I read many years ago. But the one thing that stuck around most with me from Csikszentmihalyi’s book on Flow was the research method that he used to study flow.

One of the challenges of studying people’s experiences is that it’s difficult for researchers to observe them directly. This problem comes up when an organization tries to do research on the current state of developer productivity within the organization. I harp on “make work visible” a lot because so much of the work we do in the software world is so hard for others to see. There are different data collection techniques that developer productivity researchers use, including surveys, interviews, and focus groups, as well as automatic collection of metrics, like the DORA metrics. Of those, only the automatic collection of metrics focuses on in-the-moment data, and it’s a very thin type of data at that. Those metrics can’t give you any insights into the challenges that your developers are facing.

My preferred technique is the case study, which I try to apply to incidents. I like the incident case study technique because it gives us an opportunity to go deep into the nature of the work for a specific episode. But incident-as-case-study only works for, well, incidents, and while a well-done incident case study can shine a light on the nature of the development work, there’s also a lot that it will miss.

Csikszentmihalyi used a very clever approach which was developed by his PhD student Suzanne Prescott, called experience sampling. He gave the participants of his study pagers, and he would page them at random times. When paged, the participants would write down information about their experiences in a journal in-the-moment. In this way, he was able to collect information about subjective experience, without the problems you get when trying to elicit an account retrospectively.

I’ve never read about anybody trying to use this approach to study developer productivity, and I think that’s a shame. It’s something I’ve wanted to try myself, except that I have not worked in the developer productivity space for a long, long time.

These days, I’d probably use Slack rather than a pager and journal to randomly reach out to the volunteers during the study and collect their responses, but the principle is the same. I’ve long wanted to capture an “are you currently banging your head against a wall” metric from developers, but with experience sampling, you could capture a “what are you currently banging your head against the wall about?”

Would this research technique actually work for studying developer productivity issues within an organization? I honestly don’t know. But I’d love to see someone try.


Note: I originally had the incorrect publication date for the Flow book. Thanks to Daniel Miller for the correction.

Easy will always trump simple

One of the early criticisms of Darwin’s theory of evolution by natural selection was about how it could account for the development of complex biological structures. It’s often not obvious to us how the earlier forms of some biological organ would have increased fitness. “What use”, asked the 19th century English biologist St. George Jackson Mivart, “is half a wing?”

One possible answer is that while half a wing might not be useful for flying, it may have had a different function, and evolution eventually repurposed that half-wing for flight. This concept, that evolution can take some existing trait in an organism that serves a function and repurpose it to serve a different function, is called exaptation.

Biology seems to be quite good at using the resources that it has at hand in order to solve problems. Not too long ago, I wrote a review of the book How Life Works: A User’s Guide to the New Biology by the British science writer Philip Ball. One of the main themes of the book is how biologists’ view of genes has shifted over time from the idea of DNA-as-blueprint to DNA-as-toolbox. Biological organisms are able to deal effectively with a wide range of challenges by having access to a broad set of tools, which they can deploy as needed based on their circumstances.

We’ll come back to the biology, but for a moment, let’s talk about software design. Back in 2011, Rich Hickey gave a talk at the (sadly defunct) Strange Loop conference with the title Simple Made Easy (transcript, video). In this talk, Hickey drew a distinction between the concepts of simple and easy. Simple is the opposite of complex, while easy is something that’s familiar to us: the term he used to describe the concept of easy that I really liked was at hand. Hickey argues that when we do things that are easy, we can initially move quickly, because we are doing things that we know how to do. However, because easy doesn’t necessarily imply simple, we can end up with unnecessarily complex solutions, which will slow us down in the long run. Hickey instead advocates for building simple systems. According to Hickey, simple and easy aren’t inherently in conflict, but are instead orthogonal. Simple is an absolute concept, and easy is relative to what the software designer already knows.

I enjoy all of Rich Hickey’s talks, and this one is no exception. He’s a fantastic speaker, and I encourage you to listen to it (there are some fun digs at agile and TDD in this one). And I agree with the theme of his talk. But I also think that, no matter how many people listen to this talk and agree with it, easy will always win out over simple. One reason is the ever-present monster that we call production pressure: we’re always under pressure to deliver our work within a certain timeframe, and easier solutions are, by definition, going to be ones that are faster to implement. That means the incentives on software developers tilt the scales heavily towards the easy side. Even more generally, though, easy is just too effective a strategy for solving problems. The late MIT mathematics professor Gian-Carlo Rota noted that every mathematician has only a few tricks, and that includes famous mathematicians like Paul Erdős and David Hilbert.

Let’s look at two specific examples of the application of easy from the software world, specifically, database systems. The first example is about knowledge that is at-hand. Richard Hipp implemented SQLite v1 as a compiler that translated SQL into byte code, because he had previous experience building compilers but not building database engines. The second example is about an exaptation, leveraging an implementation that was at-hand. Postgres’s support for multi-version concurrency control (MVCC) relies upon an implementation that was originally designed for other features, such as time-travel queries. (Multi-version support was there from the beginning, but MVCC was only added in version 6.5).

Now, the fact that we rely frequently on easy solutions doesn’t necessarily mean that they are good solutions. After all, the Postgres source I originally linked to has the title The Part of PostgreSQL We Hate the Most. Hickey is right that easy solutions may be fast now, but they will ultimately slow us down, as the complexity accretes in our system over time. Heck, one of the first journal papers that I published was a survey paper on this very topic of software getting more difficult to maintain over time. Any software developer that has worked at a company other than a startup has felt the pain of working with a codebase that is weighed down by what Hickey refers to in his talk as incidental complexity. It’s one of the reasons why startups can move faster than more mature organizations.

But, while companies are slowed down by this complexity, it doesn’t stop them entirely. What Hickey refers to in his talk as complected systems, the resilience engineering researcher David Woods refers to as tangled. In the resilience engineering view, Woods’s tangled, layered networks inevitably arise in complex systems.

Hickey points out that humans can only keep a small number of entities in their head at once, which puts a hard limit on our ability to reason about our systems. But the genuinely surprising thing about complex systems, including the ones that humans build, is that no individual has to understand the whole system for it to work! It turns out that it’s enough for individuals to only understand parts of the system. Even without anyone having a complete understanding of the whole system, we humans can keep the system up and running, and even extend its functionality over time.

Now, there are scenarios when we do need to bring to bear an understanding of the system that is greater than any one person possesses. My own favorite example is when there’s an incident that involves an interaction between components, where no one person understands all of the components involved. But here’s another thing that human beings can do: we can work together to perform cognitive tasks that none of us could do on our own, and one such task is remediating an incident. This is an example of the power of diversity, as different people have different partial understandings of the system, and we need to bring those together.

To circle back to biology: evolution is terrible at designing simple systems; I think biological systems are the most complex systems that we humans have encountered. And yet, they work astonishingly well. Now, I don’t think that we should design software the way that evolution designs organisms. Like Hickey, I’m a fan of striving for simplicity in design. But I believe that complex systems, whether you call them complected or tangled, are inevitable: they’re just baked into the fabric of the adaptive universe. I also believe that easy is such a powerful heuristic that it is also baked into how we build and evolve systems. That being said, we should be inspired, by both biology and Hickey, to have useful tools at-hand. We’re going to need them.

Cloudflare and the infinite sadness of migrations

(With apologies to The Smashing Pumpkins)

A few weeks ago, Cloudflare experienced a major outage of their popular 1.1.1.1 public DNS resolver.

On July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.

– Cloudflare (@cloudflare.social), July 16, 2025

Technically, the DNS resolver itself was working just fine: it was (as far as I’m aware) up and running the whole time. The problem was that nobody on the Internet could actually reach it. The Cloudflare public write-up is quite detailed, and I’m not going to summarize it here. I do want to bring up one aspect of their incident, because it’s something I worry about a lot from a reliability perspective: migrations.

Cloudflare’s migration

When this incident struck, Cloudflare supported two different ways of managing what they call service topologies. There was a newer system that supported progressive rollout, and an older system where the changes occurred globally. The Cloudflare incident involved the legacy system, which makes global changes, which is why the blast radius of this incident was so large.

Source: https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/

Cloudflare engineers were clearly aware that these sorts of global changes are dangerous. After all, I’m sure that’s one of the reasons why they built their new system in the first place. But migrating all of the way to the new thing takes time.

Migrations and why I worry about them

If you’ve ever worked at any sort of company that isn’t a startup, you’ve had to deal with a migration. Sometimes a migration impacts only a single team that owns the system in question, but often migrations are changes that are large in scope (typically touching many teams) which, while providing new capabilities to the organization as a whole, don’t provide much short-term benefit to the teams who have to make a change to accommodate the migration.

A migration is a kind of change that, almost by definition, the system wasn’t originally designed to accommodate. We build our systems to support making certain types of future changes, and migrations are exactly not these kinds of changes. Each migration is typically a one-off type of change. While you’ll see many migrations if you work at a more mature tech company, each one will be different enough that you won’t be able to leverage common tooling from one migration to help make the next one easier.

All of this adds up to reliability risk. While a migration-related change wasn’t a factor in the Cloudflare incident, I believe that such changes are inherently risky, because you’re making a one-off change to the way that your system works. Developers generally have a sense that these sorts of changes are risky. As a consequence, for an individual on a team who has to do work to support somebody else’s migration, all of the incentives push them towards dragging their feet: making the migration-related change takes time away from their normal work, and increases the risk they break something. On the other hand, completing the migration generally doesn’t provide them short-term benefit. The costs typically outweigh the benefits. And so all of the forces push towards migrations taking a long time.

But a delay in implementing a migration is also a reliability risk, since migrations are often used to improve the reliability of the system. The Cloudflare incident is a perfect example of this: the newer system was safer than the old one, because it supported staged rollout. And while they ran the new system, they had to run the old one as well.

Why run one system when you can run two?

The scariest type of migration to me is the big bang migration, where you cut over all at once from the old system to the new one. Sometimes you have no choice, but it’s an approach that I personally would avoid whenever possible. The alternative is to do incremental migration, migrating parts of the system over time. To do incremental migration, you need to run the old system and the new system concurrently, until you’ve completely finished the migration and can shut the old system down. When I worked at Netflix, people used the term Roman riding to refer to running the old and new system in parallel, in reference to a style of horseback riding.

What actual Roman riding looks like

The problem with Roman riding is that it’s risky as well. While incremental is safer than big bang, running two systems concurrently increases the complexity of the system. There are many, many opportunities for incidents while you’re in the midst of a migration running the two systems in parallel.

What is to be done?

I wish I had a simple answer here. But my unsatisfying one is that engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores. I frequently joke that platform engineering should really be called migration engineering, because any org large enough to do platform engineering is going to be spending a lot of its cycles doing migrations.

Migrations are also unglamorous work: nobody’s clamoring for the title of migration engineer. People want to work on greenfield projects, not deal with the toil of a one-off effort to move the legacy thing onto the new thing. There’s also not a ton written on doing migrations. A notable exception is (fellow TLA+ enthusiast) Marianne Bellotti’s book Kill It With Fire, which sits on my bookshelf, and which I really should re-read.

I’ll end this post with some text from the “Remediation and follow-up steps” of the Cloudflare writeup:

We are implementing the following plan as a result of this incident:

Staging Addressing Deployments: Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.

Deprecating Legacy Systems: We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.

I’m sure they’ll prioritize this particular migration because of the attention this incident has garnered. But I also bet there are a whole lot more in-flight migrations at Cloudflare, as well as at other companies, that increase complexity through maintaining two systems and delaying the move to the safer thing. What are they actually going to do in order to complete those other migrations more quickly? If it was easy, it would already be done.

Dijkstra never took a biology course

Simplicity is prerequisite for reliability. — Edsger W. Dijkstra

Think about a system whose reliability has significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer of a system that’s gotten more reliable over time.

Think of a system where the reliability trend looks like this

Now, for that system whose reliability increased over time, think about what its complexity trend looks like over the same period. I’d wager you’d see a similar sort of trend.

My claim about what the complexity trend looks like over time

Now, in general, increases in complexity don’t lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. When I was growing up in the 80s and 90s, the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.

Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one that most people have in mind when they think about increasing complexity in their system, is synonymous with the idea of tech debt. With tech debt, the increase in complexity makes the system less reliable, because the risk of making a breaking change in the system has increased. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely that you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.

Dijkstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.

reliability ⇒ simplicity

The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability, but come at the cost of increased complexity.

reliability ⇒ complexity

Look at classic works on improving the reliability of real-world systems, like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It?, and then think about the work that we do to make our software systems more reliable: functionality like retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, and transactions, as well as the auxiliary systems we run to support our reliability work, like an observability stack. All of this stuff adds complexity.

Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion of code would make the codebase simpler. There’s a reason that programming books tend to avoid error handling cases in their examples: error handling increases complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.

Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems which can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.

Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.

Model error

One of the topics I wrote about in my last post was using formal methods to build a model of how our software behaves. In this post, I want to explore how the software we write itself contains models: models of how the world behaves.

The most obvious area is in our database schemas. These schemas enable us to digitally encode information about some aspect of the world that our software cares about. Heck, we even used to refer to this encoding of information into schemas as data models. Relational modeling is extremely flexible: in principle, we can represent just about any aspect of the world with it, if we put enough effort in. The challenge is that the world is messy, and this messiness significantly increases the effort required to build more complete models. Because we often don’t even recognize the degree of messiness the real world contains, we build over-simplified models that are too neat. This is how we end up with issues like the ones captured in Patrick McKenzie’s essay Falsehoods Programmers Believe About Names. There’s a whole book-length meditation on the messiness of real data and how it poses challenges for database modeling: Data and Reality by William Kent, which is highly recommended by Hillel Wayne in his post Why You Should Read “Data and Reality”.

The problem of missing the messiness of the real world is not at all unique to software engineers. For example, see Christopher Alexander’s A City Is Not a Tree for a critique of urban planners’ overly simplified view of human interactions in urban environments. For a more expansive lament, check out James C. Scott’s excellent book Seeing Like a State. But, since I’m a software engineer and not an urban planner or a civil servant, I’m going to stick to the software side of things here.

Models in the back, models in the front

In particular, my own software background is in the back-end/platform/infrastructure space. In this space, the software we write frequently implements control systems. It’s no coincidence that both cybernetics and kubernetes derived their names from the same ancient Greek word: κυβερνήτης. Every control system must contain within it a model of the system that it controls. Or, as Roger C. Conant and W. Ross Ashby put it, every good regulator of a system must be a model of that system.

Things get even more complex on the front-end side of the software world. This world must bridge the software world with the human world. In the context of Richard Cook’s framing in Above the Line, Below the Line, the front-end is the line that bridges the two worlds. As a consequence, the front-end’s responsibility is to expose a model of the software’s internal state to the user. This means that the front-end also has an implicit model of the users themselves. In the paper Cognitive Systems Engineering: New wine in new bottles, Erik Hollnagel and David Woods referred to this model as the image of the operator.

The dangers of the wrongness of models

There’s an oft-repeated quote by the statistician George E.P. Box: “All models are wrong but some are useful”. It’s a true statement, but one that focuses only on the upside of wrong models, the fact that some of them are useful. There’s also a downside to the fact that all models are wrong: the wrongness of these models can have drastic consequences.

What the quote fails to hint at is just how bad the consequences can be when a model is wrong. One of my favorite examples involves the 2008 financial crisis, as detailed in the journalist Felix Salmon’s 2009 Wired Magazine article Recipe for Disaster: The Formula that Killed Wall Street. The article described how Wall Street quants used a mathematical model known as the Gaussian copula function to estimate risk. It was a useful model that ultimately led to disaster.

Here’s a ripped-from-the-headlines example of an image-of-the-operator model error: how the U.S. national security advisor Mike Waltz accidentally saved the phone number of Jeffrey Goldberg, editor of the Atlantic magazine, to the contact information of White House spokesman Brian Hughes. The source is the recent Guardian story How the Atlantic’s Jeffrey Goldberg got added to the White House Signal group chat:

According to three people briefed on the internal investigation, Goldberg had emailed the campaign about a story that criticized Trump for his attitude towards wounded service members. To push back against the story, the campaign enlisted the help of Waltz, their national security surrogate.

Goldberg’s email was forwarded to then Trump spokesperson Brian Hughes, who then copied and pasted the content of the email – including the signature block with Goldberg’s phone number – into a text message that he sent to Waltz, so that he could be briefed on the forthcoming story.

Waltz did not ultimately call Goldberg, the people said, but in an extraordinary twist, inadvertently ended up saving Goldberg’s number in his iPhone – under the contact card for Hughes, now the spokesperson for the national security council.

According to the White House, the number was erroneously saved during a “contact suggestion update” by Waltz’s iPhone, which one person described as the function where an iPhone algorithm adds a previously unknown number to an existing contact that it detects may be related.

The software assumed that, when you receive a text from someone containing a phone number and an email address, the phone number and email address belong to the sender. This is a model of the user that turned out to be very, very wrong.

Nobody expects model error

Software incidents involve model errors in one way or another, whether it’s an incorrect model of the system being controlled, an incorrect image of the operator, or a combination of the two.

And, yet, despite us all intoning “all models are wrong, some models are useful”, we don’t internalize that our systems are built on top of imperfect models. This is one of the ironies of AI: we are now all aware of the risks associated with model error with LLMs. We’ve even come up with a separate term for it: hallucinations. But traditional software is just as vulnerable to model error as LLMs are, because our software is always built on top of models that are guaranteed to be incomplete.

You’re probably familiar with the term black swan, popularized by the acerbic public intellectual Nassim Nicholas Taleb. While his first book, Fooled by Randomness, was a success, it was the publication of The Black Swan that made Taleb a household name, and introduced the public to the concept of black swans. While the term black swan was novel, the idea it referred to was not. Back in the 1980s, the researcher Zvi Lanir used a different term: fundamental surprise. Here’s an excerpt of a Richard Cook lecture on the 1999 Tokaimura nuclear accident where he talks about this sort of surprise (skip to the 45 minute mark).

And this Tokaimura accident was an impossible accident.

There’s an old joke about the creator of the first English American dictionary, Noah Webster … coming home to his house and finding his wife in bed with another man. And she says to him, as he walks in the door, she says, “You’ve surprised me”. And he says, “Madam, you have astonished me”.

The difference was that she of course knew what was going on, and so she could be surprised by him. But he was astonished. He had never considered this as a possibility.

And the Tokaimura was an astonishment or what some, what Zev Lanir and others have called a fundamental surprise which means a surprise that is fundamental in the sense that until you actually see it, you cannot believe that it is possible. It’s one of those “I can’t believe this has happened”. Not, “Oh, I always knew this was a possibility and I’ve never seen it before” like your first case of malignant hyperthermia, if you’re a an anesthesiologist or something like that. It’s where you see something that you just didn’t believe was possible. Some people would call it the Black Swan.

Black swans, astonishment, fundamental surprise, these are all synonyms for model error.

And these sorts of surprises are going to keep happening to us, because our models are always wrong. The question is: in the wake of the next incident, will we learn to recognize that fundamental surprises will keep happening to us in the future? Or will we simply patch up the exposed problems in our existing models and move on?

Models, models every where, so let’s have a think

If you’re a regular reader of this blog, you’ll have noticed that I tend to write about two topics in particular:

  1. Resilience engineering
  2. Formal methods

I haven’t found many people who share both of these interests.

At one level, this isn’t surprising. Formal methods people tend to have an analytic outlook, and resilience engineering people tend to have a synthetic outlook. You can see the clear distinction between these two perspectives in the transcript of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology. Lamport is clearly on the side of the logic, so much so that he ridicules the very idea of taking a biological perspective on software systems. By contrast, resilience engineering types actively look to biology for inspiration on understanding resilience in complex adaptive systems. A great example of this is the late Richard Cook’s talk on The Resilience of Bone.

And yet, the two fields both have something in common: they both recognize the value of creating explicit models of aspects of systems that are not typically modeled.

You use formal methods to build a model of some aspect of your software system, in order to help you reason about its behavior. A formal model of a software system is a partial one, typically only a very small part of the system. That’s because it takes effort to build and validate these models: the larger the model, the more effort it takes. We typically focus our models on a part of the system that humans aren’t particularly good at reasoning about unaided, such as concurrent or distributed algorithms.

The act of creating an explicit model and observing its behavior with a model checker gives you a new perspective on the system being modeled, because the explicit modeling forces you to think about aspects that you likely wouldn’t have considered. You won’t say “I never imagined X could happen” when building this type of formal model, because it forces you to explicitly think about what would happen in situations that you can gloss over when writing a program in a traditional programming language. While the scope of a formal model is small, you have to exhaustively specify the thing within the scope you’ve defined: there’s no place to hide.

Resilience engineering is also concerned with explicit models, in two different ways. In one way, resilience engineering stresses the inherent limits of models for reasoning about complex systems (c.f., itsonlyamodel.com). Every model is incomplete in potentially dangerous ways, and every incident can be seen through the lens of model error: some model that we had about the behavior of the system turned out to be incorrect in a dangerous way.

But beyond the limits of models, what I find fascinating about resilience engineering is the emphasis on explicitly modeling aspects of the system that are frequently ignored by traditional analytic perspectives. Two kinds of models that come up frequently in resilience engineering are mental models and models of work.

A resilience engineering perspective on an incident will look to make explicit aspects of the practitioners’ mental models, both in the events that led up to that incident, and in the response to the incident. When we ask “How did the decision make sense at the time?“, we’re trying to build a deeper understanding of someone else’s state of mind. We’re explicitly trying to build a descriptive model of how people made decisions, based on what information they had access to, their beliefs about the world, and the constraints that they were under. This is a meta sort of model, a model of a mental model, because we’re trying to reason about how somebody else reasoned about events that occurred in the past.

A resilience engineering perspective on incidents will also try to build an explicit model of how work happens in an organization. You’ll often hear the short-hand phrase work-as-imagined vs work-as-done to get at this modeling, where it’s the work-as-done that is the model that we’re after. The resilience engineering perspective asserts that the documented processes of how work is supposed to happen are not an accurate model of how work actually happens, and that the deviation between the two is generally successful, which is why it persists. From resilience engineering types, you’ll hear questions in incident reviews that try to elicit some more details about how the work really happens.

Like in formal methods, resilience engineering models only get at a small part of the overall system. There’s no way we can build complete models of people’s mental models, or generate complete descriptions of how they do their work. But that’s ok. Because, like the models in formal methods, the goal is not completeness, but insight. Whether we’re building a formal model of a software system, or participating in a post-incident review meeting, we’re trying to get the maximum amount of insight for the modeling effort that we put in.

Linearizability! Refinement! Prophecy!

Back in August, Murat Demirbas published a blog post about the paper by Herlihy and Wing that first introduced the concept of linearizability. When we move from sequential programs to concurrent ones, we need to extend our concept of what “correct” means to account for the fact that operations from different threads can overlap in time. Linearizability is the strongest consistency model for single-object systems, which means that it’s the one that aligns closest to our intuitions. Other models are weaker and, hence, will permit anomalies that violate human intuition about how systems should behave.

Beyond introducing linearizability, one of the things that Herlihy and Wing do in this paper is provide an implementation of a linearizable queue whose correctness cannot be demonstrated using an approach known as refinement mapping. At the time the paper was published, it was believed that it was always possible to use refinement mapping to prove that one specification implemented another, and this paper motivated Leslie Lamport and Martín Abadi to propose the concept of prophecy variables.

I have long been fascinated by the concept of prophecy variables, but when I learned about them, I still couldn’t figure out how to use them to prove that the queue implementation in the Herlihy and Wing paper is linearizable. (I even asked Leslie Lamport about it at the 2021 TLA+ conference).

Recently, Lamport published a book called The Science of Concurrent Programs that describes in detail how to use prophecy variables to do the refinement mapping for the queue in the Herlihy and Wing paper. Because the best way to learn something is to explain it, I wanted to write a blog post about this.

In this post, I’m going to assume that readers have no prior knowledge about TLA+ or linearizability. What I want to do here is provide the reader with some intuition about the important concepts, enough to interest people to read further. There’s a lot of conceptual ground to cover: to understand prophecy variables and why they’re needed for the queue implementation in the Herlihy and Wing paper requires an understanding of refinement mapping. Understanding refinement mapping requires understanding the state-machine model that TLA+ uses for modeling programs and systems. We’ll also need to cover what linearizability actually is.

We’re going to start all of the way at the beginning: describing what it is that a program should do.

What does it mean for a program to be correct?

Think of an abstract data type (ADT) such as a stack, queue, or map. Each ADT defines a set of operations. For a stack, it’s push and pop; for a queue, it’s enqueue and dequeue; and for a map, it’s get, set, and delete.

Let’s focus on the queue, which will be a running example throughout this blog post, and is the ADT that is the primary example in the linearizability paper. Informally, we can say that dequeue returns the oldest enqueued value that has not been dequeued yet. It’s sometimes called a “FIFO” because it exhibits first-in-first-out behavior. But how do we describe this formally?

Think about how we would test that a given queue implementation behaves the way we expect. One approach is to write a test that consists of a history of enqueue and dequeue operations, and check whether our queue returns the expected values.

Here’s an example of an execution history, where enq is the enqueue operation and deq is the dequeue operation. Here I assume that enq does not return a value.

enq("A")
enq("B")
deq() → "A"
enq("C")
deq() → "B"
deq() → "C"

If we have a queue implementation, we can make these calls against our implementation and check that, at each step in the history, the operation returns the expected value, something like this:

Queue q = new Queue();
q.enq("A");
q.enq("B");
assertEquals("A", q.deq());
q.enq("C");
assertEquals("B", q.deq());
assertEquals("C", q.deq());

Of course, a single execution history is not sufficient to determine the correctness of our queue implementation. But we can describe the set of every possible valid execution history for a queue. The size of this set is infinite, so we can’t explicitly specify each history like we did above. But we can come up with a mathematical description of the set of every possible valid execution history, even though it’s an infinite set.

Specifying valid execution histories: the transition-axiom method

In order to specify how our system should behave, we need a way of describing all of its valid execution histories. We are particularly interested in a specification approach that works for concurrent and distributed systems, since those systems have historically proven to be notoriously difficult for humans to reason about.

In the 1980s, Leslie Lamport introduced a specification approach that he called the transition-axiom method. He later designed TLA+ as a language to support specifying systems using the transition-axiom method.

The transition-axiom method uses a state-machine model to describe a system. You describe a system by describing:

  1. The set of valid initial states
  2. The set of valid state transitions

(Aside: I’m not covering the more advanced topic of liveness in this post).

A set of related state transitions is referred to as an action. We use actions in TLA+ to model the events we care about (e.g., calling a function, sending a message).

With a state-machine description, we can generate all sequences that start at one of the initial states and transition according to the allowed transitions. A sequence of states is called a behavior. A pair of successive states is called a step.

Each step in a behavior must be a member of one of the actions. In the diagram above, we would call the first step an A-step because it is a step that is a member of the A action.

We refer to the set that includes all of the actions as the next-state action, which is typically called Next in TLA+ specifications.

In the example above, we would say that A, B, C are sub-actions of the Next action.

We call the entire state-machine description a specification: it defines the set of all allowed behaviors.


To make things concrete, let’s start with a simple example: a counter.

Modeling a counter with TLA+

Consider a counter abstract data type that has three operations:

  • inc – increment the counter
  • get – return the current value of the counter
  • reset – return the value of the counter to zero

Here’s an example execution history.

inc()
inc()
get() → 2
get() → 2
reset()
get() → 0

To model this counter in TLA+, we need to model the different operation types (inc, get, reset). We also need to model the return value for the get operation. I’ll model the operation with a state variable named op, and the return value with a state variable named rval.

But there’s one more thing we need to add to our model. In a state-machine model, we model an operation using one or more state transitions (steps) where at least one variable in the state changes. This is because all TLA+ models must allow what are called stuttering steps, where you have a state transition where none of the variables change.

This means we need to distinguish between two consecutive inc operations versus an inc operation followed by a stuttering step where nothing happens.

To do that, I’ll add a third variable to my model, which I’ll unimaginatively call flag. It’s a boolean variable, which I will toggle every time an operation happens. To sum up, my three state variables are:

  • op – the operation (“inc”, “get”, “reset”), which I’ll initialize to “” (empty string) in the first state
  • rval – the return value for a get operation. It will be a special value called none for all of the other operations
  • flag – a boolean that toggles on every (non-stuttering) state transition.

Below is a depiction of an execution history and how it would be modeled as a behavior in TLA+. The text in red indicates which variable changed in each transition. As mentioned above, every transition associated with an operation must have at least one variable that changes value.

Here’s a visual depiction of an execution history. Note how each event in the history is modeled as a step (pair of states) where at least one variable changes.
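Written out as a sequence of states, using only the external variables and assuming flag starts out as FALSE (an arbitrary choice on my part), that behavior looks roughly like this:

op=""       rval=none   flag=FALSE
op="inc"    rval=none   flag=TRUE
op="inc"    rval=none   flag=FALSE   (only flag changes: this is what distinguishes the second inc from a stutter)
op="get"    rval=2      flag=TRUE
op="get"    rval=2      flag=FALSE   (again, only flag distinguishes the two consecutive gets)
op="reset"  rval=none   flag=TRUE
op="get"    rval=0      flag=FALSE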

To illustrate why we need the extra variables, consider the following three behaviors.

In behavior 1, there are no stuttering steps. In behavior 2, the last step is a stuttering step, so there is only one “get” invocation. In behavior 3, there are two stuttering steps.

The internal variables

Our model of a counter so far has defined the external variables, which are the only variables that we really care about as the consumer of a specification. If you gave me the set of all of the valid behaviors for the counter, where behaviors were described using only these external variables, that’s all I would need to understand how a counter behaves.

However, the external variables aren’t sufficient for the producer of a specification to actually generate the set of valid behaviors. This is because we need to keep track of some additional state information: how many increments there have been since the last reset. This type of variable is known as an internal state variable. I’m going to call this particular internal state variable c.

Here’s behavior 1, with different color codings for the external and internal variables.

The actions

Here is a visual depiction of the permitted state transitions. Recall that a set of permitted state transitions is called an action. For our counter, there are three actions, which correspond to the three different operations we model: inc, get, and reset.

Each transition is depicted as two boxes with variables in it. The left-hand box shows the values of the variables before the state transition, and the right-hand box shows the values of the variables after the state transition. By convention we add a prime (‘) to the variables to refer to their values after the state transition.

While the diagram depicts three actions, each action describes a set of allowed state transitions. As an example, here are two different state transitions that are both members of the inc set of permitted transitions.

  1. [flag=TRUE, c=5] → [flag=FALSE, c=6]
  2. [flag=TRUE, c=8] → [flag=FALSE, c=9]

In TLA+ terminology, we call these two steps inc steps. Remember: in TLA+, all of the action is in the actions. We use actions (sets of permitted state transitions) to model the events that we care about.
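To give a rough sense of what this looks like in TLA+, here is a sketch of how I might write the counter’s initial state and actions. The action names and the initial values of flag and c are my own choices, and I’m assuming a none constant analogous to the one declared for the queue later on:

Init  == /\ op = ""        \* no operation has been invoked yet
         /\ rval = none
         /\ flag = FALSE   \* arbitrary starting value
         /\ c = 0          \* internal count of increments since the last reset

Inc   == /\ op' = "inc"
         /\ rval' = none
         /\ flag' = ~flag  \* toggle, so the step always changes at least one variable
         /\ c' = c + 1

Get   == /\ op' = "get"
         /\ rval' = c      \* return the current value of the counter
         /\ flag' = ~flag
         /\ c' = c

Reset == /\ op' = "reset"
         /\ rval' = none
         /\ flag' = ~flag
         /\ c' = 0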

Modeling a queue with TLA+

We’ll move on to our second example, which will form the basis for the rest of this post: a queue. A queue supports two operations, which I’ll call enq (for enqueue) and deq (for dequeue).

Modeling execution histories as behaviors

Recall our example of a valid execution history for a queue:

enq("A")
enq("B")
deq() → "A"
enq("C")
deq() → "B"

We now have to model argument passing, since the enq operation takes an argument.

Here’s one way to model this execution history as a TLA+ behavior.

My model uses three state variables:

  • op – identifies which operation is being invoked (enq or deq)
  • arg – the argument being passed in the case of the enq operation
  • rval – the return value in the case of the deq operation

TLA+ requires that we specify a value for every variable in every state, which means we need to specify a value for arg even for the deq operation, which doesn’t have an argument, and a value for rval for the enq operation, which doesn’t return a value. I defined a special value called none for this case.

In the first state, when the queue is empty, I chose to set op to the empty string (“”) and arg and rval to none.
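Here’s that behavior sketched as a Python list of states, with None standing in for none:

behavior = [
    {"op": "",    "arg": None, "rval": None},  # initial state: no operation yet
    {"op": "enq", "arg": "A",  "rval": None},  # enq("A")
    {"op": "enq", "arg": "B",  "rval": None},  # enq("B")
    {"op": "deq", "arg": None, "rval": "A"},   # deq() → "A"
    {"op": "enq", "arg": "C",  "rval": None},  # enq("C")
    {"op": "deq", "arg": None, "rval": "B"},   # deq() → "B"
]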

The internal variables

For a queue, we need to keep track of all of the values that have previously been enqueued, as well as the order in which they were enqueued.

TLA+ has a type called a sequence, which I’ll use to encode this information: a sequence is like a list in Python.

I’ll add a new variable which I’ll unimaginatively call d, for data. Here’s what that behavior looks like with the internal variable.

Modeling dequeues

Recall that our queue supports two operations: enqueue and dequeue. We’ll start with the dequeue operation. I modeled it with an action called Deq.

Here are some examples of state transitions that are permitted by the Deq action. We call these Deq steps.

I’m not going to write much TLA+ code in this post, but to give you a feel for it, here is how you would write the Deq action in TLA+ syntax:

Deq == /\ d # <<>>
       /\ op' = "deq"
       /\ arg' = none
       /\ rval' = Head(d)
       /\ d' = Tail(d)

The syntax of the first line might be a bit confusing if you’re not familiar with TLA+:

  • # is TLA+ for ≠
  • <<>> is TLA+ for the empty sequence.

Modeling enqueues

Here’s what the Enq action looks like:

There’s non-determinism in this action: the value of arg’ can be any valid value that we are allowed to put onto the queue.

I’ll spend just a little time in this section to give you a sense of how you would use TLA+ to represent the simple queue model.

To describe the queue in TLA+, we define a set called Values that contains all of the valid values that could be enqueued, as well as a special constant named none that means “not a value”.

CONSTANTS Values, none
ASSUME none \notin Values

Then we would encode the Enq action like this:

Enq == /\ op' = "enq"
       /\ arg' \in Values
       /\ rval' = none
       /\ d' = Append(d, arg')

The complete description of our queue, the specification that describes all of its permitted behaviors, looks like this:

For completeness, here’s what the TLA+ specification looks like: (source in Queue.tla).

Init corresponds to our set of initial states, and Next corresponds to the next-state action, where the two sub-actions are Enq and Deq.

The last line, Spec, is the full specification. You can read this as: The initial state is chosen from the Init set of states, and every step is a Next step (every allowed state transition is a member of the set of state transitions defined by the Next action).
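As a rough illustration of “start in an Init state, then keep taking Next steps”, here’s a toy Python generator of behaviors for this queue spec. The function names are mine, and random.choice stands in for TLA+’s nondeterministic choice:

import random

VALUES = ["A", "B", "C"]

def init_state():
    # Init: no operation yet, empty queue
    return {"op": "", "arg": None, "rval": None, "d": []}

def enq_step(s):
    # Enq: nondeterministically pick any value to enqueue
    v = random.choice(VALUES)
    return {"op": "enq", "arg": v, "rval": None, "d": s["d"] + [v]}

def deq_step(s):
    # Deq is only enabled when the queue is non-empty (d # <<>>)
    if not s["d"]:
        return None
    return {"op": "deq", "arg": None, "rval": s["d"][0], "d": s["d"][1:]}

def random_behavior(length):
    # One behavior allowed by Spec: an Init state followed by Next steps
    state = init_state()
    behavior = [state]
    for _ in range(length):
        enabled = [t for t in (enq_step(state), deq_step(state)) if t]
        state = random.choice(enabled)   # Next is Enq \/ Deq
        behavior.append(state)
    return behavior

for s in random_behavior(5):
    print(s)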

Modeling concurrency

In our queue model above, an enqueue or dequeue operation happens in one step (state transition). That’s fine for modeling sequential programs, but it’s not sufficient for modeling concurrent programs. In concurrent programs, the operations from two different threads can overlap in time.

To illustrate, imagine a scenario where there are three threads, t1, t2, t3. First, t1 enqueues “A” and “B”. Then, t2 and t3 both call dequeue, and those dequeues overlap in time.

We want to model concurrent executions using a state-machine model. The diagram above shows the start and end time for each operation. But to model this behavior, we don’t actually care about the exact start and end times: we only care about the relative order of the start and end events.

The listing below shows the threads in different columns.

       t1                t2                t3
---------------   ----------------  ----------------
enq("A") [start]
enq("A") [end]
enq("B") [start]
enq("B") [end]
                  deq() [start]
                                    deq() [start]
                  deq() → "B" [end]
                                    deq() → "A" [end]

Here’s the same execution history, shown in a single column:

t1: enq("A") [start]
t1: enq("A") [end]
t1: enq("B") [start]
t1: enq("B") [end]
t2: deq() [start]
t3: deq() [start]
t2: deq() → "B" [end]
t3: deq() → "A" [end]

We can model execution histories like the one above using state machines. We were previously modeling an operation as a single state transition (step). Now we will need to use two steps to model an operation: one to indicate when the operation starts and the other to indicate when the operation ends.

Because each thread acts independently, we need to model variables that are local to threads. And, in fact, all externally visible variables are scoped to threads, because each operation always happens in the context of a particular thread. We do this by changing the variables to be functions where the domain is a thread id. For example, where we previously had op=”enq” where op was always a string, now op is a function that takes a thread id as an argument. Now we would have op[t1]=”enq” where t1 is a thread id. (Functions in TLA+ use square brackets instead of round ones; you can think of these function variables as acting like dictionaries.)
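To make the function-as-dictionary analogy concrete, here’s one hypothetical snapshot of the thread-scoped external variables, written as Python dicts (the specific values are just for illustration):

# A state after all of the operations above have completed
op   = {"t1": "enq", "t2": "deq", "t3": "deq"}
arg  = {"t1": "B",   "t2": None,  "t3": None}
rval = {"t1": None,  "t2": "B",   "t3": "A"}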

Here’s an example of a behavior that models the above execution history, showing only the external variables. Note that this behavior only shows the values that change in a state.

Note the following changes from the previous behaviors.

  • There is a boolean flag, done, which indicates when the operation is complete.
  • The variables are all scoped to a specific thread.

But what about the internal variable d?

Linearizability as a correctness condition for concurrency

We know what it means for a sequential queue to be correct. But what do we want to consider correct when operations can overlap? We need to decide what it means for an execution history of a queue to be correct in the face of overlapping operations. This is where linearizability comes in. From the abstract of the Herlihy and Wing paper:

Linearizability provides the illusion that each operation applied by concurrent processes takes effect instantaneously at some point between its invocation and its response, implying that the meaning of a concurrent object’s operations can be given by pre- and post-conditions.

For our queue example, we say our queue is linearizable if, for every history where there are overlapping operations, we can identify a point in time between the start and end of each operation where the operation instantaneously “takes effect”, giving us a sequential execution history that is a correct execution history for a serial queue. This is called a linearization. If every execution history for our queue has a linearization, then we say that our queue is linearizable.

To make this concrete, consider the following four observed execution histories of a queue, labeled (a), (b), (c), (d), adapted from Fig. 1 of the Herlihy and Wing linearizability paper. Two of these histories are linearizable (they are labeled “OK”), and two are not (labeled “NOT OK”).

For (a) and (c), we can identify points in time during the operation where it appears as if the operation has instantaneously taken effect.

We now have a strict ordering of operations because there is no overlap, so we can write it as a sequential execution history. When the resulting sequential execution history is valid, it is called a linearization:

(a)
t1: enq("X")
t2: enq("Y")
t2: deq() → "X"
t1: deq() → "Y"
t2: enq("Y")

(c)
t1: enq("X")
t2: deq() → "X"
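To make “a linearization exists” concrete, here’s a rough brute-force checker in Python. It tries every ordering of the completed operations that respects real-time order and asks whether any of them replays correctly against a sequential FIFO queue. The encoding of a history as (name, value, start, end) tuples is my own:

from itertools import permutations

def respects_real_time(order):
    # If operation a finished before operation b started, a must come before b.
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b[3] < a[2]:   # b ended before a started, yet b is ordered after a
                return False
    return True

def valid_fifo_history(order):
    # Replay the candidate ordering against an ideal sequential FIFO queue.
    fifo = []
    for name, value, _, _ in order:
        if name == "enq":
            fifo.append(value)
        elif not fifo or fifo.pop(0) != value:   # deq must return the current head
            return False
    return True

def is_linearizable(history):
    return any(respects_real_time(order) and valid_fifo_history(order)
               for order in permutations(history))

# The overlapping-dequeue history from earlier: t1 enqueues A then B, then t2
# and t3 issue overlapping dequeues that return B and A respectively.
history = [
    ("enq", "A", 0, 1),
    ("enq", "B", 2, 3),
    ("deq", "B", 4, 7),   # t2
    ("deq", "A", 5, 8),   # t3
]
print(is_linearizable(history))   # True: deq() → "A" can take effect before deq() → "B"

This is hopelessly inefficient, since it enumerates every permutation of the operations, but it captures the definition: linearizable means at least one such ordering exists.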

Modeling a linearizable queue in TLA+

To repeat from the last section, a data structure is linearizable if, for every operation that executes on the data structure, we can identify a point in time between the start and the end of the operation where the operation takes effect.

We can model a linearizable queue by modeling each operation (enqueue/dequeue) as three actions:

  1. Start (invocation) of operation
  2. When the operation takes effect
  3. End (return) of operation

Our model needs to permit all possible linearizations. For example, consider the following two linearizable histories. Note how the start/end timings of the operations are identical in both cases, but the return values are different.

In (1) the first deq operation returns “X”, and in (2) the first deq operation returns “Y”. Yet they are both valid histories. The difference between the two is the order in which the enq operations take effect. In (1), enq(“X”) takes effect before enq(“Y”), and in (2), enq(“Y”) takes effect before enq(“X”). Here are the two linearizations:

(1)
enq("X")
enq("Y")
deq()→"X"
deq()→"Y"

(2)
enq("Y")
enq("X")
deq()→"Y"
deq()→"X"

Our TLA+ model of a linearizable queue will need to be able to model the relative order of when these operations take effect. This is where the internal variables come into play in our model: “taking effect” will mean updating internal variables of our model.

We need an additional variable to indicate whether the internal state has been updated or not for the current operation. I will call this variable up (for “updated”). It starts off as false when the operation starts, and is set to true when the internal state variable (d) has been updated.

Here’s a visual representation of the permitted state transitions (actions). As before, the left bubble shows the values the variables must have in the first state for the transition to happen, and the second bubble shows which variables change.

Since we now have to deal with multiple threads, we parameterize our action by thread id (t). You can see the TLA+ model here: LinearizableQueue.tla.

We now have a specification for a linearizable queue, which is a description of all valid behaviors. We can use this to verify that a specific queue implementation is linearizable. To demonstrate, let’s shift gears and talk about an example of an implementation.

An example queue implementation

Let’s consider an implementation of a queue that:

  1. Stores the data in a doubly-linked list
  2. Uses a lock to protect the list

A queue with four entries looks like this:

Here’s an implementation of this queue in Python that I whipped up. I call it an “LLLQueue” for “locked-linked-list queue”. I believe that my LLLQueue is linearizable, and I’d like to verify this.
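The full listing is in the repo linked at the end of this post; here’s a minimal sketch of the shape of it. The enqueue method matches the one shown later in this post, while the Node class and the dequeue method are a representative reconstruction of mine rather than the exact code:

import threading

class Node:
    def __init__(self, val, next=None, prev=None):
        self.val = val
        self.next = next   # points toward the head end of the list
        self.prev = prev   # points toward the tail end of the list

class LLLQueue:
    # A doubly-linked list guarded by a single lock.

    def __init__(self):
        self.lock = threading.Lock()
        self.head = None   # next node to be dequeued
        self.tail = None   # most recently enqueued node

    def is_empty(self):
        return self.head is None

    def enqueue(self, val):
        self.lock.acquire()
        new_tail = Node(val=val, next=self.tail)
        if self.is_empty():
            self.head = new_tail
        else:
            self.tail.prev = new_tail
        self.tail = new_tail
        self.lock.release()

    def dequeue(self):
        self.lock.acquire()
        val = None                    # returning None on empty is my choice here
        if not self.is_empty():
            val = self.head.val
            self.head = self.head.prev
            if self.head is None:
                self.tail = None
        self.lock.release()
        return val

Intuitively, holding the single lock for the whole operation is what should make this linearizable: the critical section gives each operation one point in time where it takes effect.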

One way is to use TLA+ to build a specification of my LLLQueue, and then prove that every behavior of my LLLQueue is also a behavior of the LinearizableQueue specification. The way we do this in TLA+ is by a technique called refinement mappings.

But, first, let’s model the LLLQueue in TLA+.

Modeling the LLLQueue in TLA+ (PlusCal)

In a traditional program, a node would be associated with a pointer or reference. I’m going to use numerical IDs for each node, starting with 1. I’ll use the value of 0 as a sentinel value meaning null.

We’ll model this with three functions:

  • vals – maps node id to the value stored in the node
  • prev – maps node id to the previous node id in the list
  • next – maps node id to the next node id in the list

Here are these functions in table form for the queue shown above:

node id   vals
1         A
2         B
3         C
4         D
The vals function

node id   prev
1         2
2         3
3         4
4         0 (null)
The prev function

node id   next
1         0 (null)
2         1
3         2
4         3
The next function

It’s easier for me to use PlusCal to model an LLLQueue than to do it directly in TLA+. PlusCal is a language for specifying algorithms that can be automatically translated to a TLA+ specification.

It would take too much space to describe the full PlusCal model and how it translates, but I’ll try to give a flavor of it. As a reminder, here’s the implementation of the enqueue method in my Python implementation.

    def enqueue(self, val):
        self.lock.acquire()
        new_tail = Node(val=val, next=self.tail)
        if self.is_empty():
            self.head = new_tail
        else:
            self.tail.prev = new_tail
        self.tail = new_tail
        self.lock.release()

Here’s what my PlusCal model looks like for the enqueue operation:

procedure enqueue(val)
variable new_tail;
begin
E1: acquire(lock);
E2: with n \in AllPossibleNodes \ nodes do
        Node(n, val, tail);
        new_tail := n;
    end with;
E3: if IsEmpty then
        head := new_tail;
    else
        prev[tail] := new_tail;
    end if;
    tail := new_tail;
E4: release(lock);
E5: return;
end procedure;

Note the labels (E1, E2, E3, E4, E5) here. The translator turns those labels into TLA+ actions (state transitions permitted by the spec). In my model, an enqueue operation is implemented by five actions.

Refinement mappings

One of the use cases for formal methods is to verify that a (low-level) implementation conforms to a (higher-level) specification. In TLA+, all specs are sets of behaviors, so the way we do this is that we:

  • create a high-level specification that models the desired behavior of the system
  • create a lower-level specification that captures some implementation details of interest
  • show that every behavior of the low-level specification is among the set of behaviors of the higher-level specification, considering only the externally visible variables

In the diagram below, Abs (for abstract) represents the set of valid (externally visible) behaviors of a high-level specification, and Impl (for implementation) represents the set of valid (externally visible) behaviors for a low-level specification. For Impl to implement Abs, the Impl behaviors must be a subset of the Abs behaviors.

We want to be able to prove that Impl implements Abs. In other words, we want to be able to prove that every externally visible behavior in Impl is also an externally visible behavior in Abs.

We want to be able to find a corresponding Abs behavior for every Impl behavior

One approach is to do this by construction: if we can take any behavior in Impl and construct a behavior in Abs with the same externally visible values, then we have proved that Impl implements Abs.

As Abadi and Lamport put it in their paper The Existence of Refinement Mappings back in 1991:

To prove that S1 implements S2, it suffices to prove that if S1 allows the behavior
<<(e0,z0), (e1, z1), (e2, z2), …>>

where [ei is a state of the externally visible component and] the zi are internal states, then there exists internal states yi such that S2 allows
<<(e0,y0), (e1, y1), (e2, y2), … >>

For each behavior B1 in Impl, if we can find values for internal variables in a behavior of Abs, B2, where the external variables of B2 match the external variables of B1, then that’s sufficient to prove that Impl implements Abs.

To show that Impl implements Abs, we need to find a refinement mapping, which is a function that will map every behavior in Impl to a behavior in Abs.

A refinement mapping takes a state in an Impl behavior as input and maps to an Abs state, such that:

  1. the external variables are the same in both the Impl state and the Abs state
  2. if a pair of states is a permitted Impl state transition, then the corresponding mapped pair of states must be a permitted Abs state transition

Or, to reword statement 2: if NextImpl is the next-state action for Impl (i.e., the set of allowed state transitions for Impl), and NextAbs is the next-state action for Abs, then under the refinement mapping, every NextImpl-step must map to a NextAbs step.

(Note: this technically isn’t completely correct; we’ll see why in the section on stuttering below.)

Example: our LLLQueue

We want to verify that our Python queue implementation is linearizable. We’ve modeled our Python queue in TLA+ as LLLQueue, and to prove that it’s linearizable, we need to show that a refinement mapping exists between the LLLQueue spec and the LinearizableQueue spec. This means we need to show that there’s a mapping from LLLQueue’s internal variables to LinearizableQueue’s internal variables.

We need to define the internal variables in LinearizableQueue (up, d) in terms of the variables in LLLQueue (nodes, vals, next, prev, head, tail, lock, new_tail, empty, pc, stack) in such a way that all LLLQueue behaviors are also LinearizableQueue behaviors under the mapping.

Internal variable: d

The internal variable d in LinearizableQueue is a sequence which contains the values of the queue, where the first element of the sequence is the head of the queue.

Looking back at our example LLLQueue queue:

We need a mapping that, for this example, results in: d =〈A,B,C,D 〉

I defined a recursive operator that I named Data such that when you call Data(head), it evaluates to a sequence with the values of the queue.

Internal variable: up

The variable up is a boolean flag that flips from false to true after the value has been added to the queue.

In our LLLQueue model, the new node gets added to the tail by the action E3. In PlusCal models, there’s a variable named pc (program counter) that records the current execution state of the program. You can think of pc like a breakpoint that points to the action that will be executed on the next step of the program. We want up to be true after action E3. You can see how the up mapping is defined at the bottom of the LLLQueue.tla file.

Refinement mapping and stuttering

Let’s consider a behavior of the LLLQueue spec that enqueues a single value onto the queue, with a refinement mapping to the LinearizableQueue spec:

In the LinearizableQueue spec, an enqueue operation is implemented by three actions:

  1. EnqStart
  2. EnqTakesEffect
  3. EnqEnd

In the LLLQueue spec, an enqueue operation is implemented by seven actions: enq, e1, …, e5, enqdone. That means that the LLLQueue enqueue behavior involves eight distinct states, whereas the corresponding LinearizableQueue behavior involves only four distinct states. Sometimes, different LLLQueue states map to the same LinearizableQueue state. In the figure above, SI2, SI3, SI4 all map to SA2, and SI5, SI6, SI7 all map to SA3. I’ve color-coded the states in the LinearizableQueue behavior such that states that have the same color are identical.

As a result, some state transitions in the refinement mapping are not LinearizableQueue actions, but are instead transitions where none of the variables change at all. These are called stuttering steps. In TLA+, stuttering steps are always permitted in all behaviors.
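Putting the two conditions together with the stuttering allowance, here’s a rough Python sketch of what checking a refinement mapping over a single finite behavior might look like. The function and parameter names are mine, and a real check (for example, with the TLC model checker) works over the whole specification rather than one behavior at a time:

def check_refinement(impl_behavior, to_abs, is_abs_step, external_vars):
    # impl_behavior : list of Impl states (dicts of variable values)
    # to_abs        : the refinement mapping, taking an Impl state to an Abs state
    # is_abs_step   : predicate saying whether (s, t) is a permitted Abs transition
    # external_vars : names of the externally visible variables
    for s_impl, t_impl in zip(impl_behavior, impl_behavior[1:]):
        s_abs, t_abs = to_abs(s_impl), to_abs(t_impl)
        # Condition 1: the mapping preserves the externally visible variables.
        for v in external_vars:
            if s_abs[v] != s_impl[v] or t_abs[v] != t_impl[v]:
                return False
        # Condition 2: each mapped step is an Abs step, or a stuttering step
        # where none of the Abs variables change.
        if s_abs != t_abs and not is_abs_step(s_abs, t_abs):
            return False
    return True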

A problem with refinement mappings: the Herlihy and Wing queue

The last section of the Herlihy and Wing paper describes how to prove that a concurrent data structure’s operations are linearizable. In the process, the authors also point out a problem with refinement mappings. They illustrate the problem using a particular queue implementation, which we’ll call the “Herlihy & Wing queue”, or H&W Queue for short.

Imagine an array of infinite length, where all of the values are initially null. There’s a variable named back which points to the next available free spot in the queue.

Enqueueing

Enqueueing a value onto the Herlihy & Wing queue involves two steps:

  1. Increment the back variable
  2. Write the value into the spot where the back variable pointed before being incremented.

Here’s what the queue looks like after three values (A,B,C) have been enqueued:

Note how back always points to the next free spot.

Dequeueing

To dequeue, you start at index 0, and then you sweep through the array, looking for the first non-null value. Then you atomically copy that value out of the array and set the array element to null.

Here’s what a dequeue operation on the queue above would look like:

The Deq operation returned A, and the first element in the array has been set to null.

If you were to enqueue another value (say, D), the array would now look like this:

Note: the elements at the beginning of the queue that get set to null after a dequeue don’t get reclaimed. The authors note that this is inefficient, but the purpose of this queue is to illustrate a particular issue with refinement mappings, not to be a practical queue implementation.

H&W Queue pseudocode

Here’s the pseudocode for Herlihy & Wing queue, which I copied directly from the paper. The two operations are Enq (enqueue) and Deq (dequeue).

rep = record {back: int, items: array [item]}

Enq = proc (q: queue, x: item)
  i: int := INC(q.back) % Allocate a new slot
  STORE (q.items[i], x) % Fill it.
  end Enq

Deq = proc (q: queue) returns (item)
  while true do
    range: int := READ(q.back) - 1
    for i: int in 1 .. range do
      x: item := SWAP(q.items[i], null)
      if x ~= null then return(x) end
      end
    end
end Deq

This algorithm relies on the following atomic operations on shared variables:

  • INC – atomically increment a variable and return the pre-incremented value
  • STORE – atomically write an element into the array
  • READ – atomically read an element in the array (copy the value to a local variable)
  • SWAP – atomically write an element of an array and return the previous array value

H&W Queue implementation in C++

Here’s my attempt at implementing this queue using C++. I chose C++ because of its support for atomic types. C++’s atomic types support all four of the atomic operations required of the H&W queue.

Atomic operation   Description                                                                     C++ equivalent
INC                atomically increment a variable and return the pre-incremented value            std::atomic<T>::fetch_add
STORE              atomically write an element into the array                                      std::atomic_store
READ               atomically read an element in the array (copy the value to a local variable)    std::atomic_load
SWAP               atomically write an element of an array and return the previous array value     std::atomic_exchange

My queue implementation stores pointers to objects of parameterized type T. Note the atomic types of the member variables. The back variable and elements of the items array need to be atomics because we will be invoking atomic operations on them.

#include <atomic>

using std::atomic;

template <typename T>
class Queue {
    atomic<int> back;
    atomic<T *> *items;

public:
    Queue(int sz) : back(0), items(new atomic<T *>[sz]) {}
    ~Queue() { delete[] items; }

    void enq(T *x);
    T *deq();
};

template<typename T>
void Queue<T>::enq(T *x) {
    int i = back.fetch_add(1);
    std::atomic_store(&items[i], x);
}

template<typename T>
T *Queue<T>::deq() {
    while (true) {
        int range = std::atomic_load(&back);
        for (int i = 0; i < range; ++i) {
            T *x = std::atomic_exchange(&items[i], nullptr);
            if (x != nullptr) return x;
        }
    }
}

We can write enq and deq to look more like idiomatic C++ by using the following atomic operators:

  • operator++ (post-increment) – equivalent to fetch_add(1)
  • operator= (assignment) – equivalent to an atomic store
  • implicit conversion to the underlying type (reading the variable) – equivalent to an atomic load

Using these operators, enq and deq look like this:

template<typename T>
void Queue<T>::enq(T *x) {
    int i = back++;
    items[i] = x;
}

template<typename T>
T *Queue<T>::deq() {
    while (true) {
        int range = back;
        for (int i = 0; i < range; ++i) {
            T *x = std::atomic_exchange(&items[i], nullptr);
            if (x != nullptr) return x;
        }
    }
}

Note that this is, indeed, a linearizable queue, even though it does not use mutual exclusion: there are no critical sections in the algorithm.

Modeling the H&W queue in TLA+ with PlusCal

The H&W queue is straightforward to model in PlusCal. If you’re interested in learning PlusCal, it’s actually a great example to use. See HWQueue.tla for my implementation.

Refinement mapping challenge: what’s the state of the queue?

Note how the enq method isn’t an atomic operation. Rather, it’s made up of two atomic operations:

  1. Increment back
  2. Store the element in the array

Now, imagine that a thread, t1, comes along to enqueue a value onto the queue. It starts off by incrementing back.

But before it can continue, a new thread, t2, gets scheduled, which also increments back:

t2 then completes the operation:

Finally, a new thread, t3, comes along and executes the dequeue operation:

Example state of the H&W queue

Now, here’s the question: What value will the pending Deq() operation return: A or B?

The answer: it depends on how the threads t1 and t3 will be scheduled. If t1 is scheduled first, it will write A to position 0, and then t3 will read it. On the other hand, if t3 is scheduled first, it will advance its i pointer to the next non-null value, which is position 1, and return B.
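Here’s a small Python sketch of those two schedules. The dequeue here is simplified to a single sweep rather than the retry loop, and the two-element list stands in for the array state in the figure:

def deq(items):
    # Sweep from index 0 and return the first non-null value (simplified: one pass).
    for i, x in enumerate(items):
        if x is not None:
            items[i] = None
            return x
    return None

# Slot 0 has been reserved by t1 but A has not been written yet; slot 1 holds B.

# Schedule 1: t1 writes A before t3 sweeps
items = [None, "B"]
items[0] = "A"       # t1 finishes its STORE
print(deq(items))    # A

# Schedule 2: t3 sweeps before t1 writes A
items = [None, "B"]
print(deq(items))    # B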

Recall back in the section “Modeling a queue with TLA+ > The internal variables” that our model for a linearizable queue had an internal variable, named d, that contained the elements of the queue in the order in which they had been enqueued.

If we were to write a refinement mapping of this implementation to our linearizable specification, we’d have to decide what d should map to for the state above. The problem is that no such refinement mapping exists.

Here are the only options that make sense for the example above:

  1. d =〈B〉
  2. d =〈A,B〉
  3. d =〈B,A〉

As a reminder, here are the valid state transitions for the LinearizableQueue spec.

Option 1: d =〈B〉

Let’s say we define our refinement mapping by using the populated elements of the queue. That would result in a mapping of d =〈B〉.

The problem is that if t1 gets scheduled first and adds value A to array position 0, then an element will be added to the head of d. But the only LinearizableQueue action that adds an element to d is EnqTakesEffect, which adds a value to the end of d. There is no LinearizableQueue action that allows prepending to d, so this cannot be a valid refinement mapping.

Option 2: d =〈A,B〉

Let’s say we had chosen instead a refinement mapping of d =〈A,B〉for the state above. In that case, if t3 gets scheduled first, then it will result in a value being removed from the end of d, which is not one of the actions of the LinearizableQueue spec, which means that this can’t be a valid refinement mapping either.

Option 3: d =〈B,A〉

Finally, assume we had chosen d =〈B,A〉as our refinement mapping. Then, if t1 gets scheduled first, and then t3, we will end up with a state transition that removes an element from the end of d, which is not a LinearizableQueue action.

Whichever refinement mapping we choose, it is possible that the resulting behavior will violate the LinearizableQueue spec. This means that we can’t come up with a refinement mapping where every behavior of Impl maps to a valid behavior of Abs, even though Impl implements a linearizable queue!

What Lamport and others believed at the time was that this type of refinement mapping always existed if Impl did indeed implement Abs. With this counterexample, Herlihy & Wing demonstrated that this was not always the case.

Elements in H&W queue aren’t totally ordered

In a typical queue implementation, there is a total ordering of elements that have been enqueued. The odd thing about the Herlihy & Wing queue is that this isn’t the case.

If we look back at our example above:

If t1 is scheduled first, A is dequeued next. If t3 is scheduled first, B is dequeued next.

Either A or B might be dequeued next, depending on the ordering of t1 and t3. Here’s another example where the value dequeued next depends on the ordering of the threads t1 and t3.

If t2 is scheduled first, B is dequeued next. If t3 is scheduled first, A is dequeued next.

However, there are also scenarios where there is a clear ordering among values that have been added to the queue. Consider a case similar to the one above, except that t2 has not yet incremented the back variable:

In this configuration, A is guaranteed to be dequeued before B. More generally, if t1 writes A to the array before t2 increments the back variable, then A is guaranteed to be dequeued before B.

In the linearizability paper, Herlihy & Wing use a mapping approach where they identify a set of possible mappings rather than a single mapping.

Let’s think back to this scenario:

If t1 is scheduled first, A is dequeued next. If t3 is scheduled first, B is dequeued next.

In the scenario above, in the Herlihy and Wing approach, the mapping would be to the set of all possible values of the queue.

  • queue ∈ {〈B〉,〈A,B〉, 〈B,A〉}

Lamport took a different approach to resolving this issue. He rescued the idea of refinement mappings by introducing a concept called prophecy variables.

Prophecy

The Herlihy & Wing queue’s behavior is non-deterministic: we don’t know the order in which values will be dequeued, because it depends on the scheduling of the threads. But imagine if we knew in advance the order in which the values will be dequeued.

It turns out that if we can predict the order in which the values would be dequeued, then we can do a refinement mapping to our LinearizableQueue model.

This is the idea behind prophecy variables: we predict certain values that we need for refinement mapping.

Adding a prophecy variable gives us another specification (one which has a new variable), and this is the specification where we can define a refinement mapping. For example, we add a prophecy variable to our HWQueue model and call the new model HWQueueProphecy.

In HWQueueProphecy, we maintain a sequence of the predicted order in which values will be dequeued. Every time a thread invokes the enqueue operation, we add a new value to our sequence of predicted dequeue operations.

The predicted value is chosen at random from the set of all possible values: it is not necessarily related to either the value currently being enqueued or the current state of the queue.

Now, these predictions might not actually come true. In fact, they almost certainly won’t come true, because it’s very likely that at least one value will be predicted incorrectly. In the example above, the actual dequeueing order will be〈A,B,C〉, which is different from the predicted dequeueing order of〈Q,W,E〉.

However, the refinement mapping will still work, even though the predictions will often be wrong, if we set things up correctly. We’ll tackle that next.

Prophecy requirements

We want to show that HWQueue implements LinearizableQueue. But it’s only HWQueueProphecy that we can show implements LinearizableQueue using a refinement mapping.

1. A correct prophecy must exist for every HWQueue behavior

Every behavior in HWQueue must have a corresponding behavior in HWQueueProphecy. That correspondence happens when the prophecy accurately predicts the dequeueing order.

This means that, for each behavior in HWQueue, there must be a behavior in HWQueueProphecy which is identical except that the HWQueueProphecy behaviors have an additional p variable with the prophecy.

To ensure that a correct prophecy always exists, we just make sure that we always predict from the set of all possible values.

In the case of HWQueueProphecy, we are always enqueueing values from the set {A,…,Z}, and so as long as we draw predictions from the set {A,…,Z}, we are guaranteed that the correct prediction is among the set we are predicting from.

2. Every HWQueueProphecy behavior with an incorrect prophecy must correspond to at least one HWQueue behavior

Most of the time, our predictions will be incorrect. We need to ensure that, when we prophesize incorrectly, the resulting behavior is still a valid HWQueue behavior, and is also still a valid LinearizableQueue behavior under refinement.

We do this by writing our HWQueueProphecy specification such that, if our prediction turns out to be incorrect (e.g., we predict A as the next value to be dequeued, and the next value that will actually be dequeued is B), we disallow the dequeue from happening.

In other words, we disallow state transitions that would violate our predictions.

This means we add a new enabling condition to one of the dequeue actions. Now, for the dequeueing thread to remove the value from the array, it has to match the next value in the predicted sequence. In HWQueueProphecy, the name of the action that does this is D5(t), where t is the id of the thread doing the dequeueing.

An incorrect prophecy blocks the dequeueing from actually happening. In the case of HWQueueProphecy, we can still enqueue values (since we don’t make any predictions on enqueue order, only dequeue order, so there’s nothing to block).
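Here’s a toy Python sketch of that bookkeeping, just to make the mechanism concrete. This is not the TLA+ spec: the class and method names are mine, and random.choice stands in for nondeterministic choice:

import random

VALUES = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

class ProphecyBookkeeping:
    def __init__(self):
        self.p = []   # predicted order in which values will be dequeued

    def on_enqueue(self):
        # On every enqueue, predict some future dequeue value, drawn from the whole
        # set of possible values. The prediction need not match the value enqueued.
        self.p.append(random.choice(VALUES))

    def dequeue_enabled(self, value):
        # The dequeue step (D5 in the spec) is only enabled if removing `value`
        # would fulfill the next prediction; otherwise the step is blocked.
        return bool(self.p) and self.p[0] == value

    def on_dequeue(self, value):
        if not self.dequeue_enabled(value):
            raise RuntimeError("prediction not fulfilled: dequeue is disabled")
        self.p.pop(0)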

But let’s consider the more interesting case where an incorrect prophecy results in deadlock, where no actions are enabled anymore. This means that the only possible future steps in the behavior are stuttering steps, where the values never change.

When a prophecy is incorrect, it can result in deadlock, where all future steps are stuttering steps. These are still valid behaviors.

However, if we take a valid behavior of a specification and add stuttering steps, the resulting behavior is always also a valid behavior of that specification. So the resulting behaviors are guaranteed to be in the set of valid HWQueue behaviors.

Using the prophecy to do the refinement mapping

Let’s review what we were trying to accomplish here. We want to figure out a refinement mapping from HWQueueProphecy to LinearizableQueue such that every state transition satisfies the LinearizableQueue specification.

Here’s an example, where the value B has been enqueued, the value A is in the process of being enqueued, and no values have been dequeued yet.

Defining how to do this mapping is not obvious, and I’m not going to explain it here, as it would take up way too much space, and this post is already much too long. Instead, I will refer interested readers to section 6.5 of Lamport’s book A Science of Concurrent Programs, which describes how to do the mapping. See also my POFifopq.tla spec, which is my complete implementation of his description.

But I do want to show you something about it.

Enqueue example

Let’s consider this HWQueue behavior, where we are concurrently enqueueing three values (A,B,C):

These enqueue operations all overlap each other, like this:

The refinement mapping will depend on the predictions.

Here’s an example where the predictions are consistent with the values being enqueued. Note how the state of the mapping ends up matching the predicted values.

Notice how in the final state (S10), the refinement mapping d=〈C,A,B〉is identical to the predicted dequeue ordering: p=〈C,A,B〉.

Here’s an example where the predicted dequeue values p=〈Q,W,E〉can never be fulfilled; the refinement mapping in this case, d=〈A,C,B〉, matches the order in which the overlapping enqueue operations complete.

The logic for determining the mapping varies depending on whether it is possible for the predictions to be fulfilled. For more details on this, see A Science of Concurrent Programs.

For the dequeueing case, if the value to be dequeued matches the first prediction in p, then we execute the dequeue and we remove the prediction from p. (We’ve already discussed the behavior if the value to be dequeued doesn’t match the prediction).

Coda

Phew! There was a lot of content in this post, including definitions and worked out examples. It took me a long time to grok a lot of this material, so I suspect that even an interested reader won’t fully absorb it on the first read. But if you got all of the way here, you’re probably interested enough in TLA+ that you’ll continue to read further on it. I personally find that it helps to have multiple explanations and examples, and I’ve tried to make an effort to present these concepts a little differently than other sources, so hopefully you’ll find it useful to come back here after reading elsewhere.

The TLA+ models and Python and C++ code are all available in my GitHub tla-prophecy repo.

References

There’s a ton of material out there, much of it papers.

Blog posts

Linearizability: A Correctness Condition for Concurrent Objects, Murat Demirbas, August 9, 2024. This was the post that originally inspired this blog post.

Specification Refinement, Hillel Wayne, July 13, 2021. An explanation of refinement mapping.

Papers

Linearizability: a correctness condition for concurrent objects. Maurice P. Herlihy, Jeannette M. Wing. ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 12, Issue 3, July 1990. This is the paper that introduced linearizability.

Leslie Lamport has a complete list of his papers on his website. But here are a few that are most directly relevant to this post.

A Simple Approach to Specifying Concurrent Systems. Leslie Lamport. Communications of the ACM, Volume 32, Issue 1, January 1989. This paper is a gentle introduction to the transition-axiom method: Lamport’s state-machine method of specifications.

Existence of refinement mappings. Martin Abadi, Leslie Lamport. Proceedings of the 3rd Annual Symposium on Logic in Computer Science. July 1988. This is the paper that introduced the idea of refinement mapping.

Prophecy Made Simple. Leslie Lamport, Stephan Merz. ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 44, Issue 2, April 2022. This paper is an introduction to the idea of prophecy variables.

Books (PDFs)

A Science of Concurrent Programs. Leslie Lamport. This is a self-published book on how to use TLA to model concurrent algorithms. This book documents how to use prophecy variables to implement the refinement mapping for the linearizable queue in the Herlihy and Wing paper.

A PlusCal User’s Manual. P-Syntax. Version 1.8. Leslie Lamport. 11 March 2024. This is a good general reference for the PlusCal language.

Websites

Learn TLA+. Hillel Wayne. Website with TLA+ learning materials. A good place to start learning TLA+.

The TLA+ Home Page. The official TLA+ website

Consistency Models. Linearizability is a consistency model, but there are multiple other ones. Kyle Kingsbury provides a good overview of the different models on his Jepsen site.

GitHub repositories

https://github.com/lorin/tla-prophecy – This repo contains the actual refinement mappings for the Herlihy & Wing queue. Much of it is transcribed from A Science of Concurrent Programs.

https://github.com/lorin/tla-linearizability – This repo contains my TLA+ models of the formal definition of linearizability from the Herlihy & Wing paper.

Edits:

  • fixed errors in counter execution history and “Example of Deq steps” diagram
  • Fixed typo: transition-axiom not transition-axion

Expect it most when you expect it least

[Image: Homer Simpson saying "It's probably the person you least suspect."]
Homer Simpson: philosopher

Yesterday, CrowdStrike released a Preliminary Post-Incident Review of the major outage that happened last week. I’m going to wait until the full post-incident review is released before I do any significant commentary, and I expect we’ll have to wait at least a month for that. But I did want to highlight one passage from the doc, from the section titled What Happened on July 19, 2024, emphasis mine:

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

And now, let’s reach way back into the distant past of three weeks ago, when the Canadian Radio-television and Telecommunications Commission (CRTC) posted an executive summary of a major outage, which I blogged about at the time. Here’s the part I want to call attention to, once again with emphasis mine:

Rogers had initially assessed the risk of this seven-phased process as “High.” However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.

In both cases, the engineers had built up confidence over time that the types of production changes they were making were low risk.

When we’re doing something new with a technology, we tend to be much more careful with it: it’s riskier, and we’re still shaking things out. But over time, after there haven’t been any issues, we start to gain more trust in the tech, confident that it’s a reliable technology. Our internal perception of the risk adjusts based on our experience, and we come to believe that the risks of these sorts of changes are low. Any organization that concentrates its reliability efforts on action items in the wake of an incident, rather than focusing on the normal work that doesn’t result in incidents, is implicitly making this type of claim. The squeaky incident gets the reliability grease. And, indeed, it’s rational to allocate your reliability effort based on your perception of risk. Any change can break us, but we can’t treat every change the same. How could it be otherwise?

The challenge for us is that large incidents are not always preceded by smaller ones, which means that there may be risks in your system that haven’t manifested as minor outages. I think these types of risks are the most dangerous ones of all, precisely because they’re much harder for the organization to see. How are you going to prioritize doing the availability work for a problem that hasn’t bitten you yet, when your smaller incidents demonstrate that you have been bitten by so many other problems?

This means that someone has to hunt for weak signals of risk and advocate for doing the kind of reliability work where there isn’t a pattern of incidents you can point to as justification. The big ones often don’t look like the small ones, and sometimes the only signal you get in advance is a still, small sound.

Dirty writes

For databases that support transactions, there are different types of anomalies that can potentially occur: the higher the isolation level, the more classes of anomalies are eliminated (at a cost of reduced performance).

The anomaly that I always had the hardest time wrapping my head around was the one called a dirty write. This blog post is just to provide a specific example of a dirty write scenario and why it can be problematic. (For another example, see Adrian Colyer’s post).

Here’s how a dirty write is defined in the paper A Critique of ANSI SQL Isolation Levels:

Transaction T1 modifies a data item. Another transaction T2 then further modifies that data item before T1 performs a COMMIT or ROLLBACK.

Here’s an example. Imagine a bowling alley has a database where they keep track of who has checked out a pair of bowling shoes. For historical reasons, they track the left shoe borrower and the right shoe borrower as separate columns in the database:

Once upon a time, they used to let different people check out the left and the right shoe for a given pair. However, they don’t do that anymore: now both shoes must always be checked out by the same person. This is the invariant that must always be preserved in the database.

Consider the following two transactions, where both Alice and Bob are trying to borrow the pair of shoes with id=7.

-- Alice takes the id=7 pair of shoes
BEGIN TRANSACTION;

UPDATE shoes
   SET left = "alice"
 WHERE id = 7;

UPDATE shoes
   SET right = "alice"
 WHERE id = 7;

COMMIT TRANSACTION;


-- Bob takes the id=7 pair of shoes
BEGIN TRANSACTION;

UPDATE shoes
   SET left = "bob"
 WHERE id = 7;

UPDATE shoes
   SET right = "bob"
 WHERE id = 7;

COMMIT TRANSACTION;

Let’s assume that these two transactions run concurrently. We don’t care about whether Alice or Bob is the one who gets the shoes in the end, as long as the invariant is preserved (both left and right shoes are associated with the same person once the transactions complete).

Now imagine that the operations in these transactions happen to be scheduled so that the individual updates and commits end up running in this order:

  1. SET left = “alice”
  2. SET left = “bob”
  3. SET right = “bob”
  4. Commit Bob’s transaction
  5. SET right = “alice”
  6. Commit Alice’s transactions

I’ve drawn that visually here, where the top part of the diagram shows the operations grouped by Alice/Bob, and the bottom part shows them grouped by left/right column writes.

If the writes actually occur in this order, then the resulting table column will be:

id   left    right
7    "bob"   "alice"
Row 7 after the two transactions complete

After these transactions complete, this row violates our invariant! The dirty write is indicated in red: Bob’s write has clobbered Alice’s write.
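To see this end to end, here’s a toy Python sketch that just replays the six writes above against an in-memory row. It doesn’t model any real transaction machinery; it only shows the effect of the interleaving:

row = {"id": 7, "left": None, "right": None}

row["left"] = "alice"    # 1. Alice: SET left = "alice"
row["left"] = "bob"      # 2. Bob:   SET left = "bob"   (overwrites Alice's uncommitted write)
row["right"] = "bob"     # 3. Bob:   SET right = "bob"
                         # 4. Bob commits
row["right"] = "alice"   # 5. Alice: SET right = "alice"
                         # 6. Alice commits

if row["left"] != row["right"]:
    print("invariant violated:", row)   # {'id': 7, 'left': 'bob', 'right': 'alice'}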

Note that dirty writes don’t require any reads to occur during the transactions, nor are they only problematic for rollbacks.