Quick takes on the latest Cloudflare public incident write-up

Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception: Cloudflare incident on November 14, 2024, resulting in lost logs.

I wanted to make some quick observations about how we see some common incident patterns here. All of the quotes are from the original Cloudflare post.

Saturation (overload)

In this case, a misconfiguration in one part of the system caused a cascading overload in another part of the system, which was itself misconfigured.

A very common failure mode in incidents is when the system reaches some limit, where it cannot keep up with the demands put upon it. The blog post uses the term overload, and often you hear the term resource exhaustion. Brendan Gregg uses the term saturation in his USE method for analyzing system performance.

A short temporary misconfiguration lasting just five minutes created a massive overload that took us several hours to fix and recover from.

The resilience engineering research David Woods uses the term saturation in a more general sense, to refer to a system being in a state where it can no longer meet the demands put upon it. The challenge of managing the risk of saturation is a key part of his theory of graceful extensibility.

It’s genuinely surprising how many incidents involve saturation, and how difficult it can be to recover when the system saturates.

This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle.

For other examples, see some of these other posts I’ve written:

When safety mechanism make things worse (Lorin’s law)

In a previous blog post entitled A conjecture on why reliable systems fail, I wrote:

Once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or
Unexpected behavior of a subsystem whose primary purpose was to improve reliability

In this case, it was a failsafe mechanism that enabled the saturation failure mode (emphasis in the original):

This bug essentially informed Logfwdr that no customers had logs configured to be pushed. The team quickly noticed the mistake and reverted the change in under five minutes.

Unfortunately, this first mistake triggered a second, latent bug in Logfwdr itself. A failsafe introduced in the early days of this feature, when traffic was much lower, was configured to “fail open”. This failsafe was designed to protect against a situation when this specific Logfwdr configuration was unavailable (as in this case) by transmitting events for all customers instead of just those who had configured a Logpush job. This was intended to prevent the loss of logs at the expense of sending more logs than strictly necessary when individual hosts were prevented from getting the configuration due to intermittent networking errors, for example.

The eternal paradox.
— Lorin Hochstein (@norootcause.surfingcomplexity.com) 2024-11-28T02:02:05.654Z

Note: I had not yet read the Cloudflare writeup when I originally posted this!

Automated safety mechanisms themselves add complexity, and we are no better at implementing bug-free safety code than we are at implementing bug-free feature code. The difference is that when safety mechanisms go awry, they tend to be much more difficult to deal with, as we saw here.

I’m not opposed to automatic safety mechanisms! For example, I’m a big fan of autoscalers, which are an example of an automated safety mechanism. But it’s important to be aware of there’s a tradeoff: they prevent simpler incidents but enable new, complex incidents. The lesson I take away is that we need to get good at dealing with complex incidents where these safety mechanisms will inevitably contribute to the problem.

Complex interactions (multiple contributing factors)

Unfortunately, this first mistake triggered a second, latent bug in Logfwdr itself.

(Emphasis mine)

I am a card-carrying member of the “no root cause” club: I believe that all complex systems failures result from the interaction of multiple contributors that all had to be present for the incident to occur and to be as severe as it was.

When this failsafe was first introduced, the potential list of customers was smaller than it is today.

In this case, we see the interaction of multiple bugs

Even given this massive overload, our systems would have continued to send logs if not for one additional problem. Remember that Buftee creates a separate buffer for each customer with their logs to be pushed. When Logfwdr began to send event logs for all customers, Buftee began to create buffers for each one as those logs arrived, and each buffer requires resources as well as the bookkeeping to maintain them. This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle.

(Emphasis mine)

A huge increase in the number of buffers is a failure mode that we had predicted, and had put mechanisms in Buftee to prevent this failure from cascading. Our failure in this case was that we had not configured these mechanisms. Had they been configured correctly, Buftee would not have been overwhelmed.

The two issues that the authors explicitly call out in the (sigh) root causes section are:

A bug that resulted in a blank configuration being provided to Logfwdr
Incorrect Buftee configuration for preventing failure cascades

However, these are also factors that enabled the incident.

The presence of failsafe (fail open) behavior
The increase in size of the potential list of customers over time
Buftee implementation that creates a separate buffer for each customer with logs to be pushed
The amount of load that Buftee was provisioned to handle

I’ve written about the problems with the idea of root cause several times in the past, including:

Keep an eye out for those patterns!

In your own organization, keep an eye out for patterns like saturation, when safety mechanisms make things worse, and complex interactions. They’re easy to miss if you aren’t explicitly looking for them.

The Tortoise and the Hare in Alloy

If you’ve done your share of leetcode-style interviewing, and you’re above a certain age, you may have been asked during a technical screen to write a program that determines if a linked list contains a cycle. If the interviewer was really tough on you, they might have asked how to implement this in O(1) space.

There’s a well-known O(1) algorithm for finding cycles in linked lists, attributed to Robert Floyd, called the tortoise and the hare. I’ve previously written about modeling this algorithm in TLA+. In this post, I’m going to do it in Alloy. Version 6 of Alloy added support for temporal operators, which makes it easier to write TLA+ style models, with the added benefit of Alloy’s visualizer. This was really just an excuse for me to play with these operators.

You can find my model at https://github.com/lorin/tortoise-hare-alloy/

Brief overview of the algorithm

Basic strategy: define two pointers that both start at the head of the list. At each iteration, you advance one of the pointers (the tortoise) by one step, and the other (the hare) by two steps. If the hare reaches the tail of the list, there are no cycles. If the tortoise and the hare ever point to the same node, there’s a cycle.

With that out of the way, let’s model this algorithm in Alloy!

Linked lists

Let’s start off by modeling linked lists. Here’s the basic signature.

sig Node {
    next : lone Node
}

Every linked list has a head. Depending on whether there’s a cycle, it may or may not have a tail. But we do know that it has at most one tail.

one sig Head in Node {}
lone sig Tail in Node {}

Let’s add a fact about the head, using Alloy reflexive transitive closure operator (*).

fact "all nodes are reachable from the head" {
    Node in Head.*next
}

You can think of Head.*next as meaning “every node that is reachable from Head, including Head itself”.

Finally, we’ll add a fact about the tail:

fact "the tail is the only node without a successor" {
    all n : Node | no n.next <=> n = Tail
}

We can now use Alloy to generate some instances for us to look at. Here’s how to tell Alloy to generate an instance of the model that contains 5 nodes and has a tail:

acyclic: run { some Tail } for exactly 5 Node

This is what we see in the visualizer:

We can also tell Alloy to generate an instance without a tail:

cycle: run { no Tail } for exactly 5 Node

Here are three different instances without tails:

Tortoise and hare tokens

The tortoise and the hare are pointers to the nodes. However, I like to think of them like tokens moving along a game board, so I called them tokens. Here’s how I modeled them:

abstract sig Token {
    var at : Node
}

one sig Tortoise, Hare extends Token {}

Note that the Token.at field has a var prefix. That’s new in Alloy 6, and it means that the field can change over time.

As in TLA+, we need to specify the initial state for variables that change over time. Both tokens start at the head, which we can express as a fact.

fact init {
    Token.at in Head
}

Next, as in TLA+, we need to model how variables change over time.

Here’s the predicate that’s true whenever the tortoise and hare take a step. Alloy uses the same primed variable notation as TLA+ to refer to “the value of the variable in the next state”. In TLA+, we’d call this kind of predicate an action, because it contains primed variables:

pred move {
    Tortoise.at' = advance[Tortoise.at]
    Hare.at' = advance[advance[Hare.at]]
}

This predicate uses a helper function I wrote called advance which takes a pointer to a node and advances to the next node, unless it’s at the tail, in which case it stays where it is:

fun advance[n : Node] : Node {
    (n = Tail) implies n else n.next
}

We can run our model like this, using the always temporal operator to indicate that the move predicate is true at every step.

run {always move} for exactly 5 Node

Here’s a screenshot of Alloy’s visualizer UI for one of the traces. You can see that there are 5 states in the trace, and it’s currently displaying state 2.

Here are all of the states in the trace:

It’s confusing to follow what’s happening over time because Alloy re-arranges the layout of the nodes at the different steps. We’ll see later on this post how we can configure the visualizer so to make it easier to follow.

Output of the algorithm

So far we’ve modeled the movement of the tortoise and hare tokens, but we haven’t fully modeled the algorithm, because we haven’t modeled the return value, which is supposed to indicate whether there’s a cycle or not.

I modeled the return value as a Result signature, like this:

abstract sig CycleStatus {}
one sig Cycle, NoCycle, Running extends CycleStatus {}

var one sig Result in CycleStatus {}

You can think of Result as being like an enum, which can take on values of either Cycle, NoCycle, or Running.

Note that Result has a var in front, meaning it’s a variable that can change over time. It starts off in the Running state, so let’s augment our init fact:

fact init {
    Token.at in Head
    Result = Running
}

Let’s also define termination for this algorithm. Our algorithm is done when Result is either Cycle or NoCycle. Once it’s done, we no longer need to advance the tortoise and hare pointers. We also don’t change the result once the program has terminated.

pred done {
    Result in Cycle + NoCycle
    Tortoise.at' = Tortoise.at
    Hare.at' = Hare.at
    Result' = Result
}

We need to update the move action so that it updates our Result variable. We also don’t want the move action to be enabled when the algorithm is done (no need to keep advancing the pointers), so we’ll add an enabling condition. As a result, move now looks like this:

pred move {
  // enabling condition
  Result = Running

  // advance the pointers
  Tortoise.at' = advance[Tortoise.at]
  Hare.at' = advance[advance[Hare.at]]

  // update Result if the hare has reached the tail 
  // or tortoise and hare are at the same node
  Hare.at' = Tail implies Result' = NoCycle
                  else    Hare.at' = Tortoise.at' implies Result' = Cycle
                                                  else    Result' = Result
}

The syntax for updating the Result isn’t as nice as in TLA+: Alloy doesn’t have a case statement. It doesn’t even have an if statement: instead we use implies/else to achieve if/else behavior.

We can now define the full spec like this:

pred spec {
    always (
        move or
        done
    )
}

And then we can ask the analyzer to generate traces when spec is true, like this:

example: run { spec } for exactly 5 Node

Improving the visualization

Finally, let’s make the visualization nicer to look at. I didn’t want the tortoise and hare to be rendered by the visualizer as objects separate from the nodes. Instead I wanted them to be annotations on the node.

The analyzer will let you represent fields as attributes, so I could modify the Node signature to add a new field that contains which tokens are currently occupying the node:

sig Node {
    next : lone Node,
    var tokens : set Token // <- new field (I didn't do this)
}

But I didn’t want to add a field to my model.

However, Alloy lets you define a function that returns a Node -> Token relation, and then the visualizer will let you treat this function like it’s a field. This relation is just the inverse of the at relationship that we defined on the Token signature:

// This is so the visualizer can show the tokens as attributes 
// on the nodes
fun tokens[] : Node -> Token {
    ~at
}

Now, in the theme panel of the visualizer, there’s a relation named $tokens.

You can also rename things. In particular, I renamed Tortoise to 🐢 and Hare to 🐇 as well as making them attributes. Here’s a screenshot after the changes:

Here’s an example trace when there’s no cycle. Note how the (Result) changes from Running to NoCycle

Much nicer!

Checking a property

Does our program always terminate? We can check like this:

assert terminates {
    spec => eventually done
}

check terminates for  exactly 5 Node

The analyzer output looks like this:

Solving...
No counterexample found. Assertion may be valid. 9ms.

In general, this sort of check doesn’t guarantee that our program terminates, because our model might be too small.

We can also check for correctness. Here’s how we can ask Alloy to check that there is a cycle in the list if and only if the program eventually terminates with Result=Cycle.

pred has_cycle {
    some n : Node | n in n.^next
}

assert correctness {
     spec => (has_cycle <=> eventually Result=Cycle)
}

check correctness for exactly 5 Node

Note that we’re using the transitive closure (^) in the has_cycle predicate to check if there’s a cycle. That predicate says: “there is some node in that is reachable from itself”.

TTR: the out-of-control metric

I’m currently reading The Machine That Changed The World. This is a book written back in 1990 comparing Toyota’s approach to automobile manufacturing to the approach used by American car manufacturers. It’s one of the earlier books that popularized the concept of lean manufacturing in the United States.

The software world has drawn a lot of inspiration from lean manufacturing over the past two decades, as is clear from the titles of influential software books such as Implementing Lean Software Development by Tom Poppendieck and Mary Poppendieck (2006), The Principles of Product Development Flow: Second Generation Lean Product Development by Don Reinersten (2009), The Lean Startup by Eric Ries (2011), Lean UX by Jeff Gothelf and Josh Sieden (first published in 2013), and Accelerate: The Science of Lean Software by Nicole Forsgren PhD, Jez Humble, and Gene Kim (2018). Another signal is the proliferation of Kanban boards, which are a concept taken from Toyota. I’ve also seen continuous delivery compared to single-piece flow from lean manufacturing, although I suspect that’s more a case of convergent evolution than borrowing.

In The Machine That Changed The World, the authors mention in passing how Toyota uses the five-whys problem identification technique. I had forgotten that five whys has its origins in manufacturing. This post isn’t about five whys, but it is about how applying concepts from manufacturing to incidents can lead us astray, because of assumptions that turn out to be invalid. For that, I’m going to turn to W. Edwards Deming and the idea of statistical control.

Deming & Statistical control

Deming is the famous American statistician who had enormous influence on the Japanese manufacturing industry in the second half of the twentieth century. My favorite book of his is Out of the Crisis, originally published in 1982, which I highly recommend.

One of the topics Deming wrote on was about a process being under statistical control, with the focus of his book being on manufacturing processes in particular. For example, imagine you’re tracking some metric of interest (e.g., defect rate) for a manufacturing process.

(Note: I have no experience in the manufacturing domain, so you should treat this is as a stylized, cartoon-ish view of things).

Deming argued that when a process is under statistical control, focusing on individual defects, or even days, where the defects are higher than average, is a mistake. To make this more concrete, you can compute an upper control limit and lower control limit based on the statistics of the observed data. There is variation inherent in the process, and focusing on the individual data points that happen to be higher than the average won’t lead to actual improvements.

The process with computed upper and lower control limits. This graph is sometimes called as a control chart.

Instead, in order to make an improvement, you need to make a change to the overall system. This is where Toyota’s five-whys would come in, where you’d identify a root cause, a systemic issue behind why the average rate is as high at is. Once you identified a root cause, you’d apply what Deming called the Plan-Do-Check-Act cycle, where you’d come up with an intervention, apply it, observe whether the intervention has actually achieved the desired improvement, and then react accordingly.

I think people have attempted to apply these concepts to improving availability, where time-to-resolve (TTR) is the control metric. But it doesn’t work the way it does in manufacturing. And the reason it doesn’t has everything to do with the idea of statistical control.

Out of control

Now, let’s imagine a control chart that looks a little different.

In the chart above, there are multiple points that are well outside the control limits. This is a process that is not under statistical control.

Deming notes that, when a process is not under statistical control, statistics associated with the process are meaningless:

Students are not warned in classes nor in the books that for analytic purposes (such as to improve a process), distributions and calculations of mean, mode, standard deviation, chi-square, t-test, etc. serve no useful purpose for improvement of a process unless the data were produced in a state of statistical control. – W. Edwards Deming, Out of the Crisis

Now, I’m willing to bet that if you were to draw a control chart for the time-to-resolve (TTR) metric for your incidents, it would look a lot more like the second control chart than the first one, that you’d have a number of incidents whose TTRs are well outside of the upper control limit.

The reason I feel confident saying this is because when an incident is happening, your system is out of control. This actually is a decent rough-and-ready definition of an incident: an event when your system goes out of control.

Time-to-resolve is a measure of how long your system was out of control. But because your system was out of control, then it isn’t a meaningful metric to perform statistical analysis on. As per the Deming quote above, mean-time-to-resolve (MTTR) serves no useful purpose for improvement.

Anyone who does operations work will surely sympathize with the concept that “a system during an incident is not under statistical control”. Incidents are often chaotic affairs, and individual events (a chance ordering by the thread scheduler, a responder who happens to remember a recent Slack message, someone with important knowledge happens to be on PTO) can mean the difference between a diagnosis that takes minutes versus one that takes hours.

As John Allspaw likes to say, a large TTR cannot distinguish between a complex incident handled well and a simple incident handled poorly. There are too many factors that can influence TTR to conclude anything useful from the metric alone.

Conclusion

To recap:

When a system is out of control, statistical analysis on system metrics are useless as a signal for improving the system.
Incidents are, by definition, events when the system is out control.

TTR, in particular, is a metric that only applies when the system is out of control. It’s really just a measure of how long the system was out of control.

Now, this doesn’t mean that we should throw up our hands and say “we can’t do anything to improve our ability to resolve incidents.” It just means that we need to let go of a metrics-based approach.

Think back to Allspaw’s observation: was your recent long incident a complex one handled well or a simple one handled poorly? How would you determine that? What questions would you ask?

Reading the Generalized Isolation Level Definitions paper with Alloy

My last few blog posts have been about how I used TLA+ to gain a better understanding of database transaction consistency models. This post will be in the same spirit, but I’ll be using a different modeling tool: Alloy.

Like TLA+, Alloy is a modeling language based on first-order logic. However, Alloy’s syntax is quite different: defining entities in Alloy feels like defining classes in an object-oriented language, including the ability to define subtypes. It has first class support for relations, which makes it very easy to do database-style joins. It also has a very nifty visualization tool, which can help when incrementally defining a model.

I’m going to demonstrate here how to use it to model and visualize database transaction execution histories, based on the paper Generalized Isolation Level Definitions by Atul Adya, Barbara Liskov and Patrick O’Neil. This is a shorter version of Adya’s dissertation work, which is referenced frequently on the Jepsen consistency models pages.

There’s too much detail in the models to cover in this blog post, so I’m just going to touch on some topics. If you’re interested in more details, all of my models are in my https://github.com/lorin/transactions-alloy repository.

Modeling entities in Alloy

Let’s start with a model with the following entiteis:

objects
transactions (which can commit or abort)
events (read, write, commit, abort)

The diagram below shows some of the different entities in our Alloy model of the paper.

You can think of the above like a class diagram, with the arrows indicating parent classes.

Objects and transactions

Objects and transactions are very simple in our model.

sig Object {}

abstract sig Transaction {}

sig AbortedTransaction, CommittedTransaction extends Transaction {}

The word sig is a keyword in Alloy that means signature. Above I’ve defined transactions as abstract, which means that they can only be concretely instantiated by subtypes.

I have two subtypes, one for aborted transactions, one for committed transactions.

Events

Here are the signatures for the different events:

abstract sig Event {
    tr: Transaction,
    eo: set Event, // event order (partial ordering of events)
    tnext: lone Event // next event in the transaction
}

// Last event in a transaction
abstract sig FinalEvent extends Event {}

sig Commit extends FinalEvent {}

sig Abort extends FinalEvent {}

sig Write extends Event {
    obj: Object,
    wn : WriteNumber
}

sig Read extends Event {
    sees: Write // operation that did the write
}

The Event signature has three fields:

tr – the transaction associated with the event
eo – an event ordering relationship on events
tnext – the next event in the transaction

Reads and writes each have additional fielsd.

Reads

A read has one custom field, sees. In our model, a read operation “sees” a specific write operation. We can follow that relationship to identify the object and the value being written.

Writes and write numbers

The Write event signature has two additional fields:

obj – the object being written
wn – the write number

Following Adya, we model the value of a write by a (transaction, write number) pair. Every time we write an object, we increment the write number. For example, if there are multiple writes to object X, the first write has write number 1, the second has write number 2, and so on.

We could model WriteNumber entities as numbers, but we don’t need full-fledged numbers for this behavior. We just need an entity that has an order defined (e.g., it has to have a “first” element, and each element has to have a “next” element). We can use the ordering module to specify an ordering on WriteNumber:

open util/ordering[WriteNumber] as wo

sig WriteNumber {}

Visualizing

We can use the Alloy visualizer to generate a visualization of our model. To do that, we just need to use the run command. Here’s the simplest way to invoke that command:

run {}

Here’s what the generated output looks like:

Visualization of a history with the default theme

Yipe, that’s messy! We can clean this up a lot by configuring the theme in the visualizer. Here’s the same graph, with different theme settings. I renamed several things (e.g., “Ta” instead of “AbortedTransaction”), I had some relationships I didn’t care about (eo), and I showed some relationships as attributes instead of arcs (e.g., tr).

Visualization after customizing the theme

The diagram above shows two transactions (Ta, Tc). Transaction Ta has a read operation (Read) and a write operation (Write0). Transaction Tc has a write operation. (Write1).

Constraining the model with facts

The history above doesn’t make much sense:

tnext is supposed to represent “next event in the transaction”, but in each transaction, tnext has loops in it
Ta belongs to the set of aborted transactions, but it doesn’t have an abort event
Tc belongs to the set of committed transactions, but it doesn’t have a commit event

We need to add constraints to our model so that it doesn’t generate nonsensical histories. We do this in Alloy by adding facts.

For example, here are some facts:

// helper function that returns set of events associated with a transaction
fun events[t : Transaction] : set Event {
    tr.t
}

fact "all transactions contain exactly one final event" {
    all t : Transaction | one events[t] & FinalEvent
}

fact "nothing comes after a final event" {
    no FinalEvent.tnext
}

fact "committed transactions contain a commit" {
    all t : CommittedTransaction | some Commit & events[t]
}

fact "aborted transactions contain an abort" {
    all t : AbortedTransaction | some Abort & events[t]
}

Now our committed transactions will always have a commit! However, the facts above aren’t sufficient, as this visualization shows:

I won’t repeat all of the facts here, you can see them in the transactions.als file.

Here’s one last example of a fact, encoded from the Adya paper:

The corresponding fact in my model is:

/**
 * If an event wi (xi:m) is followed by ri (xj) without an intervening event wi (xi:n) in E, xj must be xi:m
 */
fact "transaction must read its own writes" {
    all T : Transaction, w : T.events & Write, r : T.events & Read | ({
            w.obj = r.sees.obj
            w->r in eo
            no v : T.events & Write | v.obj = r.sees.obj and (w->v + v->r) in eo
    } => r.sees = w)
}

With the facts specified, our generated histories no longer look absurd. Here’s an example:

Mind you, this still looks like an incorrect history: we have two transactions (Tc1, Tc0) that commit after reading a write from an aborted transaction (Ta).

Installed versions

The paper introduces the concept of a version of an object. A version is installed by a committed transaction that contains a write.

I modeled versions like this.

open util/ordering[VersionNumber] as vo

sig VersionNumber {}

// installed (committed) versions
sig Version {
    obj: Object,
    tr: one CommittedTransaction,
    vn: VersionNumber
}

Dependencies

Once we have versions defined, we can model the dependencies that Adya defines in his paper. For example, here’s how I defined the directly write-depends relationship, which Adya calls ww in his diagrams.

fun ww[] : CommittedTransaction -> CommittedTransaction {
    { disj Ti, Tj : CommittedTransaction | some v1 : installs[Ti], v2 : installs[Tj] | {
        same_object[v1, v2]
        v1.tr = Ti
        v2.tr = Tj
        next_version[v1] = v2
        }
    }
}

Visualizing counterexamples

Here’s a final visualizing example of visualizations. The paper A Critique of ANSI SQL Isolation Levels by Berenson et al. provides some formal definition of different interpretations of the ANSI SQL specification. One of them they call “anomaly serializable (strict interpretation)”.

We can build a model this interpretation in Alloy. Here’s part of it just to give you a sense of what it looks like, see my bbg.als file for the complete model:

pred AnomalySerializableStrict {
    not A1
    not A2
    not A3
}

/**
 * A1: w1[x]...r2[x]...(a1 and c2 in any order)
 */
pred A1 {
    some T1 : AbortedTransaction,
         T2 : CommittedTransaction,
         w1: Write & events[T1],
         r2 : Read & events[T2],
         a1 : Abort & events[T1] | {

       w1->r2 in teo
       r2.sees = w1
       // r2 has to happen before T1 aborts
       r2->a1 in teo
    }
}

...

And then we can ask Alloy to check whether AnomalySerializableStrict implies Adya’s definition of serializability (which he calls isolation level PL-3).

Here’s how I asked Alloy to check this:

assert anomaly_serializable_strict_implies_PL3 {
    always_read_most_recent_write => (b/AnomalySerializableStrict => PL3)
}

check anomaly_serializable_strict_implies_PL3
for 8 but exactly 3 Transaction, exactly 2 Object, exactly 1 PredicateRead, exactly 1 Predicate

Alloy tells me that the assertion is invalid, and shows the following counterexample:

This shows a history that satisfies the anomaly serializable (strict) specificaiton, but not Adya’s PL-3. Note that the Alloy visualizer has generated a direct serialization graph (DSG) in the bottom left-hand corner, which contains a cycle.

Predicate reads

This counterexample involves predicate reads, which I hadn’t shown modeled before, but they look like this:

// transactions.als

abstract sig Predicate {}

abstract sig PredicateRead extends Event {
    p : Predicate,
    objs : set Object
}

// adya.als

sig VsetPredicateRead extends PredicateRead {
    vset : set Version
}

sig VsetPredicate extends Predicate {
    matches : set Version
}

A predicate read is a read that returns a set of objects.

In Adya’s model, a predicate read takes as input a version set (a version for each object) and then determines which objects should be included in the read based on whether or not the versions match the predicate:

fact "objects in predicate read are the objects that match in the version set" {
    all pread : VsetPredicateRead |
        pread.objs = (pread.vset & pread.p.matches).obj
}

One more counterexample

We can also use Alloy to show us when a transaction would be permitted by Adya’s PL-3, but forbidden by the broad interpretation of anomaly serializability:

check PL3_implies_anomaly_serializable_broad
for 8 but exactly 3 Transaction, exactly 2 Object, exactly 1 PredicateRead, exactly 1 Predicate

assert PL3_implies_anomaly_serializable_broad {
    always_read_most_recent_write => (PL3 => b/AnomalySerializableBroad)
}

The example above shows the “gnext” relation, which yields a total order across events.

The resulting counterexample is two aborted transactions! Those are trivially serializable, but it is ruled out by the broad definition, specifically, P3:

Playing with Alloy

I encourage you to try Alloy out. There’s a great Visual Studio code plugin that will let you execute your Alloy models from VS Code, that’s what I’ve been using. It’s very easy to get started with Alloy, because you can get it to generate visualizations for you out of the simplest models.

For more resources, Hillel Wayne has written a set of Alloy docs that I often turn to. There’s even an entire book on Alloy written by its creator. (Confusingly, though, the book does not have the word Alloy in the name).

Extending MVCC to be serializable, in TLA+

In the previous blog post, we saw how a transaction isolation strategy built on multi-version concurrency control (MVCC) does not implement the serializable isolation level. Instead, it implements a weaker isolation level called snapshot isolation. In this post, I’ll discuss how that MVCC model can be extended in order to achieve serializability, based on work published by Michael Cahill, Uwe Röhm, and Alan Fekete.

You can find the model I wrote in the https://github.com/lorin/snapshot-isolation-tla repo, in the SSI module (source, pdf).

A quick note on conventions

In this post, I denote read of x=1 as r[x,1]. This means a transaction read the object x which returned a value of 1. As I mentioned in the previous post, you can imagine a read as being the following SQL statement:

SELECT v FROM obj WHERE k='x';

Similarly, I denote a write of y←2 as w[y,2]. This means a transaction wrote the object y with a value of 2. You can imagine this as:

UPDATE obj SET v=2 WHERE k='y';

Finally, I’ll assume that there’s an initial transaction (T0) which sets the values of all of the objects to 0, and has committed before any other transaction starts.

We assume this transaction always precedes all other transactions

Background

The SQL isolation levels and phenomena

The ANSI/ISO SQL standard defines four transaction isolation levels: read uncommitted, read committed, repeatable read, and serializable. The standard defines the isolation levels in terms of the phenomena they prevent. For example, the dirty read phenomenon is when one transaction reads a write done by a concurrent transaction that has not yet committed. Phenomena are dangerous because they may violate a software developer’s assumptions about how the database will behave, leading to software that behaves incorrectly.

Problems with the standard and a new isolation level

Berenson et al. noted that the standard’s wording is ambiguous, and of the two possible interpretations of the definitions, one was incorrect (permitting invalid execution histories) and the other was overly strict (proscribing valid execution histories).

The overly strict definition implicitly assumed that concurrency control would be implemented using locking, and this ruled out valid implementations based on alternate schemes, in particular, multi-version concurrency control. They also proposed a new isolation level: snapshot isolation.

Formalizing phenomena and anti-dependencies

In his PhD dissertation work, Adya introduced a new formalization for reasoning about transaction isolation. The formalism is based on a graph of direct dependencies between transactions.

One type of dependency Adya introduced is called an anti-dependency, which is critical to the difference between snapshot isolation and serializable.

An anti-dependency between two concurrent transactions is when one read an object and the other writes the object with a different value, for example:

T1 is said to have an anti-dependency on T2: T1 must come before T2 in a serialization:

If T2 is sequenced before T1, then the read will not match the most recent write. Therefore, T1 must come before T2.

In dependency graphs, anti-dependencies are labeled with rw because the transaction which does the read must be sequenced before the transaction that does the write, as shown above.

Adya demonstrated that for an implementation that supports snapshot isolation to generate execution histories that are not serializable, there must be a cycle in the dependency graph that includes an anti-dependency.

Non-serializable execution histories in snapshot isolation

In the paper Making Snapshot Isolation Serializable, Fekete et al. further narrowed the conditions under which snapshot isolation could lead to a non-serializable execution history, by proving the following theorem:

THEOREM 2.1. Suppose H is a multiversion history produced under Snapshot Isolation that is not serializable. Then there is at least one cycle in the serialization graph DSG(H), and we claim that in every cycle there are three consecutive transactions Ti.1, Ti.2, Ti.3 (where it is possible that Ti.1 and Ti.3 are the same transaction) such that Ti.1 and Ti.2 are concurrent, with an edge Ti.1 → Ti.2, and Ti.2 and Ti.3 are concurrent with an edge Ti.2 → Ti.3.

They also note:

By Lemma 2.3, both concurrent edges whose existence is asserted must be anti-dependencies:Ti.1→ Ti.2 and Ti.2→ Ti.3.

This means that one of the following two patterns must always be present in a snapshot isolation history that is not serializable:

Non-serializable snapshot isolation histories must contain one of these as subgraphs in the dependency graph

Modifying MVCC to avoid non-serializable histories

Cahill et al. proposed a modification to MVCC that can dynamically identify potential problematic transactions that could lead to non-serializable histories, and abort them. By aborting these transactions, the resulting algorithm guarantees serializability.

As Fekete et al. proved, under snapshot isolation, cycles can only occur if there exists a transaction which contains an incoming anti-dependency edge and an outgoing anti-dependency edge, which they call pivot transactions.

Their approach is to identify and abort pivot transactions: if an active transaction contains both an outgoing and an incoming anti-dependency, the transaction is aborted. Note that this is a conservative algorithm: some of the transactions that it aborts may have still resulted in serializable execution histories. But it does guarantee serializability.

Their modification to MVCC involves some additional bookkeeping:

Reads performed by each transaction
Which transactions have outgoing anti-dependencies
Which transactions have incoming anti-dependencies

The tracking of reads is necessary to identify the presence of anti-dependencies, since an anti-dependency always involve a read (outgoing dependency edge) and a write (incoming dependency edge).

Extending our MVCC TLA+ model for serializability

Adding variables

I created a new module called SSI, which stands for Serializable Snapshot Isolation. I extended the MVCC model to add three variables to implement the additional bookkeeping required by the Cahill et al. algorithm. MVCC already tracks which objects are written by each transaction, but we need to now also track reads.

rds – which objects are read by which transactions
outc – set of transactions that have outbound anti-dependencies
inc – set of transactions that have inbound anti-dependencies

TLA+ is untyped (unless you’re using Apalache), but we can represent type information by defining a type invariant (above, called TypeOkS). Defining this is useful both for the reader, and because we can check that this holds with the TLC model checker.

Changes in behavior: new abort opportunities

Here’s how the Next action in MVCC compares to the equivalent in SSI.

Note: Because extending the MVCC module brings all of the MVCC names into scope, I had to create new names for each of the equivalent actions in SSI, I did this by appending an S (e.g., StartTransactionS, DeadlockDetectionS).

In our original MVCC implementation, reads and commits always succeeded. Now, it’s possible for an attempted read or an attempted to commit to result in aborts as well, so we needed an action for this, which I called AbortRdS.

Commits can now also fail, so instead of having a single-step Commit action, we now have a BeginCommit action, which will complete successfully by an EndCommit action, or fail with an abort by the AbortCommit action. Writes can also now abort due to the potential for introducing pivot transactions.

Finding aborts with the model checker

Here’s how I used the TLC model checker to generate witnesses of the new abort behaviors:

Aborted reads

To get the model checker to generate a trace for an aborted read, I defined the following invariant in the MCSSI.tla file:

Then I specified it as an invariant to check in the model checker in the MCSSI.cfg file:

INVARIANT 
    NeverAbortsRead

Because aborted reads can, indeed, happen, the model checker returned an error, with the following error trace:

The resulting trace looks like this, with the red arrows indicating the anti-dependencies.

Aborted commit

Similarly, we can use the model checker to identify scenarios where it a commit would fail, by specifying the following invariant:

The checker finds the following violation of that invariant:

While T2 is in the process of committing, T1 performs a read which turns T2 into a pivot transaction. This results in T2 aborting.

Checking serializability using refinement mapping

Just like we did previously with MVCC, we can define a refinement mapping from our SSI spec to our Serializability spec. You can find it in the SSIRefinement module (source, pdf). It’s almost identical to the MVCCRefinement module (source, pdf), with some minor modifications to handle the new abort scenarios.

The main difference is that now the refinement mapping should actually hold, because SSI ensures serializability! I wasn’t able to find a counterexample when I ran the model checker against the refinement mapping, so that gave me some confidence in my model. Of course, that doesn’t prove that my implementation is correct. But it’s good enough for a learning exercise.

Coda: on extending TLA+ specifications

Serializable Snapshot Isolation provides us with a nice example of when we can extend an existing specification rather than create a new one from scratch.

Even so, it’s still a fair amount of work to extend an existing specification. I suspect it would have been less work to take a copy-paste-and-modify approach rather than extending it. Still, I found it a useful exercise in learning how to modify a specification by extending it.