MTTR: When sample means and power laws combine, trouble follows

Think back on all of the availability-impacting incidents that have occurred in your organization over some decent-sized period, maybe a year or more. Is the majority of the overall availability impact due to:

  1. A large number of shorter-duration incidents, or
  2. A small number of longer-duration incidents?

If you answered (2), then this suggests that the time-to-resolve (TTR) incident metric in your organization exhibits a power law distribution. This fact has implications for how well the sample mean of a collection of observed incidents approximates the population (true) mean of the underlying TTR process. This sample mean is commonly referred to as the infamous MTTR (mean-time-to-resolve) metric.

Now, I personally believe that incident durations are power-law distributed, and, consequently, I believe that observed MTTR trends convey no useful information at all.

But rather than simply asserting that power-law distributions render MTTR useless, I wanted to play with some power-law-distributed data to validate what I had read about power laws. And, to be honest, I wanted an excuse to play with Jupyter notebooks.

Caveat: I’m not a statistician, meaning I’m not a domain expert here. However, hopefully this analysis is simple enough that I haven’t gotten myself into too much trouble.

You can find my Jupyter notebook here: https://github.com/lorin/mttr/blob/main/power-law.ipynb. You can view it with the images rendered here: https://nbviewer.org/github/lorin/mttr/blob/main/power-law.ipynb

Thinking through a toy example

This post is going to focus entirely on a toy example: I’m going to use entirely made-up data. Let’s consider two candidate distributions for TTR data: a normal-tailed distribution and a fat-tailed distribution.

The two distributions I used were the Poisson distribution and the Zeta distribution, with both distributions having the same population mean. I arbitrarily picked 15 minutes as the population mean for each distribution.

Poisson

For the normal-tailed distribution, the natural pick would be the Gaussian (aka normal) distribution. But the Gaussian is defined on (-\infty, +\infty), and I want a distribution where the probability of a negative TTR is zero. I could just truncate the Gaussian, but I decided to go with the Poisson distribution instead, because it doesn’t go negative. Note that the Poisson distribution is discrete where the Gaussian is continuous. The Poisson also has the nice property that it is characterized by only a single parameter (which is called μ in scipy.stats.poisson). This makes this exercise simpler, because there’s one fewer parameter I need to deal with. For the Poisson, the mean is equal to μ, so we have μ=15 as our parameter.

Zeta (Zipf)

For the fat-tailed distribution, the natural pick would be the Pareto distribution. The Pareto distribution is zero for all negative values, so we don’t have the conceptual problem that we had with the Gaussian. But it felt a little strange to use a discrete distribution for the normal-tailed case and a continuous distribution for the fat-tailed case. So I decided to go with a discrete power-law distribution, the zeta distribution. This also goes by the name Zipf distribution (of Zipf’s law fame), which is what SciPy calls it: scipy.stats.zipf. This distribution has a single parameter, which SciPy calls a.

I wanted to find the parameter a such that the mean of the distribution was 15. The challenge is that the mean for the zeta distribution is (assume Z is a zeta-distributed random variable):

E[Z] = \frac{\zeta(a-1)}{\zeta(a)}

where ζ(a) is the Riemann zeta function, which is defined as:

\zeta(a) = \displaystyle\sum_{n=1}^{\infty} \frac{1}{n^a}

Because of this, I couldn’t just analytically solve for a. What I ended up doing was manually executing the equivalent of a binary search to find a value for the parameter a such that the mean for the distribution was close to 15, which was good enough for my purposes.
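
Rather than searching by hand, you could also let SciPy do the root-finding. Here’s a minimal sketch of that approach (this isn’t what my notebook does; scipy.optimize.brentq and the bracketing interval are my choices):

from scipy.optimize import brentq
from scipy.stats import zipf

target_mean = 15

# f(a) is zero when the zeta distribution's mean equals the target.
# The mean is infinite for a <= 2, so we bracket the root just above 2.
def f(a):
    return zipf.mean(a) - target_mean

a = brentq(f, 2.001, 3.0)
print(a, zipf.mean(a))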

You can use the stats method to get the mean of a distribution:

>>> from scipy.stats import zipf
>>> a = 2.04251395975
>>> zipf.stats(a, moments='m')
np.float64(15.000000015276777)

Close enough!

Visualizing the distributions

Since these two distributions are discrete, they are characterized by what are called probability mass functions (pmfs). The nice thing about pmfs is that you can interpret the y-axis values directly as probabilities, whereas in the continuous case you are working with probability density functions (pdfs), where you need to integrate under the pdf over an interval to get a probability.

Here’s what the two pmfs look like. I plotted them with the same scales to make it easier to compare them visually.


Note how the Poisson pmf has its peak around the mean (15 minutes), whereas the zeta pmf has its peak at 1 minute and then decreases monotonically.

I truncated the x-axis to 40 minutes, but both distributions theoretically extend out to +∞. Let’s take a look at what the distributions look like further out into the tail, in a window centered around 120 minutes:


I didn’t plot the two graphs on the same scale in this case, due to the enormous difference in magnitudes. For the Poisson distribution, an incident of 100 minutes has a probability on the order of 10^-47, which is fantastically small. For the zeta distribution, an incident of 100 minutes is on the order of 10^-5, which is 42 orders of magnitude more likely than the Poisson!

You can also see how the zeta distribution falls off much more slowly than the Poisson.
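
You can check those orders of magnitude directly from the pmfs. A quick sketch, reusing the parameter value from earlier:

from scipy.stats import poisson, zipf

a = 2.04251395975
print(poisson.pmf(100, mu=15))  # on the order of 1e-47
print(zipf.pmf(100, a))         # on the order of 1e-5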

Looking at random samples

I generated 5,000 random samples from each distribution to get a feel for what the distributions look like.


Once again, I’ve used different scales because the ranges are so different. I also plotted them both on a log-scale, this time using the same scale for both.


Samples from the Poisson distribution are densest near the population mean (15 minutes). There are a few outliers, but they don’t deviate too far from the mean.

Samples from the zeta distribution are densest around 1 minute, but spill much further out.

MTTR trends over time

Now, let’s imagine that we look at the MTTR (the sample mean of our TTR data) at regular intervals, where an interval might be a month, a quarter, a year, or whatever cadence your org uses.

To be concrete, I assumed that we have 25 data points per interval. So, for simplicity, we’re assuming that we have exactly 25 incidents per interval, and we’re computing the MTTR at each interval, which is the average of the TTRs of those 25 samples. With this in mind, let’s look at what the MTTR looks like over time.
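
Here’s roughly how you can generate that kind of data with SciPy. This is a sketch rather than the notebook’s exact code, and the number of intervals is arbitrary:

from scipy.stats import poisson, zipf

a = 2.04251395975          # zeta parameter with mean of about 15 (from earlier)
mu = 15                    # Poisson mean
intervals = 40             # e.g., 40 "months"
incidents_per_interval = 25

# one row of TTRs per interval, one column per incident
poisson_ttrs = poisson.rvs(mu, size=(intervals, incidents_per_interval))
zipf_ttrs = zipf.rvs(a, size=(intervals, incidents_per_interval))

# the "MTTR" reported for each interval is the sample mean of that interval's TTRs
poisson_mttr = poisson_ttrs.mean(axis=1)
zipf_mttr = zipf_ttrs.mean(axis=1)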

I’ll use the same axis for both graphs. I’ve drawn the population mean (15 minutes) as a dashed red line.


Which one of these looks more like your incident data?

What does the trend convey?

Let’s take a closer look at the data from the zeta samples. Remember that each point represents an MTTR from 25 data points collected over some interval of interest. Let’s pretend these data points represent months, so let’s look at the first 12 of them:


I imagine that someone looking at this MTTR would come to the conclusion that:

  • We did very well in months 1 and 2 (MTTR below 5 minutes!)
  • We saw a regression in month 3 (MTTR about 12 minutes)
  • We saw steady improvements from months 3 to 8 (went back to under 5 minutes)
  • Things didn’t really change from months 8 to 11 (MTTR stayed at or under 5 minutes)
  • We got much, much worse in month 12

The problem with this conclusion is that it’s completely wrong: every single data point was drawn from the same distribution. The graph is misleading because it implies changes over time to your TTR distribution that are not there.

Are incidents power-law distributed? What does your data tell you?

This post was just a toy exercise using synthetic data, but I hope it demonstrates how, if incident durations are power-law distributed, looking at MTTR trends over time will lead to incorrect conclusions.

Now, I believe that incident durations are power-law distributed, but I haven’t provided any evidence for that in this post. If you have access to internal incident data in your organization, I encourage you to take a look at the distribution to see whether there’s evidence that it is power-law distributed: is most of the total availability impact from larger incidents, or from smaller ones?

Further reading

Here are a few other sources on this topic that I’ve found insightful.

Incident Metrics in SRE by Štěpán Davidovič

Štěpán Davidovič of Google did an analysis of real incident data to examine how incident durations were distributed. He also ran Monte Carlo simulations to see whether a change in the true mean (e.g., an intervention that improved TTR) could be identified from the data, and how likely it was to conclude that the system had changed when it actually hadn’t.

He observed that the data doesn’t appear normally distributed. Similar to the analysis I did here, he showed that MTTR trends could mislead people into believing that change had occurred when it hadn’t:

We’ve learned that even without any intentional change to the incident durations, many simulated universes would make you believe that the MTTR got much shorter—or much longer—without any structural change. If you can’t tell when things aren’t changing, you’ll have a hard time telling when they do.

He published this work as a freely available O’Reilly mini-book, available here: https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/

Doing statistics under fat tails by Nassim Nicholas Taleb

The author Nassim Nicholas Taleb is… let’s say… unpopular in the resilience engineering community. As an example, see Casey Rosenthal’s post Antifragility is a Fragile Concept. I think Taleb’s caustic style does him no favors. However, I have found him to be the best source on the shortcomings of using common statistical techniques when sampling from fat tailed distributions.

For example, in his paper How Much Data Do You Need? A Pre-asymptotic Metric for Fat-tailedness, he obtains a result showing that, if you want to estimate the population mean from the sample mean when sampling from a power-law distribution (in this case, an 80/20 Pareto distribution), you need more than 10^9 times as many observations as you would if you were sampling from a Gaussian distribution.

Now, if you have more than one billion times more incidents than the average organization, then MTTR may provide you with a reasonable estimate of the true mean of your underlying TTR distribution! But I suspect that most readers don’t fall into the over-a-billion-incidents bucket. (If you do, please reach out: I’d love to hear about this!)

Taleb maintains a collection of his papers on this topic here: https://www.fooledbyrandomness.com/FatTails.html

Moving past shallow incident data by John Allspaw

If not MTTR, then what? The canonical answer to this question is this blog post by John Allspaw from back in 2018, entitled Moving Past Shallow Incident Data.

And, of course, I invite readers to continue reading this humble blog on the topic of gathering deeper insights from incidents.

Quick takes on the latest Cloudflare public incident write-up

Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception: Cloudflare incident on November 14, 2024, resulting in lost logs.

I wanted to make some quick observations about how we see some common incident patterns here. All of the quotes are from the original Cloudflare post.

Saturation (overload)

In this case, a misconfiguration in one part of the system caused a cascading overload in another part of the system, which was itself misconfigured. 

A very common failure mode in incidents is when the system reaches some limit, where it cannot keep up with the demands put upon it. The blog post uses the term overload, and often you hear the term resource exhaustion. Brendan Gregg uses the term saturation in his USE method for analyzing system performance.

A short temporary misconfiguration lasting just five minutes created a massive overload that took us several hours to fix and recover from.

The resilience engineering researcher David Woods uses the term saturation in a more general sense, to refer to a system being in a state where it can no longer meet the demands put upon it. The challenge of managing the risk of saturation is a key part of his theory of graceful extensibility.

It’s genuinely surprising how many incidents involve saturation, and how difficult it can be to recover when the system saturates.

This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle. 

For other examples, see some of these other posts I’ve written:

When safety mechanism make things worse (Lorin’s law)

In a previous blog post entitled A conjecture on why reliable systems fail, I wrote:

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

In this case, it was a failsafe mechanism that enabled the saturation failure mode (emphasis in the original):

This bug essentially informed Logfwdr that no customers had logs configured to be pushed. The team quickly noticed the mistake and reverted the change in under five minutes.

Unfortunately, this first mistake triggered a second, latent bug in Logfwdr itself. A failsafe introduced in the early days of this feature, when traffic was much lower, was configured to “fail open”. This failsafe was designed to protect against a situation when this specific Logfwdr configuration was unavailable (as in this case) by transmitting events for all customers instead of just those who had configured a Logpush job. This was intended to prevent the loss of logs at the expense of sending more logs than strictly necessary when individual hosts were prevented from getting the configuration due to intermittent networking errors, for example.

Note: I had not yet read the Cloudflare writeup when I originally posted this!

Automated safety mechanisms themselves add complexity, and we are no better at implementing bug-free safety code than we are at implementing bug-free feature code. The difference is that when safety mechanisms go awry, they tend to be much more difficult to deal with, as we saw here.

I’m not opposed to automatic safety mechanisms! For example, I’m a big fan of autoscalers, which are an example of an automated safety mechanism. But it’s important to be aware that there’s a tradeoff: they prevent simpler incidents but enable new, complex incidents. The lesson I take away is that we need to get good at dealing with complex incidents, where these safety mechanisms will inevitably contribute to the problem.

Complex interactions (multiple contributing factors)

Unfortunately, this first mistake triggered a second, latent bug in Logfwdr itself.

(Emphasis mine)

I am a card-carrying member of the “no root cause” club: I believe that all complex systems failures result from the interaction of multiple contributors that all had to be present for the incident to occur and to be as severe as it was.

When this failsafe was first introduced, the potential list of customers was smaller than it is today. 

In this case, we see the interaction of multiple bugs:

Even given this massive overload, our systems would have continued to send logs if not for one additional problem. Remember that Buftee creates a separate buffer for each customer with their logs to be pushed. When Logfwdr began to send event logs for all customers, Buftee began to create buffers for each one as those logs arrived, and each buffer requires resources as well as the bookkeeping to maintain them. This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle. 

(Emphasis mine)

 A huge increase in the number of buffers is a failure mode that we had predicted, and had put mechanisms in Buftee to prevent this failure from cascading.  Our failure in this case was that we had not configured these mechanisms.  Had they been configured correctly, Buftee would not have been overwhelmed.

The two issues that the authors explicitly call out in the (sigh) root causes section are:

  • A bug that resulted in a blank configuration being provided to Logfwdr
  • Incorrect Buftee configuration for preventing failure cascades

However, these other factors also enabled the incident:

  • The presence of failsafe (fail open) behavior
  • The increase in size of the potential list of customers over time
  • Buftee implementation that creates a separate buffer for each customer with logs to be pushed
  • The amount of load that Buftee was provisioned to handle

I’ve written about the problems with the idea of root cause several times in the past, including:

Keep an eye out for those patterns!

In your own organization, keep an eye out for patterns like saturation, when safety mechanisms make things worse, and complex interactions. They’re easy to miss if you aren’t explicitly looking for them.

The Tortoise and the Hare in Alloy

If you’ve done your share of leetcode-style interviewing, and you’re above a certain age, you may have been asked during a technical screen to write a program that determines if a linked list contains a cycle. If the interviewer was really tough on you, they might have asked how to implement this in O(1) space.

There’s a well-known algorithm for finding cycles in linked lists in O(1) space, attributed to Robert Floyd, called the tortoise and the hare. I’ve previously written about modeling this algorithm in TLA+. In this post, I’m going to do it in Alloy. Version 6 of Alloy added support for temporal operators, which makes it easier to write TLA+-style models, with the added benefit of Alloy’s visualizer. This was really just an excuse for me to play with these operators.

You can find my model at https://github.com/lorin/tortoise-hare-alloy/

Brief overview of the algorithm

Basic strategy: define two pointers that both start at the head of the list. At each iteration, you advance one of the pointers (the tortoise) by one step, and the other (the hare) by two steps. If the hare reaches the tail of the list, there are no cycles. If the tortoise and the hare ever point to the same node, there’s a cycle.
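
For reference, here’s a minimal Python sketch of the algorithm (the Node class here is just a stand-in for whatever linked-list representation you happen to have):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def has_cycle(head):
    tortoise = hare = head
    while hare is not None and hare.next is not None:
        tortoise = tortoise.next   # advance one step
        hare = hare.next.next      # advance two steps
        if tortoise is hare:
            return True            # they met: there's a cycle
    return False                   # the hare fell off the tail: no cycle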

With that out of the way, let’s model this algorithm in Alloy!

Linked lists

Let’s start off by modeling linked lists. Here’s the basic signature.

sig Node {
    next : lone Node
}

Every linked list has a head. Depending on whether there’s a cycle, it may or may not have a tail. But we do know that it has at most one tail.

one sig Head in Node {}
lone sig Tail in Node {}

Let’s add a fact about the head, using Alloy’s reflexive transitive closure operator (*).

fact "all nodes are reachable from the head" {
    Node in Head.*next
}

You can think of Head.*next as meaning “every node that is reachable from Head, including Head itself”.

Finally, we’ll add a fact about the tail:

fact "the tail is the only node without a successor" {
    all n : Node | no n.next <=> n = Tail
}

We can now use Alloy to generate some instances for us to look at. Here’s how to tell Alloy to generate an instance of the model that contains 5 nodes and has a tail:

acyclic: run { some Tail } for exactly 5 Node

This is what we see in the visualizer:

We can also tell Alloy to generate an instance without a tail:

cycle: run { no Tail } for exactly 5 Node

Here are three different instances without tails:

Tortoise and hare tokens

The tortoise and the hare are pointers to the nodes. However, I like to think of them like tokens moving along a game board, so I called them tokens. Here’s how I modeled them:

abstract sig Token {
    var at : Node
}

one sig Tortoise, Hare extends Token {}

Note that the Token.at field has a var prefix. That’s new in Alloy 6, and it means that the field can change over time.

As in TLA+, we need to specify the initial state for variables that change over time. Both tokens start at the head, which we can express as a fact.

fact init {
    Token.at in Head
}

Next, as in TLA+, we need to model how variables change over time.

Here’s the predicate that’s true whenever the tortoise and hare take a step. Alloy uses the same primed variable notation as TLA+ to refer to “the value of the variable in the next state”. In TLA+, we’d call this kind of predicate an action, because it contains primed variables:

pred move {
    Tortoise.at' = advance[Tortoise.at]
    Hare.at' = advance[advance[Hare.at]]
}

This predicate uses a helper function I wrote called advance which takes a pointer to a node and advances to the next node, unless it’s at the tail, in which case it stays where it is:

fun advance[n : Node] : Node {
    (n = Tail) implies n else n.next
}

We can run our model like this, using the always temporal operator to indicate that the move predicate is true at every step.

run {always move} for exactly 5 Node

Here’s a screenshot of Alloy’s visualizer UI for one of the traces. You can see that there are 5 states in the trace, and it’s currently displaying state 2.

Here are all of the states in the trace:

It’s confusing to follow what’s happening over time because Alloy re-arranges the layout of the nodes at the different steps. We’ll see later in this post how we can configure the visualizer to make it easier to follow.

Output of the algorithm

So far we’ve modeled the movement of the tortoise and hare tokens, but we haven’t fully modeled the algorithm, because we haven’t modeled the return value, which is supposed to indicate whether there’s a cycle or not.

I modeled the return value as a Result signature, like this:

abstract sig CycleStatus {}
one sig Cycle, NoCycle, Running extends CycleStatus {}

var one sig Result in CycleStatus {}

You can think of Result as being like an enum, which can take on values of either Cycle, NoCycle, or Running.

Note that Result has a var in front, meaning it’s a variable that can change over time. It starts off in the Running state, so let’s augment our init fact:

fact init {
    Token.at in Head
    Result = Running
}

Let’s also define termination for this algorithm. Our algorithm is done when Result is either Cycle or NoCycle. Once it’s done, we no longer need to advance the tortoise and hare pointers. We also don’t change the result once the program has terminated.

pred done {
    Result in Cycle + NoCycle
    Tortoise.at' = Tortoise.at
    Hare.at' = Hare.at
    Result' = Result
}

We need to update the move action so that it updates our Result variable. We also don’t want the move action to be enabled when the algorithm is done (no need to keep advancing the pointers), so we’ll add an enabling condition. As a result, move now looks like this:

pred move {
  // enabling condition
  Result = Running

  // advance the pointers
  Tortoise.at' = advance[Tortoise.at]
  Hare.at' = advance[advance[Hare.at]]

  // update Result if the hare has reached the tail 
  // or tortoise and hare are at the same node
  Hare.at' = Tail implies Result' = NoCycle
                  else    Hare.at' = Tortoise.at' implies Result' = Cycle
                                                  else    Result' = Result
}

The syntax for updating the Result isn’t as nice as in TLA+: Alloy doesn’t have a case statement. It doesn’t even have an if statement: instead we use implies/else to achieve if/else behavior.

We can now define the full spec like this:

pred spec {
    always (
        move or
        done
    )
}

And then we can ask the analyzer to generate traces when spec is true, like this:

example: run { spec } for exactly 5 Node

Improving the visualization

Finally, let’s make the visualization nicer to look at. I didn’t want the tortoise and hare to be rendered by the visualizer as objects separate from the nodes. Instead I wanted them to be annotations on the node.

The analyzer will let you represent fields as attributes, so I could modify the Node signature to add a new field that contains which tokens are currently occupying the node:

sig Node {
    next : lone Node,
    var tokens : set Token // <- new field (I didn't do this)
}

But I didn’t want to add a field to my model.

However, Alloy lets you define a function that returns a Node -> Token relation, and then the visualizer will let you treat this function like it’s a field. This relation is just the inverse of the at relationship that we defined on the Token signature:

// This is so the visualizer can show the tokens as attributes 
// on the nodes
fun tokens[] : Node -> Token {
    ~at
}

Now, in the theme panel of the visualizer, there’s a relation named $tokens.

You can also rename things. In particular, I renamed Tortoise to 🐢 and Hare to 🐇 as well as making them attributes. Here’s a screenshot after the changes:

Here’s an example trace when there’s no cycle. Note how the Result changes from Running to NoCycle.

Much nicer!

Checking a property

Does our program always terminate? We can check like this:

assert terminates {
    spec => eventually done
}

check terminates for exactly 5 Node

The analyzer output looks like this:

Solving...
No counterexample found. Assertion may be valid. 9ms.

In general, this sort of check doesn’t guarantee that our program terminates, because our model might be too small.

We can also check for correctness. Here’s how we can ask Alloy to check that there is a cycle in the list if and only if the program eventually terminates with Result=Cycle.

pred has_cycle {
    some n : Node | n in n.^next
}

assert correctness {
     spec => (has_cycle <=> eventually Result=Cycle)
}

check correctness for exactly 5 Node

Note that we’re using the transitive closure (^) in the has_cycle predicate to check if there’s a cycle. That predicate says: “there is some node that is reachable from itself”.

Further reading

For more on Alloy 6, check out the following:

Alloy’s very easy to get started with: I recommend you give it a go!

TTR: the out-of-control metric

I’m currently reading The Machine That Changed The World. This is a book written back in 1990 comparing Toyota’s approach to automobile manufacturing to the approach used by American car manufacturers. It’s one of the earlier books that popularized the concept of lean manufacturing in the United States.

The software world has drawn a lot of inspiration from lean manufacturing over the past two decades, as is clear from the titles of influential software books such as Implementing Lean Software Development by Tom Poppendieck and Mary Poppendieck (2006), The Principles of Product Development Flow: Second Generation Lean Product Development by Don Reinertsen (2009), The Lean Startup by Eric Ries (2011), Lean UX by Jeff Gothelf and Josh Seiden (first published in 2013), and Accelerate: The Science of Lean Software and DevOps by Nicole Forsgren PhD, Jez Humble, and Gene Kim (2018). Another signal is the proliferation of Kanban boards, which are a concept taken from Toyota. I’ve also seen continuous delivery compared to single-piece flow from lean manufacturing, although I suspect that’s more a case of convergent evolution than borrowing.

In The Machine That Changed The World, the authors mention in passing how Toyota uses the five-whys problem identification technique. I had forgotten that five whys has its origins in manufacturing. This post isn’t about five whys, but it is about how applying concepts from manufacturing to incidents can lead us astray, because of assumptions that turn out to be invalid. For that, I’m going to turn to W. Edwards Deming and the idea of statistical control.

Deming & Statistical control

Deming is the famous American statistician who had enormous influence on the Japanese manufacturing industry in the second half of the twentieth century. My favorite book of his is Out of the Crisis, originally published in 1982, which I highly recommend.

One of the topics Deming wrote about was the idea of a process being under statistical control, with the focus of his book being on manufacturing processes in particular. For example, imagine you’re tracking some metric of interest (e.g., defect rate) for a manufacturing process.

(Note: I have no experience in the manufacturing domain, so you should treat this as a stylized, cartoon-ish view of things).

Deming argued that when a process is under statistical control, focusing on individual defects, or even on individual days where the defect rate is higher than average, is a mistake. To make this more concrete, you can compute an upper control limit and a lower control limit based on the statistics of the observed data. There is variation inherent in the process, and focusing on the individual data points that happen to be higher than the average won’t lead to actual improvements.

The process with computed upper and lower control limits. This graph is sometimes called a control chart.
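
As a rough sketch of the mechanics, here’s one way to compute the limits in Python, using the textbook three-sigma rule (the SPC literature has more careful ways of estimating limits for particular chart types):

import numpy as np

def control_limits(samples):
    """Return (lower, upper) control limits at +/- 3 standard deviations."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    sigma = samples.std(ddof=1)
    return mean - 3 * sigma, mean + 3 * sigma

def out_of_control(samples):
    """Indices of data points that fall outside the control limits."""
    lower, upper = control_limits(samples)
    samples = np.asarray(samples, dtype=float)
    return np.where((samples < lower) | (samples > upper))[0]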

Instead, in order to make an improvement, you need to make a change to the overall system. This is where Toyota’s five-whys would come in, where you’d identify a root cause, a systemic issue behind why the average rate is as high as it is. Once you identified a root cause, you’d apply what Deming called the Plan-Do-Check-Act cycle, where you’d come up with an intervention, apply it, observe whether the intervention actually achieved the desired improvement, and then react accordingly.

I think people have attempted to apply these concepts to improving availability, where time-to-resolve (TTR) is the control metric. But it doesn’t work the way it does in manufacturing. And the reason it doesn’t has everything to do with the idea of statistical control.

Out of control

Now, let’s imagine a control chart that looks a little different.

In the chart above, there are multiple points that are well outside the control limits. This is a process that is not under statistical control.

Deming notes that, when a process is not under statistical control, statistics associated with the process are meaningless:

Students are not warned in classes nor in the books that for analytic purposes (such as to improve a process), distributions and calculations of mean, mode, standard deviation, chi-square, t-test, etc. serve no useful purpose for improvement of a process unless the data were produced in a state of statistical control. – W. Edwards Deming, Out of the Crisis

Now, I’m willing to bet that if you were to draw a control chart for the time-to-resolve (TTR) metric for your incidents, it would look a lot more like the second control chart than the first one: you’d have a number of incidents whose TTRs are well outside of the upper control limit.

The reason I feel confident saying this is because when an incident is happening, your system is out of control. This actually is a decent rough-and-ready definition of an incident: an event when your system goes out of control.

Time-to-resolve is a measure of how long your system was out of control. But because your system was out of control, it isn’t a meaningful metric to perform statistical analysis on. As per the Deming quote above, mean-time-to-resolve (MTTR) serves no useful purpose for improvement.

Anyone who does operations work will surely sympathize with the concept that “a system during an incident is not under statistical control”. Incidents are often chaotic affairs, and individual events (a chance ordering by the thread scheduler, a responder who happens to remember a recent Slack message, someone with important knowledge happens to be on PTO) can mean the difference between a diagnosis that takes minutes versus one that takes hours.

As John Allspaw likes to say, a large TTR cannot distinguish between a complex incident handled well and a simple incident handled poorly. There are too many factors that can influence TTR to conclude anything useful from the metric alone.

Conclusion

To recap:

  1. When a system is out of control, statistical analysis on system metrics is useless as a signal for improving the system.
  2. Incidents are, by definition, events when the system is out of control.

TTR, in particular, is a metric that only applies when the system is out of control. It’s really just a measure of how long the system was out of control.

Now, this doesn’t mean that we should throw up our hands and say “we can’t do anything to improve our ability to resolve incidents.” It just means that we need to let go of a metrics-based approach.

Think back to Allspaw’s observation: was your recent long incident a complex one handled well or a simple one handled poorly? How would you determine that? What questions would you ask?

Reading the Generalized Isolation Level Definitions paper with Alloy

My last few blog posts have been about how I used TLA+ to gain a better understanding of database transaction consistency models. This post will be in the same spirit, but I’ll be using a different modeling tool: Alloy.

Like TLA+, Alloy is a modeling language based on first-order logic. However, Alloy’s syntax is quite different: defining entities in Alloy feels like defining classes in an object-oriented language, including the ability to define subtypes. It has first class support for relations, which makes it very easy to do database-style joins. It also has a very nifty visualization tool, which can help when incrementally defining a model.

I’m going to demonstrate here how to use it to model and visualize database transaction execution histories, based on the paper Generalized Isolation Level Definitions by Atul Adya, Barbara Liskov and Patrick O’Neil. This is a shorter version of Adya’s dissertation work, which is referenced frequently on the Jepsen consistency models pages.

There’s too much detail in the models to cover in this blog post, so I’m just going to touch on some topics. If you’re interested in more details, all of my models are in my https://github.com/lorin/transactions-alloy repository.

Modeling entities in Alloy

Let’s start with a model with the following entities:

  • objects
  • transactions (which can commit or abort)
  • events (read, write, commit, abort)

The diagram below shows some of the different entities in our Alloy model of the paper.

You can think of the above like a class diagram, with the arrows indicating parent classes.

Objects and transactions

Objects and transactions are very simple in our model.

sig Object {}

abstract sig Transaction {}

sig AbortedTransaction, CommittedTransaction extends Transaction {}

The word sig is a keyword in Alloy that means signature. Above I’ve defined transactions as abstract, which means that they can only be concretely instantiated by subtypes.

I have two subtypes, one for aborted transactions, one for committed transactions.

Events

Here are the signatures for the different events:

abstract sig Event {
    tr: Transaction,
    eo: set Event, // event order (partial ordering of events)
    tnext: lone Event // next event in the transaction
}

// Last event in a transaction
abstract sig FinalEvent extends Event {}

sig Commit extends FinalEvent {}

sig Abort extends FinalEvent {}

sig Write extends Event {
    obj: Object,
    wn : WriteNumber
}

sig Read extends Event {
    sees: Write // operation that did the write
}

The Event signature has three fields:

  • tr – the transaction associated with the event
  • eo – an event ordering relationship on events
  • tnext – the next event in the transaction

Reads and writes each have additional fields.

Reads

A read has one custom field, sees. In our model, a read operation “sees” a specific write operation. We can follow that relationship to identify the object and the value being written.

Writes and write numbers

The Write event signature has two additional fields:

  • obj – the object being written
  • wn – the write number

Following Adya, we model the value of a write by a (transaction, write number) pair. Every time we write an object, we increment the write number. For example, if there are multiple writes to object X, the first write has write number 1, the second has write number 2, and so on.

We could model WriteNumber entities as numbers, but we don’t need full-fledged numbers for this behavior. We just need an entity that has an order defined (e.g., it has to have a “first” element, and each element has to have a “next” element). We can use the ordering module to specify an ordering on WriteNumber:

open util/ordering[WriteNumber] as wo

sig WriteNumber {}

Visualizing

We can use the Alloy visualizer to generate a visualization of our model. To do that, we just need to use the run command. Here’s the simplest way to invoke that command:

run {} 

Here’s what the generated output looks like:

Visualization of a history with the default theme

Yipe, that’s messy! We can clean this up a lot by configuring the theme in the visualizer. Here’s the same graph, with different theme settings. I renamed several things (e.g., “Ta” instead of “AbortedTransaction”), I hid some relationships I didn’t care about (eo), and I showed some relationships as attributes instead of arcs (e.g., tr).

Visualization after customizing the theme

The diagram above shows two transactions (Ta, Tc). Transaction Ta has a read operation (Read) and a write operation (Write0). Transaction Tc has a write operation (Write1).

Constraining the model with facts

The history above doesn’t make much sense:

  • tnext is supposed to represent “next event in the transaction”, but in each transaction, tnext has loops in it
  • Ta belongs to the set of aborted transactions, but it doesn’t have an abort event
  • Tc belongs to the set of committed transactions, but it doesn’t have a commit event

We need to add constraints to our model so that it doesn’t generate nonsensical histories. We do this in Alloy by adding facts.

For example, here are some facts:

// helper function that returns set of events associated with a transaction
fun events[t : Transaction] : set Event {
    tr.t
}

fact "all transactions contain exactly one final event" {
    all t : Transaction | one events[t] & FinalEvent
}

fact "nothing comes after a final event" {
    no FinalEvent.tnext
}

fact "committed transactions contain a commit" {
    all t : CommittedTransaction | some Commit & events[t]
}

fact "aborted transactions contain an abort" {
    all t : AbortedTransaction | some Abort & events[t]
}

Now our committed transactions will always have a commit! However, the facts above aren’t sufficient, as this visualization shows:

I won’t repeat all of the facts here; you can see them in the transactions.als file.

Here’s one last example of a fact, encoded from the Adya paper:

The corresponding fact in my model is:

/**
 * If an event wi (xi:m) is followed by ri (xj) without an intervening event wi (xi:n) in E, xj must be xi:m
 */
fact "transaction must read its own writes" {
    all T : Transaction, w : T.events & Write, r : T.events & Read | ({
            w.obj = r.sees.obj
            w->r in eo
            no v : T.events & Write | v.obj = r.sees.obj and (w->v + v->r) in eo
    } => r.sees = w)
}

With the facts specified, our generated histories no longer look absurd. Here’s an example:

Mind you, this still looks like an incorrect history: we have two transactions (Tc1, Tc0) that commit after reading a write from an aborted transaction (Ta).

Installed versions

The paper introduces the concept of a version of an object. A version is installed by a committed transaction that contains a write.

I modeled versions like this.

open util/ordering[VersionNumber] as vo

sig VersionNumber {}

// installed (committed) versions
sig Version {
    obj: Object,
    tr: one CommittedTransaction,
    vn: VersionNumber
}

Dependencies

Once we have versions defined, we can model the dependencies that Adya defines in his paper. For example, here’s how I defined the directly write-depends relationship, which Adya calls ww in his diagrams.

fun ww[] : CommittedTransaction -> CommittedTransaction {
    { disj Ti, Tj : CommittedTransaction | some v1 : installs[Ti], v2 : installs[Tj] | {
        same_object[v1, v2]
        v1.tr = Ti
        v2.tr = Tj
        next_version[v1] = v2
        }
    }
}

Visualizing counterexamples

Here’s one final example of a visualization. The paper A Critique of ANSI SQL Isolation Levels by Berenson et al. provides formal definitions of different interpretations of the ANSI SQL specification. One of these they call “anomaly serializable (strict interpretation)”.

We can build a model of this interpretation in Alloy. Here’s part of it, just to give you a sense of what it looks like; see my bbg.als file for the complete model:

pred AnomalySerializableStrict {
    not A1
    not A2
    not A3
}

/**
 * A1: w1[x]...r2[x]...(a1 and c2 in any order)
 */
pred A1 {
    some T1 : AbortedTransaction,
         T2 : CommittedTransaction,
         w1: Write & events[T1],
         r2 : Read & events[T2],
         a1 : Abort & events[T1] | {

       w1->r2 in teo
       r2.sees = w1
       // r2 has to happen before T1 aborts
       r2->a1 in teo
    }
}

...

And then we can ask Alloy to check whether AnomalySerializableStrict implies Adya’s definition of serializability (which he calls isolation level PL-3).

Here’s how I asked Alloy to check this:

assert anomaly_serializable_strict_implies_PL3 {
    always_read_most_recent_write => (b/AnomalySerializableStrict => PL3)
}

check anomaly_serializable_strict_implies_PL3
for 8 but exactly 3 Transaction, exactly 2 Object, exactly 1 PredicateRead, exactly 1 Predicate

Alloy tells me that the assertion is invalid, and shows the following counterexample:

This shows a history that satisfies the anomaly serializable (strict) specification, but not Adya’s PL-3. Note that the Alloy visualizer has generated a direct serialization graph (DSG) in the bottom left-hand corner, which contains a cycle.

Predicate reads

This counterexample involves predicate reads, which I haven’t shown yet. Here’s how they’re modeled:

// transactions.als

abstract sig Predicate {}

abstract sig PredicateRead extends Event {
    p : Predicate,
    objs : set Object
}

// adya.als

sig VsetPredicateRead extends PredicateRead {
    vset : set Version
}

sig VsetPredicate extends Predicate {
    matches : set Version
}

A predicate read is a read that returns a set of objects.

In Adya’s model, a predicate read takes as input a version set (a version for each object) and then determines which objects should be included in the read based on whether or not the versions match the predicate:

fact "objects in predicate read are the objects that match in the version set" {
    all pread : VsetPredicateRead |
        pread.objs = (pread.vset & pread.p.matches).obj
}

One more counterexample

We can also use Alloy to show us when a transaction would be permitted by Adya’s PL-3, but forbidden by the broad interpretation of anomaly serializability:

check PL3_implies_anomaly_serializable_broad
for 8 but exactly 3 Transaction, exactly 2 Object, exactly 1 PredicateRead, exactly 1 Predicate

assert PL3_implies_anomaly_serializable_broad {
    always_read_most_recent_write => (PL3 => b/AnomalySerializableBroad)
}

The example above shows the “gnext” relation, which yields a total order across events.

The resulting counterexample is two aborted transactions! Those are trivially serializable, but this history is ruled out by the broad definition, specifically by P3:

Playing with Alloy

I encourage you to try Alloy out. There’s a great Visual Studio Code plugin that will let you execute your Alloy models from VS Code; that’s what I’ve been using. It’s very easy to get started with Alloy, because you can get it to generate visualizations for you out of the simplest models.

For more resources, Hillel Wayne has written a set of Alloy docs that I often turn to. There’s even an entire book on Alloy written by its creator. (Confusingly, though, the book does not have the word Alloy in the name).

Extending MVCC to be serializable, in TLA+

In the previous blog post, we saw how a transaction isolation strategy built on multi-version concurrency control (MVCC) does not implement the serializable isolation level. Instead, it implements a weaker isolation level called snapshot isolation. In this post, I’ll discuss how that MVCC model can be extended in order to achieve serializability, based on work published by Michael Cahill, Uwe Röhm, and Alan Fekete.

You can find the model I wrote in the https://github.com/lorin/snapshot-isolation-tla repo, in the SSI module (source, pdf).

A quick note on conventions

In this post, I denote a read of x=1 as r[x,1]. This means a transaction read the object x and got back a value of 1. As I mentioned in the previous post, you can imagine a read as being the following SQL statement:

SELECT v FROM obj WHERE k='x';

Similarly, I denote a write of y←2 as w[y,2]. This means a transaction wrote the object y with a value of 2. You can imagine this as:

UPDATE obj SET v=2 WHERE k='y';

Finally, I’ll assume that there’s an initial transaction (T0) which sets the values of all of the objects to 0, and has committed before any other transaction starts.

We assume this transaction always precedes all other transactions

Background

The SQL isolation levels and phenomena

The ANSI/ISO SQL standard defines four transaction isolation levels: read uncommitted, read committed, repeatable read, and serializable. The standard defines the isolation levels in terms of the phenomena they prevent. For example, the dirty read phenomenon is when one transaction reads a write done by a concurrent transaction that has not yet committed. Phenomena are dangerous because they may violate a software developer’s assumptions about how the database will behave, leading to software that behaves incorrectly.

Problems with the standard and a new isolation level

Berenson et al. noted that the standard’s wording is ambiguous, and of the two possible interpretations of the definitions, one was incorrect (permitting invalid execution histories) and the other was overly strict (proscribing valid execution histories).

The overly strict definition implicitly assumed that concurrency control would be implemented using locking, and this ruled out valid implementations based on alternate schemes, in particular, multi-version concurrency control. They also proposed a new isolation level: snapshot isolation.

Formalizing phenomena and anti-dependencies

In his PhD dissertation work, Adya introduced a new formalization for reasoning about transaction isolation. The formalism is based on a graph of direct dependencies between transactions.

One type of dependency Adya introduced is called an anti-dependency, which is critical to the difference between snapshot isolation and serializable.

An anti-dependency between two concurrent transactions is when one reads an object and the other writes that object with a different value. For example:

T1 is said to have an anti-dependency on T2: T1 must come before T2 in a serialization:

If T2 is sequenced before T1, then the read will not match the most recent write. Therefore, T1 must come before T2.

In dependency graphs, anti-dependencies are labeled with rw because the transaction which does the read must be sequenced before the transaction that does the write, as shown above.

Adya demonstrated that for an implementation that supports snapshot isolation to generate execution histories that are not serializable, there must be a cycle in the dependency graph that includes an anti-dependency.

Non-serializable execution histories in snapshot isolation

In the paper Making Snapshot Isolation Serializable, Fekete et al. further narrowed the conditions under which snapshot isolation could lead to a non-serializable execution history, by proving the following theorem:

THEOREM 2.1. Suppose H is a multiversion history produced under Snapshot Isolation that is not serializable. Then there is at least one cycle in the serialization graph DSG(H), and we claim that in every cycle there are three consecutive transactions Ti.1, Ti.2, Ti.3 (where it is possible that Ti.1 and Ti.3 are the same transaction) such that Ti.1 and Ti.2 are concurrent, with an edge Ti.1 → Ti.2, and Ti.2 and Ti.3 are concurrent with an edge Ti.2 → Ti.3.

They also note:

By Lemma 2.3, both concurrent edges whose existence is asserted must be anti-dependencies: Ti.1 → Ti.2 and Ti.2 → Ti.3.

This means that one of the following two patterns must always be present in a snapshot isolation history that is not serializable:

Non-serializable snapshot isolation histories must contain one of these as subgraphs in the dependency graph

Modifying MVCC to avoid non-serializable histories

Cahill et al. proposed a modification to MVCC that can dynamically identify potential problematic transactions that could lead to non-serializable histories, and abort them. By aborting these transactions, the resulting algorithm guarantees serializability.

As Fekete et al. proved, under snapshot isolation, cycles can only occur if there exists a transaction which contains an incoming anti-dependency edge and an outgoing anti-dependency edge, which they call pivot transactions.

Pivot transactions shown in red

Their approach is to identify and abort pivot transactions: if an active transaction contains both an outgoing and an incoming anti-dependency, the transaction is aborted. Note that this is a conservative algorithm: some of the transactions that it aborts may have still resulted in serializable execution histories. But it does guarantee serializability.

Their modification to MVCC involves some additional bookkeeping:

  1. Reads performed by each transaction
  2. Which transactions have outgoing anti-dependencies
  3. Which transactions have incoming anti-dependencies

The tracking of reads is necessary to identify the presence of anti-dependencies, since an anti-dependency always involves a read (outgoing dependency edge) and a write (incoming dependency edge).
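
The real algorithm in the paper has more machinery (for example, conflict flags have to outlive recently committed transactions), but the core pivot rule described above looks roughly like this hypothetical Python sketch:

class Txn:
    def __init__(self, name):
        self.name = name
        self.in_conflict = False    # has an incoming anti-dependency
        self.out_conflict = False   # has an outgoing anti-dependency
        self.aborted = False

def record_anti_dependency(reader, writer):
    """reader rw-> writer: the reader read a version that the writer overwrote."""
    reader.out_conflict = True
    writer.in_conflict = True
    # abort any transaction that has just become a pivot
    for t in (reader, writer):
        if t.in_conflict and t.out_conflict and not t.aborted:
            t.aborted = True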

Extending our MVCC TLA+ model for serializability

Adding variables

I created a new module called SSI, which stands for Serializable Snapshot Isolation. I extended the MVCC model to add three variables to implement the additional bookkeeping required by the Cahill et al. algorithm. MVCC already tracks which objects are written by each transaction, but we need to now also track reads.

  • rds – which objects are read by which transactions
  • outc – set of transactions that have outbound anti-dependencies
  • inc – set of transactions that have inbound anti-dependencies

TLA+ is untyped (unless you’re using Apalache), but we can represent type information by defining a type invariant (above, called TypeOkS). Defining this is useful both for the reader, and because we can check that this holds with the TLC model checker.

Changes in behavior: new abort opportunities

Here’s how the Next action in MVCC compares to the equivalent in SSI.

Note: Because extending the MVCC module brings all of the MVCC names into scope, I had to create new names for each of the equivalent actions in SSI. I did this by appending an S (e.g., StartTransactionS, DeadlockDetectionS).

In our original MVCC implementation, reads and commits always succeeded. Now, it’s possible for an attempted read or an attempted commit to result in an abort as well, so we needed an action for this, which I called AbortRdS.

Commits can now also fail, so instead of having a single-step Commit action, we now have a BeginCommit action, which will complete successfully by an EndCommit action, or fail with an abort by the AbortCommit action. Writes can also now abort due to the potential for introducing pivot transactions.

Finding aborts with the model checker

Here’s how I used the TLC model checker to generate witnesses of the new abort behaviors:

Aborted reads

To get the model checker to generate a trace for an aborted read, I defined the following invariant in the MCSSI.tla file:

Then I specified it as an invariant to check in the model checker in the MCSSI.cfg file:

INVARIANT 
    NeverAbortsRead

Because aborted reads can, indeed, happen, the model checker returned an error, with the following error trace:

The resulting trace looks like this, with the red arrows indicating the anti-dependencies.

Aborted commit

Similarly, we can use the model checker to identify scenarios where a commit would fail, by specifying the following invariant:

The checker finds the following violation of that invariant:

While T2 is in the process of committing, T1 performs a read which turns T2 into a pivot transaction. This results in T2 aborting.

Checking serializability using refinement mapping

Just like we did previously with MVCC, we can define a refinement mapping from our SSI spec to our Serializability spec. You can find it in the SSIRefinement module (source, pdf). It’s almost identical to the MVCCRefinement module (source, pdf), with some minor modifications to handle the new abort scenarios.

The main difference is that now the refinement mapping should actually hold, because SSI ensures serializability! I wasn’t able to find a counterexample when I ran the model checker against the refinement mapping, so that gave me some confidence in my model. Of course, that doesn’t prove that my implementation is correct. But it’s good enough for a learning exercise.

Coda: on extending TLA+ specifications

Serializable Snapshot Isolation provides us with a nice example of when we can extend an existing specification rather than create a new one from scratch.

Even so, it’s still a fair amount of work to extend an existing specification. I suspect it would have been less work to take a copy-paste-and-modify approach rather than extending it. Still, I found it a useful exercise in learning how to modify a specification by extending it.

Multi-version concurrency control in TLA+

In a previous blog post, I talked about how we can use TLA+ to specify the serializability isolation level. In this post, we’ll see how we can use TLA+ to describe multi-version concurrency control (MVCC), which is a strategy for implementing transaction isolation. Postgres and MySQL, as well as a host of other databases, use MVCC to implement their repeatable read isolation levels.

MVCC is described as an optimistic strategy because it doesn’t require the use of locks, which reduces overhead. However, as we’ll see, MVCC implementations aren’t capable of achieving serializability.

All my specifications are in https://github.com/lorin/snapshot-isolation-tla.

Modeling MVCC in TLA+

Externally visible variables

We use a similar scheme as we did previously for modeling the externally visible variables. The only difference now is that we are also going to model the “start transaction” operation:

  • op – the operation (start transaction, read, write, commit, abort), modeled as a single letter: {“s”, “r”, “w”, “c”, “a”}
  • arg – the argument(s) to the operation
  • rval – the return value of the operation
  • tr – the transaction executing the operation

The constant sets

There are three constant sets in our model:

  • Obj – the set of objects (x, y,…)
  • Val – the set of values that the objects can take on (e.g., 0,1,2,…)
  • Tr – the set of transactions (T0, T1, T2, …)

I associate the initial state of the database with a previously committed transaction T0 so that I don’t have to treat the initial values of the database as a special case.

The multiversion database

In MVCC, there can be multiple versions of each object, meaning that the database stores multiple values associated with each object. Each of these versions also has information about which transaction created it.

I modeled the database in TLA+ as a variable named db. Here is an invariant that shows the values that db can take on:

It’s a function that maps objects to a set of version records. Each version record is associated with a value and a transaction. Here’s an example of a valid value for db:

Example db where Obj={x,y}
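
Translated into plain Python terms (a hypothetical illustration, not the actual TLA+ value shown above), db might look something like this, with each object mapped to a set of (transaction, value) version records:

db = {
    "x": {("T0", 0), ("T1", 3)},   # the initial version of x plus a version written by T1
    "y": {("T0", 0)},              # only the initial version of y
}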

Playing the home game with Postgres

Postgres’s behavior when you specify the repeatable read isolation level appears to be consistent with the MVCC TLA+ model I wrote, so I’ll use it to illustrate how some of these implementation details play out. As Peter Alvaro and Kyle Kingsbury note in their Jepsen analysis of MySQL 8.0.34, Postgres’s repeatable read isolation level actually implements snapshot isolation, while MySQL’s repeatable read isolation level actually implements …. um … well, I suggest you read the analysis.

I created a Postgres database named tla. Because Postgres defaults to read committed, I changed the default to repeatable read on my database so that it would behave more like my model.

ALTER DATABASE tla SET default_transaction_isolation TO 'repeatable read';

create table obj (
    k char(1) primary key,
    v int
);

insert into obj (k,v) values ('x', 0), ('y', 0);

Starting a transaction: id and visibility

In MVCC, each transaction gets assigned a unique id, and ids increase monotonically.

Transaction id: tid

I modeled this with a function tid that maps transactions to natural numbers. I use a special value called None as the transaction id for transactions that have not started yet.

When a transaction starts, I assign it an id by finding the largest transaction id assigned so far (mxid), and then adding 1. This isn’t efficient, but for a TLA+ spec it works quite nicely:
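A minimal sketch of what this could look like, assuming T0 is assigned id 0 up front so that the set passed to Max is never empty (the real spec may compute the maximum differently):

Max(S) == CHOOSE m \in S : \A n \in S : n <= m

StartTr(t) ==
    LET mxid == Max({tid[u] : u \in Tr} \ {None})
    IN  /\ tid[t] = None                          \* t hasn't started yet
        /\ tid' = [tid EXCEPT ![t] = mxid + 1]    \* assign the next id
        \* (the full action also updates vis and the externally visible variables)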

In Postgres, you can get the ID of the current transaction by using the pg_current_xact_id function. For example:

$ psql tla
psql (17.0 (Homebrew))
Type "help" for help.

tla=# begin;
BEGIN
tla=*# select pg_current_xact_id();
 pg_current_xact_id
--------------------
                822
(1 row)

Visible transactions: vis

We want each transaction to behave as if it is acting against a snapshot of the database from when the transaction started.

We can implement this in MVCC by identifying the set of transactions that have previously committed, and ensuring that our queries only read from writes done by these transactions.

I modeled this with a function called vis which maps each transaction to a set of other transactions. We also want our own writes to be visible, so we include the transaction being started in the set of visible transactions:
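Sketching just the vis update inside the start-transaction action, where Committed(u) is my placeholder for however the spec records that transaction u has committed:

vis' = [vis EXCEPT ![t] = {u \in Tr : Committed(u)} \cup {t}]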

For each snapshot, Postgres tracks the set of committed transactions using three variables:

  1. xmin – the lowest transaction id associated with an active transaction
  2. xmax – (the highest transaction id associated with a committed transaction) + 1
  3. xip_list – the list of active transactions whose ids are less than xmax

In Postgres, you can use the pg_current_snapshot function, which returns xmin:xmax:xip_list:

tla=# SELECT pg_current_snapshot();
 pg_current_snapshot
---------------------
 825:829:825,827

Here’s a visualization of this scenario:

These three variables are sufficient to determine whether a particular version is visible. For more on the output of pg_current_snapshot, check out the Postgres operations cheat sheet wiki.

Performing reads

A transaction does a read using the Get(t, obj) operator. This operator retrieves the visible version with the largest transaction id:
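Here’s a sketch of what such an operator could look like, reusing the guessed val/tr version-record fields from earlier:

Get(t, obj) ==
    LET visible == {ver \in db[obj] : ver.tr \in vis[t]}
    IN  CHOOSE ver \in visible :
            \A other \in visible : tid[other.tr] <= tid[ver.tr]

\* a read action would then return Get(t, obj).val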

Performing writes

Writes are straightforward: they simply add new versions to db. However, if the transaction previously wrote to the same object, that earlier version has to be removed. Here’s part of the action that writes obj with value val for transaction t:
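A sketch of the db update (again assuming the val/tr record fields):

\* drop any earlier version written by t itself, then add the new version
db' = [db EXCEPT ![obj] = {ver \in @ : ver.tr # t} \cup {[val |-> val, tr |-> t]}]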

The lost update problem and how MVCC prevents it

Consider the following pair of transactions. They each write the same value and then commit.

A serializable execution history

This is a serializable history. It actually has two possible serializations: T1,T2 or T2,T1

Now let’s consider another history where each transaction does a read first.

A non-serializable execution history

This execution history isn’t serializable anymore. If you try to sequence these transactions, the read in whichever transaction comes second will return 2, when it should return 3 because of the other transaction’s earlier write.

Serializability is violated: the read returns 2 instead of 3

This is referred to as the lost update problem.

Here’s a concrete example of the lost update problem. Imagine you’re using a record as a counter: you read the value, increment the result by one, and then write it back.

SELECT v FROM obj WHERE k='x';
-- returns 3
UPDATE obj set v=4 WHERE k='x';

Now imagine two of these transactions running concurrently. If neither sees the other’s write, then one of the increments will be lost: you will have missed a count!

MVCC can guard against this by preventing two concurrent transactions from writing to the same object. If transaction T1 has written to a record in an active transaction, and T2 tries to write to the same record, then the database will block T2 until T1 either commits or aborts. If the first transaction commits, the database will abort the second transaction.

You can confirm this behavior in Postgres, where you’ll get an error if you try to write to a record that has previously been written to by a transaction that was active and then committed:

$ psql tla
psql (17.0 (Homebrew))
Type "help" for help.

tla=# begin;
BEGIN
tla=*# update obj set v=1 where k='x';
ERROR:  could not serialize access due to concurrent update
tla=!#

Interestingly, MySQL’s MVCC implementation does not prevent lost updates(!!!). You can confirm this yourself.

Implementing this in our model

In our model, a write is implemented by two actions:

  1. BeginWr(t, obj, val) – the initial write request
  2. EndWr(t, obj, val) – the successful completion of the write

We do not allow the EndWr action to fire if:

  1. There is an active transaction that has written to the same object (here we want to wait until the other transaction commits or aborts)
  2. There is a commit to the same object by a concurrent transaction (here we want to abort)

We also have an action named AbortWr that aborts if a write conflict occurs.
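Sketching the guards on EndWr, with Active(u) and Committed(u) as placeholders for however the actual spec tracks transaction states:

EndWr(t, obj, val) ==
    \* wait: no other active transaction has written to obj
    /\ ~(\E u \in Tr \ {t} : Active(u) /\ \E ver \in db[obj] : ver.tr = u)
    \* no transaction concurrent with t has committed a write to obj (if one has, AbortWr fires instead)
    /\ ~(\E ver \in db[obj] : ver.tr # t /\ Committed(ver.tr) /\ ver.tr \notin vis[t])
    \* (the full action also updates db and the externally visible variables)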

Deadlock!

There’s one problem with the approach above where we block on a concurrent write: the risk of deadlock. Here’s what happens when we run our model with the TLC model checker:

Here’s a diagram of this execution history:

The problem is that T1 wrote x first and T2 wrote y first, and then T1 got blocked trying to write y and T2 got blocked trying to write x. (Note that even though T1 started to write y before T2, T2 completed the write first).

We can deal with this problem by detecting deadlocks and aborting the affected transactions when they happen. We can detect deadlock by creating a graph of dependencies between transactions (just like in the diagram above!) and then looking for cycles:

Here TC stands for transitive closure, which is a useful relation when you want to find cycles. I used one of the transitive closure implementations in the TLA+ examples repo.
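A sketch of how this could look, where TC is the transitive closure operator, and Active(u), Requested(u, obj), and HasWritten(w, obj) are my placeholders for how the spec tracks active transactions, pending write requests, and uncommitted writes:

\* u waits for w if u has a pending write on an object that w has already written
WaitsFor == { <<u, w>> \in Tr \X Tr :
                /\ u # w
                /\ Active(u) /\ Active(w)
                /\ \E obj \in Obj : Requested(u, obj) /\ HasWritten(w, obj) }

\* t is deadlocked if it waits on itself, directly or transitively
Deadlocked(t) == <<t, t>> \in TC(WaitsFor)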

Top-level of the specification

Here’s a top-level view of the specification; you can find the full MVCC specification in the repo (source, pdf):
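The shape is roughly this (the action names and signatures here are my reconstruction from the prose; the real spec may differ in its details):

Next == \E t \in Tr :
            \/ StartTr(t)
            \/ \E obj \in Obj :
                  \/ BeginRd(t, obj)
                  \/ EndRd(t, obj)
            \/ \E obj \in Obj, val \in Val :
                  \/ BeginWr(t, obj, val)
                  \/ EndWr(t, obj, val)
                  \/ AbortWr(t, obj, val)
            \/ Commit(t)
            \/ Abort(t)
            \/ DetectDeadlock(t)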

Note how reads and writes have begin/end pairs. In addition, a BeginWr can end in an AbortWr if there’s a conflict or deadlock as discussed earlier.

For liveness, we can use weak fairness to ensure that read/write operations complete, transactions start, and that deadlock is detected. But for commit and abort, we need strong fairness, because we can have infinite sequences of BeginRd/EndRd pairs or BeginWr/EndWr pairs and Commit and Abort are not enabled in the middle of reads or writes.
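As a sketch, with vars standing for the tuple of all the spec’s variables and the same guessed action signatures as above, the fairness conjunct could look something like this:

Fairness ==
    /\ \A t \in Tr : WF_vars(StartTr(t))
    /\ \A t \in Tr : WF_vars(DetectDeadlock(t))
    /\ \A t \in Tr, obj \in Obj : WF_vars(EndRd(t, obj))
    /\ \A t \in Tr, obj \in Obj, val \in Val : WF_vars(EndWr(t, obj, val))
    /\ \A t \in Tr : SF_vars(Commit(t))
    /\ \A t \in Tr : SF_vars(Abort(t))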

My MVCC spec isn’t serializable

Now that we have an MVCC spec, we can check to see if it implements our Serializable spec. In order to do that check, we’ll need to do a refinement mapping from MVCC to Serializable.

One challenge is that the initial state of the Serializable specification establishes the fate of all of the transactions and what their environments are going to be in the future:

The Init state for the Serializable spec

Adding a delay to the Serializability spec

In our MVCC spec, we don’t know in advance if a transaction will commit or abort. We could use prophecy variables in our refinement mapping to predict these values, but I didn’t want to do that.

What I did instead was to create a new specification, SerializabilityD (source, pdf), that delays these predictions until the second step of the behavior:

I could then do a refinement mapping MVCC ⇒ SerializabilityD without having to use prophecy variables.

Verifying that SerializabilityD actually implements Serializability

Note that it’s straightforward to do the SerializabilityD ⇒ Serializability refinement mapping with prophecy variables. You can find it in SerializabilityDRefinement (source, pdf):

The MVCC ⇒ SerializabilityD mapping

The MVCC ⇒ SerializabilityD refinement mapping is in the MVCCRefinement spec (source, pdf).

The general strategy here is:

  1. Execute MVCC until all of the transactions complete, keeping an execution history.
  2. Use the results of the MVCC execution to populate the SerializabilityD variables
  3. Step through the recorded MVCC execution history one operation at a time

The tricky part is step 2, because we need to find a serialization.

Attempting to find a serialization

Once we have an MVCC execution history, we can try to find a serialization. Here’s the relevant part of the SetFate action that attempts to select the to and benv variables from Serializability that will satisfy serializability:

Checking the refinement mapping

The problem with the refinement mapping is that we cannot always find a serialization. If we try to model check the refinement mapping, TLC will error because it is trying to CHOOSE from an empty set.

The MVCC execution history that TLC finds is a classic example of what’s called write skew. Here’s a visual depiction of this behavior:

A non-serializable execution history that is permitted by MVCC

Neither T1,T2 nor T2,T1 is a valid serialization of this execution history:

If we sequence T1 first, then the r[y,0] read violates the serialization. If we sequence T2 first, then the r[x,0] read violates it.

These constraints are what Adya calls anti-dependencies. He uses the abbreviation rw for short, because the dependency is created by a write from one transaction clobbering a read done by the other transaction, so the write has to be sequenced after the read.

Because snapshot isolation does not enforce anti-dependencies, it generates histories that are not serializable, which means that MVCC does not implement the Serializability spec.

Coda

I found this exercise very useful in learning more about how MVCC works. I had a hard time finding a good source that explained the concepts in enough detail for me to implement it, short of reading through actual implementations like Postgres, which have way too much detail. One useful resource I found was these slides on MVCC by Joy Arulraj at Georgia Tech. Even those didn’t have quite enough detail, and my model isn’t quite identical to what they describe, but they were enough to help me get started.

I also enjoyed using refinement mapping to do validation. In the end, these were the refinement mappings I defined:

I’d encourage you to try out TLA+, but it really helps if you have some explicit system in mind you want to model. I’ve found it very useful for deepening my understanding of consistency models.

The carefulness knob

A play in one act

Dramatis personae

  • EM, an engineering manager
  • TL, the tech lead for the team
  • X, an engineering manager from a different team

Scene 1: A meeting room in an office. The walls are adorned with whiteboards with boxes and arrows.

EM: So, do you think the team will be able to finish all of these features by the end of Q2?

TL: Well, it might be a bit tight, but I think it should be possible, depending on where we set the carefulness knob.

EM: What’s the carefulness knob?

TL: You know, the carefulness knob! This thing.

TL leans over and picks a small box off of the floor and places it on the table. The box has a knob on it with numerical markings.

EM: I’ve never seen that before. I have no idea what it is.

TL: As the team does development, we have to make decisions about how much effort to spend on testing, how closely to hew to explicitly documented processes, that sort of thing.

EM: Wait, aren’t you, like, careful all of the time? You’re responsible professionals, aren’t you?

TL: Well, we try our best to allocate our effort based on what we estimate the risk to be. I mean, we’re a lot more careful when we do a database migration than when we fix a typo in the readme file!

EM: So… um… how good are you at actually estimating risk? Wasn’t that incident that happened a few weeks ago related to a change that was considered low risk at the time?

TL: I mean, we’re pretty good. But we’re definitely not perfect. It certainly happens that we misjudge the risk sometimes. And isn’t every incident, in some sense, a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.

EM: OK, so let’s turn that carefulness knob up to the max, to make sure that the team is as careful as possible. I don’t want any incidents!

TL: Sounds good to me! Of course, this means that we almost certainly won’t have these features done by the end of Q2, but I’m sure that the team will be happy to hear…

EM: What, why???

TL picks up a marker off of the table and walks up to the whiteboard. She draws an x-axis and a y-axis. She labels the x-axis “carefulness” and the y-axis “estimated completion time”.

TL: Here’s our starting point: the carefulness knob is currently set at 5, and we can hit the end of Q2 if we keep it at this setting.

EM: What happens if we turn up the knob?

TL draws an exponential curve.

EM: Woah! That’s no good. Wait, if we turn the carefulness knob down, does that mean that we can go even faster?

TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.

EM: But won’t we also have more incidents at a carefulness setting of 5 than at higher carefulness settings?

TL: Yes, there’s definitely more of a risk that a change that we incorrectly assess as low risk ends up biting us at our default carefulness level. It’s a tradeoff we have to make.

EM: OK, let’s just leave the carefulness knob at the default setting.


Scene 2: An incident review meeting, two and a half months later.

X: We need to be more careful when we make these sorts of changes in the future!

Fin


Coda

It’s easy to forget that there is a fundamental tradeoff between how careful we can be and how much time it will take us to perform a task. This is known as the efficiency-thoroughness trade-off, or ETTO principle.

You’ve probably hit a situation where it’s particularly difficult to automate the test for something, and doing the manual testing is time-intensive: you developed the feature and tested it, but then there was a small issue that you needed to resolve, and then do you go through all of the manual testing again? We make these sorts of time tradeoffs in the small, as individual decisions, but they add up, and we’re always under schedule pressure to deliver.

As a result, we try our best to adapt to the perceived level of risk in our work. The Human and Organizational Performance folks are fond of the visual image of the black line versus the blue line to depict the difference between how the work is supposed to be done and how workers adapt to get their work done.

But sometimes these adaptations fail. And when this happens, inevitably someone says “we need to be more careful”. But imagine if you explicitly asked that person at the beginning of a project about where they wanted to set that carefulness knob, and they had to accept that increasing the setting would increase the schedule significantly. If an incident happened, you could then say to them, “well, clearly you set the carefulness knob too low at the beginning of this project”. Nobody wants to explicitly make the tradeoff between being less careful and having a time estimate that’s seen as excessive. And so the tradeoff gets made implicitly. We adapt as best we can to the risk. And we do a pretty good job at that… most of the time.

Specifying serializability in TLA+

Concurrency is really, really difficult for humans to reason about. TLA+ itself was born out of Leslie Lamport’s frustration with the difficulty of writing error-free concurrent algorithms:

When I first learned about the mutual exclusion problem, it seemed easy and the published algorithms seemed needlessly complicated.  So, I dashed off a simple algorithm and submitted it to CACM.  I soon received a referee’s report pointing out the error.  This had two effects.  First, it made me mad enough at myself to sit down and come up with a real solution.  The result was the bakery algorithm described in [12].  The second effect was to arouse my interest in verifying concurrent algorithms. 

Modeling concurrency control in database systems is a great use case for TLA+, so I decided to use TLA+ to learn more about database isolation. This post is about modeling serializability.

You can find all of the TLA+ models referenced in this post in my snapshot-isolation-tla repo. This post isn’t about snapshot isolation at all, so think of the name as a bit of foreshadowing of a future blog post, which we’ll discuss at the end.

Modeling a database for reasoning about transaction isolation

In relational databases, data is modeled as rows in different tables, where each table has a defined set of named columns, and there are foreign key relationships between the tables.

However, when modeling transaction isolation, we don’t need to worry about those details. For the purpose of a transaction, all we care about is whether any of the columns of a particular row are read or modified. This means we can ignore details about tables, columns, and relations. All we care about are the rows.

The transaction isolation literature talks about objects instead of rows, and that’s the convention I’m going to use. Think of an object like a variable that is assigned a value, and that assignment can change over time. A SQL select statement is a read, and a SQL update statement is a write.

An example of how we’re modeling the database

Note that the set of objects is fixed during the lifetime of the model; it’s only the values that change over time. I’m only going to model reads and writes, but it’s simple enough to extend this model to support creation and deletion: write a tombstone value to model deletion, and a not-yet-created-stone value to model an object that has not yet been created in the database.

I’ll use the notation r[obj, val] to refer to a read operation where we read the object obj and get the value val, and w[obj, val] to refer to a write operation where we write the value val to obj. So, for example, setting x=1 would be w[x, 1], and reading the value of x as 1 would be r[x, 1].

I’m going to use Obj to model the set of objects, and Val to model the set of possible values that objects can take on.

Obj is the set of objects, Val is the set of values that can be assigned to objects

We can model the values of the objects at any point in time as a function that maps objects to values. I’ll call these sorts of functions environments (env for short) since that’s what people who write interpreters call them.

Example of an environment

As an example of the syntax, here’s how we would assert in TLA+ that the variable env is a function that maps elements of the set Obj to elements of the set Val:
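It’s a one-liner:

env \in [Obj -> Val]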

What is serializability?

Here’s how the SQL:1999 standard describes serializability (via the Jepsen serializability page):

The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.

An execution history of reads and writes is serializable if it is equivalent to some other execution history where the committed transactions are scheduled serially (i.e., they don’t overlap in time). Here’s an example of a serializable execution history.

Atul Adya famously came up with a formalism for database isolation levels (including serializability) in his PhD dissertation work, and published this in a paper co-authored by Barbara Liskov (his PhD advisor) and Patrick O’Neil (an author of the original log-structured merge-tree paper and one of the co-authors of the paper A Critique of ANSI SQL Isolation Levels, which pointed out problems in the SQL specification’s definitions of the isolation levels).

Specifying serializability

Adya formalized database isolation levels by specifying dependencies between transactions. However, I’m not going to use Adya’s approach for my specification. Instead, I’m going to use a state-based approach, like the one used by Natacha Crooks, Youer Pu, Lorenzo Alvisi and Allen Clement in their paper Seeing is Believing: A Client-Centric Specification of Database Isolation.

It’s important to remember that a specification is just a set of behaviors (series of state transitions). We’re going to use TLA+ to define the set of all of the behaviors that we consider valid for serializability. Another way to put that is that our specification is the set of all serializable executions.

We want to make sure that if we build an implementation, all of the behaviors permitted by the implementation are a subset of our serializability specification.

Note: Causality is not required

Here’s an example of an execution history that is serializable according to the definition:

This looks weird to us because the write happens after the read: T1 is reading data from the future!

But the definition of serializability places no constraints on the ordering of the transactions; for that, you need a different isolation level: strict serializability. But we’re modeling serializability, not strict serializability, so we allow histories like the one above in our specification.

(I’d say “good luck actually implementing a system that can read events from the future”, but in distributed databases when you’re receiving updates from different nodes at different times, some pretty weird stuff can happen…)

If you’d like to follow along as we go, my Serializable TLA+ model is in the github repo (source, pdf).

Externally visible variables

My specification will generate operations (e.g., reads, writes, commits, aborts). The four externally visible variables in the specification are:

  • op – the operation (read, write, commit, abort), modeled as a single letter: {“r”, “w”, “c”, “a”}
  • arg – the argument(s) to the operation
  • rval – the return value of the operation
  • tr – the transaction executing the operation

Here’s the serializable example from earlier:

The execution history shown above can be modeled as a TLA+ behavior like this:

Initial state of the specification

We need to specify the set of valid initial states. In the initial state of our spec, before any operations are issued, we determine:

  1. which transactions will commit and which will abort
  2. the order in which the transactions will occur
  3. the value of the environment for each committed transaction at the beginning and at the end of its lifetime

This is determined by using three internal variables whose values are set in the initial state:

  • fate – a function which encodes which transactions commit and which abort
  • to – the transaction order
  • benv – the value of the environments at the beginning/end of each transaction

We couldn’t actually implement a system that could predict in advance whether a transaction will commit or abort, but it’s perfectly fine to use these for defining our specification.

The values of these variables are specified like this:

In our initial state, our specification chooses a fate, ordering, and begin/end environments for each transaction, where Orderings is a helper operator:
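Here’s a sketch of the shape of these definitions (the string constants for fate and the exact indexing of benv are my guesses; Cardinality comes from the FiniteSets standard module):

Orderings ==
    \* all ways to arrange the transactions in a sequence: bijections from 1..N to Tr
    LET N == Cardinality(Tr)
    IN  {f \in [1..N -> Tr] : \A i, j \in 1..N : (i # j) => (f[i] # f[j])}

Init ==
    /\ fate \in [Tr -> {"Committed", "Aborted"}]
    /\ to \in Orderings
    /\ benv \in [1..Cardinality(Tr) + 1 -> [Obj -> Val]]  \* one env per position, plus the final one
    \* (plus initialization of op, arg, rval, tr, and the per-transaction environments)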

As an example, consider a behavior with three transactions fated to commit, where the fated transaction order is:

  1. T2
  2. T3
  3. T1

Furthermore, assume the following starting environments for each transaction:

T1: [x=2, y=5, z=3]
T2: [x=0, y=0, z=0]
T3: [x=0, y=1, z=0]
Finally, assume that the final environment state (once T1 completes) is [x=2,y=5,z=1].

We can visually depict the committed transactions like this:

Reads and writes

You can imagine each transaction running in parallel. As long as each transaction’s behavior is consistent with its initial environment, and it ends up with its final environment, the resulting behavior will be serializable. Here’s an example.

Each transaction has a local environment, tenv. If the transaction is fated to commit, its tenv is initialized to its benv at the beginning:

where:

Here’s an example that shows how tenv for transaction T3 varies over time:

benv is fixed, but tenv for each transaction varies over time based on the writes

If the transaction is fated to abort, then we don’t track its environment in tenv, since any read or write is valid.

A valid behavior, as the definition of serializability places no constraints on the reads of an aborted transaction

Actions permitted by the specification

The specification permits the following actions:

  1. commit transaction
  2. abort transaction
  3. read a value
  4. write a value

I’m not modeling the start of a transaction, because it’s not relevant to the definition of serializability. We just assume that all of the transactions have already started.

In TLA+, we specify it like this:
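Roughly like this (the sub-action names and signatures are mine, chosen to match the list above):

Next == \E t \in Tr :
            \/ Commit(t)
            \/ Abort(t)
            \/ \E obj \in Obj : Read(t, obj)
            \/ \E obj \in Obj, val \in Val : Write(t, obj, val)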

Note that there are no restrictions here on the order in which operations happen. Even if the transaction order is [T2, T3, T1], that doesn’t require that the operations from T2 have to be issued before the other two transactions.

Rather, the only constraints for each transaction that will commit are that:

  1. Its reads must be consistent with its initial environment, as specified by benv.
  2. Its local environment must match the benv of the next transaction in the order when it finally commits.

We enforce (1) in our specification by using a transaction-level environment, tenv, for the reads. This environment gets initialized to benv for each transaction, and is updated if the transaction does any writes. This enables each transaction to see its own writes.

We enforce (2) by setting a precondition on the Commit action that it can only fire when tenv for that transaction is equal to benv of the next transaction:
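Here’s a sketch of that precondition, where tstate is my placeholder for the per-transaction state and pos(t) stands for t’s position in the transaction order to:

CanCommit(t) ==
    /\ tstate[t] = "open"
    /\ fate[t] = "Committed"
    /\ tenv[t] = benv[pos(t) + 1]  \* local env matches the next transaction's starting env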

Termination

If all of the transactions have committed or aborted, then the behavior is complete, which is modeled by the Termination sub-action, which just keeps firing and doesn’t change any of the variables:
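Something like this, again with tstate as my placeholder and vars as the tuple of all the spec’s variables:

Termination ==
    /\ \A t \in Tr : tstate[t] \in {"committed", "aborted"}
    /\ UNCHANGED vars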

Liveness

In our specification, we want to ensure that every behavior eventually satisfies the Termination action. This means that all transactions either eventually commit or abort in every valid behavior of the spec. In TLA+, we can describe this desired property like this:
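Using the same tstate placeholder, the property looks something like:

EventuallyDone == <>(\A t \in Tr : tstate[t] \in {"committed", "aborted"})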

The diamond is a temporal operator that means “eventually”.

To achieve this property, we need to specify a liveness condition in our specification. This is a condition of the type “something we want to happen eventually happens”.

We don’t want our transactions to stay open forever.

  1. For transactions that are fated to abort, they must eventually abort
  2. For transactions that are fated to commit, they must eventually commit

We’re going to use weak and strong fairness to specify our liveness conditions; for more details on liveness and fairness, see my post a liveness example in TLA+.

Liveness for aborts

We want to specify that every transaction that is fated to abort eventually aborts. To do this, we can use weak fairness.
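In sketch form:

\A t \in Tr : WF_vars(Abort(t))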

This says that “the Abort action cannot be forever enabled without the Abort action happening”.

Here’s the Abort action.
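A sketch of it (tstate again being my placeholder; the real action presumably also updates arg and rval, which I’ve omitted):

Abort(t) ==
    /\ tstate[t] = "open"
    /\ fate[t] = "Aborted"
    /\ tstate' = [tstate EXCEPT ![t] = "aborted"]
    /\ op' = "a"
    /\ tr' = t
    /\ UNCHANGED <<fate, to, benv, tenv>>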

The abort action is enabled for a transaction t if the transaction is in the open state, and its fate is Aborted.

Liveness for commits

The liveness condition for commit is more subtle. A transaction can only commit if its local environment (tenv) matches the starting environment of the transaction that follows it in transaction order (benv).

Consider two scenarios: one where tenv matches the next benv, and one where it doesn’t:

We want to use fairness to specify that every transaction fated to commit eventually reaches the state of scenario 1 above. Note that scenario 2 is a valid state in a behavior, it’s just not a state from which a commit can happen.

Consider the following diagram:

For every value of tenv[Ti], the number of variables that match the values in benv[i+1] is somewhere between 0 and 5. In the example above, there are two variables that match, x and z.

Note that a transaction can keep doing writes while it is open, so with every step of the specification, tenv can move left or right in the diagram above, with a min of 0 matches and a max of 5.

We need to specify “tenv always eventually moves to the right”. When tenv is at zero, we can use weak fairness to specify that it eventually moves from 0 to 1.

To specify this, I defined a function W(0, 1) which is true when tenv moves from 0 to 1:

Where M(env1, env2) is a count of the number of variables that have the same value:

This means we can specify “tenv cannot forever stay at 0” using weak fairness, like this:
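Here’s a sketch of what these could look like (the real spec’s operators may be parameterized differently; pos(t) is again t’s position in the transaction order):

\* the number of objects on which two environments agree
M(env1, env2) == Cardinality({obj \in Obj : env1[obj] = env2[obj]})

\* action: the match count for transaction t moves from i to j in this step
W(t, i, j) ==
    /\ M(tenv[t], benv[pos(t) + 1]) = i
    /\ M(tenv'[t], benv[pos(t) + 1]) = j

\* weak fairness: the match count cannot sit at 0 forever
\A t \in Tr : WF_vars(W(t, 0, 1))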

We also want to specify that tenv eventually moves from 1 match to 2, and then from 2 to 3, and so on, all of the way from 4 to all 5 matching. And then we also want to say that it eventually goes from all matches to commit.

We can’t use weak fairness for this, because if tenv is at 1, it can also change back to 0. However, the weak fairness of W(0,1) ensures that if it goes from 1 down to 0, it will always eventually go back to 1.

Instead, we need to use strong fairness, which says that “if the action is enabled infinitely often, then the action must be taken”. We can specify strong fairness for each of the steps like this:
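Continuing with my W(t, i, j) sketch from above, that could look something like:

\A t \in Tr, i \in 1..(Cardinality(Obj) - 1) : SF_vars(W(t, i, i + 1))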

Recall that Obj is the set of objects {x, y, z, …}, and Cardinality refers to the size of the set. We also need to specify strong fairness on the commit action, to ensure that a transaction eventually commits if committing is enabled infinitely often:

Now putting it all together, here’s one way to specify the liveness condition, which is conventionally called L.
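Putting the sketched pieces together, L would have roughly this shape:

L ==
    /\ \A t \in Tr : WF_vars(Abort(t))
    /\ \A t \in Tr : WF_vars(W(t, 0, 1))
    /\ \A t \in Tr, i \in 1..(Cardinality(Obj) - 1) : SF_vars(W(t, i, i + 1))
    /\ \A t \in Tr : SF_vars(Commit(t))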

Once again, the complete model is in the github repo (source, pdf).

How do we know our spec is correct?

We can validate our serializable specification by creating a refinement mapping to a sequential specification. Here’s a simple sequential specification for a key-value store, Sequential.tla:
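Here’s a minimal sketch of what such a sequential spec could look like; I’m guessing at the variable names and at the convention that a write returns "ok", so the actual Sequential.tla may differ:

VARIABLES env, op, arg, rval

Read(obj) ==
    /\ op'   = "r"
    /\ arg'  = <<obj>>
    /\ rval' = env[obj]          \* a read returns the current value
    /\ UNCHANGED env

Write(obj, val) ==
    /\ op'   = "w"
    /\ arg'  = <<obj, val>>
    /\ rval' = "ok"
    /\ env'  = [env EXCEPT ![obj] = val]

Next == \E obj \in Obj :
            \/ Read(obj)
            \/ \E val \in Val : Write(obj, val)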

I’m not going to get into the details of the refinement mapping in this post, but you can find it in the SerializabilityRefinement model (source, pdf):

OK, but how do you know that this spec is correct?

It’s turtles all of the way down! This is really the bottom in terms of refinement: I can’t think of an even simpler spec that we can use to validate this one.

However, one thing we can do is specify invariants that we can use to validate the specification, either with the model checker or by proof.

For example, here’s an invariant that checks that each read has an associated write of the same value that happened before it:

where:

But what happens if there’s no initial write? In that case, we don’t know what value the read should return. But we do know that we don’t want to allow two successive reads of the same object to return different values, for example:

r[x,3], r[x,4]

So we can also specify this check as an invariant. I called it SuccessiveReads; you can find it in the MCSequential model (source, pdf).

The value of formalizing the specification

Now that we have a specification for Serializability, we can use it to check if a potential concurrency control implementation actually satisfies this specification.

That was my original plan for this blog post, but it got so long that I’m going to save that for a future blog post. In that future post, I’m going to model multi-version concurrency control (MVCC) and show how it fails to satisfy our serializability spec by having the model checker find a counterexample.

However, in my opinion, the advantage of formalizing a specification is that it forces you to think deeply about what it is that you’re specifying. Finding counter-examples with the model checker is neat, but the real value is the deeper understanding you’ll get.

If you don’t examine what worked, how will you know what works?

This is one of my favorite bits from fellow anglophone Québécois Norm Macdonald:

Norm: not a lung expert

One of the goals I believe that we all share for post-incident work is to improve the system. For example, when I wrote the post Why I don’t like discussing action items during incident reviews, I understood why people would want to focus on action items: precisely because they share this goal of wanting to improve the system (As a side note, Chris Evans of incident.io wrote a response: Why I like discussing action items in incident reviews). However, what I want to write about here is not the discussion of action items, but focusing on what went wrong versus what went right.

“How did things go right?”

“How did things go right?” is a question originally posed by the safety researcher Erik Hollnagel, as part of the safety paradigm that he calls Safety-II. The central idea is that things actually go right most of the time, and if you want to actually improve the system, you need to get a better understanding of how the system functions, which means you need to broaden your focus beyond the things that broke.

You can find an approachable introduction to Safety-II concepts in the EUROCONTROL white paper From Safety-I to Safety-II. Hollnagel’s ideas have been very influential in the resilience engineering community. As an example, check out my former colleague Ryan Kitchens’s talk at SREcon Americas 2019: How Did Things Go Right? Learning More from Incidents.

It’s with this how did things go right lens that I want to talk a little bit about incident review.

Beyond “what went well”

Now, in most incident writeups that I’ve read, there is a “what went well” section. However, it’s typically the smallest section in the writeup, with maybe a few bullet points: there’s never any real detail there.

Personally, I’m looking for details like how an experienced engineer recognized the symptoms enough to get a hunch about where to look next, reducing the diagnostic time by hours. Or how engineers leveraged an operational knob that was originally designed for a different purpose. I want to understand how experts are able to do the work of effectively diagnosing problems, mitigating impact, and remediating the problem.

Narrowly, I want to learn this because I want to get this sort of working knowledge into other people’s heads. More broadly, I want to bring to light the actual work that gets done.

We don’t know how the system works

Safety researchers make a distinction between work-as-imagined and work-as-done. We think we understand how the day-to-day work gets done, but we actually don’t. Not really. To take an example from software, we don’t actually know how people really use the tooling to get their work done; I can confirm this from having been on-call for internal support for development tools in previous jobs. (“You’re using our tool to do what?” is not an uncommon reaction from the on-call person). People do things we never imagined, in both wonderful and horrifying ways (sometimes at the same time!).

We also don’t see all of the ways that people coordinate to get their work done. There are the meetings, the slack messages, the comments on the pull requests, but there’s also the shared understanding, the common knowledge, the stuff that everybody knows that everybody else knows, that enables people to get this work done, while reducing the amount of explicit communication that has to happen.

What’s remarkable is that these work patterns, well, they work. The people in your org are able to get their stuff done, almost all of the time. Some of them may exhibit mastery of the tooling, and others may use the tooling in ways it was never intended to be used, ways that are fundamentally unsafe. But we’re never going to know unless we actually look at how they’re doing their work.

Because how people do their work is how the system works. And if we’re going to propose and implement interventions, it’s very likely that the outcomes of the interventions will surprise us, because these changes might disrupt effective ways of doing work, and people will adapt to those interventions in ways we never anticipated, and in ways we may never even know if we don’t take a look.

Then why use incidents to look at things that go right?

At first glance, it does seem odd to use incidents as the place to examine where work goes well, given that incidents are times where something unquestionably went wrong. It would be wonderful if we could study how work happens when things are going well. Heck, I’d love to see companies have sociologists or anthropologists on staff to study how the work happens at the company. Regrettably, though, incidents are one of the only times when the organization is actually willing to devote resources (specifically, time) to examining work in fine-grained detail.

We can use incidents to study how things go well, but we have to keep a couple of things in mind. One, we need to recognize that the adaptations whose failure led to an incident are usually successful, which is why people developed those adaptations in the first place. Note that just because an adaptation usually works doesn’t mean that it’s a good thing to keep doing: an adaptation could be a dangerous workaround to a constraint, like a third-party system that can’t be changed directly and so must be awkwardly worked around.

Second, we need to look in more detail at incident response that is remarkable, and to remark on it. When incident response goes well, there is impressive diagnostic, coordination, and improvisation work involved in getting the system back to healthy. These are the kinds of skills you want to foster across your organization. If you want to build tools to make this work even better, you should take the time to understand just how this work is done today. Keep this in mind when you’re proposing new interventions. After all, if you don’t examine what worked, how will you know what works?