Who’s afraid of serializability?

Kyle Kingsbury’s Jepsen recently did an analysis of PostgreSQL 12.3 and found that under certain conditions it violated guarantees it makes about transactions, including violations of the serializable transaction isolation level.

I thought it would be fun to use one of his counterexamples to illustrate what serializable means.

Here’s one of the counterexamples that Jepsen’s tool, Elle, found:

In this counterexample, there are two list objects, here named 1799 and 1798, which I’m going to call x and y. The examples use two list operations, append (denoted "a") and read (denoted "r").

Here’s my redrawing of the example. I’ve drawn all operations against x in blue and against y in red. Note that I’m using empty list ([]) instead of nil.

There are two transactions, which I’ve denoted T1 and T2, and each one involves operations on two list objects, denoted x and y. The lists are initially empty.

For transactions that use the serializability isolation model, all of the operations in all of the transactions have to be consistent with some sequential ordering of the transactions. In this particular example, that means that all of the operations have to make sense assuming either:

  • all of the operations in T1 happened before all of the operations in T2
  • all of the operations in T2 happened before all of the operations in T1

Assume order: T1, T2

If we assume T1 happened before T2, then the operations for x are:

      x = [] 
T1:   x.append(2)
T2:   x.read() → []

This history violates the contract of a list: we’ve appended an element to a list but then read an empty list. It’s as if the append didn’t happen!

Assume order: T2, T1

If we assume T2 happened before T1, then the operations for y are:

      y = []
T2:   y.append(4)
      y.append(5)
      y.read() → [4, 5]
T1:   y.read() → []

This history violates the contract of a list as well: we read [4, 5] and then []. It’s as if the values disappeared!
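The two case analyses above can also be done mechanically. Here is a minimal Python sketch (a brute-force check, not how Elle works) that replays every serial order of the transactions against empty lists and verifies each read. The tuple encoding, the function names, and the order of operations inside each transaction are assumptions made for illustration.

```python
from itertools import permutations

# Each operation is (kind, object, value). For "append", value is the element
# appended; for "read", value is the list the transaction observed.
# Assumption: the order of operations within each transaction is illustrative.
T1 = [("append", "x", 2), ("read", "y", [])]
T2 = [("append", "y", 4), ("append", "y", 5),
      ("read", "y", [4, 5]), ("read", "x", [])]

def replay(ops):
    """Replay ops in order against empty lists; True if every read matches."""
    state = {"x": [], "y": []}
    for kind, obj, value in ops:
        if kind == "append":
            state[obj].append(value)
        elif state[obj] != value:
            return False
    return True

def serial_orders(transactions):
    """Return every serial order of the transactions whose replay is consistent."""
    valid = []
    for order in permutations(transactions.items()):
        ops = [op for _, txn in order for op in txn]
        if replay(ops):
            valid.append([name for name, _ in order])
    return valid

print(serial_orders({"T1": T1, "T2": T2}))  # prints []
```

An empty result means no serial order explains the observed reads, which is exactly what the two cases above show by hand. Trying every permutation is only practical for a handful of transactions.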

Kingsbury indicates that this pair of transactions is illegal by annotating the operations with arrows that show required orderings. The "rw" arrow means that the read operation at the tail of the arrow must be ordered before the write operation at the head of the arrow. If the arrows form a cycle, then the example violates serializability: there’s no possible ordering that can satisfy all of the arrows.
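The cycle argument can be sketched in Python as a directed graph check. The two edges below are my reading of this example: T1 read y as empty, so it must come before T2's appends to y, and T2 read x as empty, so it must come before T1's append to x. The function name and edge encoding are hypothetical, and this is only a plain depth-first search, not Elle's implementation.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    color = {node: UNVISITED for node in graph}

    def visit(node):
        color[node] = IN_PROGRESS
        for nxt in graph[node]:
            # Reaching a node that is still in progress means we looped back.
            if color[nxt] == IN_PROGRESS:
                return True
            if color[nxt] == UNVISITED and visit(nxt):
                return True
        color[node] = DONE
        return False

    return any(color[node] == UNVISITED and visit(node) for node in graph)

# Each pair (A, B) means "A must be ordered before B".
rw_edges = [("T1", "T2"), ("T2", "T1")]
print(has_cycle(rw_edges))  # prints True
```

A graph with no cycle can be topologically sorted, and that sorted order is a valid serial order. A cycle rules out every serial order at once.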

Serializability, linearizability, locality

This example is a good illustration of how serializability differs from linearizability. Linearizability is a consistency model that also requires that operations be consistent with a sequential ordering. However, linearizability is only about individual objects, whereas transactions refer to collections of objects.

(Linearizability also requires that if operation A happens before operation B in time, then operation A must take effect before operation B, and serializability doesn’t require that, but let’s put that aside for now).

The counterexample above is a linearizable history: we can order the operations such that they are consistent with the contracts of x and y. Here’s an example of a valid history, which is called a linearization:

x = []
y = []
x.read() → []
x.append(2)
y.read() → []
y.append(4)
y.append(5)
y.read() → [4, 5]

Note how the operations between the two transactions are interleaved. This is forbidden by transactional isolation, but the definition of linearizability does not take into account transactions.
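To check that the interleaving above really does respect the contract of each list, here is a small Python sketch that replays it one operation at a time. The tuple encoding and the function name are assumptions made for illustration.

```python
def is_legal_sequence(history):
    """Replay (object, kind, value) operations; True if every read matches."""
    state = {}
    for obj, kind, value in history:
        current = state.setdefault(obj, [])  # every list starts out empty
        if kind == "append":
            current.append(value)
        elif current != value:
            return False
    return True

# The linearization from the text, with the transactions interleaved.
interleaved = [
    ("x", "read", []),
    ("x", "append", 2),
    ("y", "read", []),
    ("y", "append", 4),
    ("y", "append", 5),
    ("y", "read", [4, 5]),
]
print(is_legal_sequence(interleaved))  # prints True
```

Nothing in this check knows which transaction an operation belongs to, which is why it accepts an ordering that transactional isolation forbids.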

This example demonstrates how it’s possible to have histories that are linearizable but not serializable.

We say that linearizability is a local property whereas serializability is not: by the definition of linearizability, we can identify if a history is linearizable by looking at the histories of the individual objects (x, y). However, we can’t do that for serializability.
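Locality can be sketched in Python by splitting the history by object and searching for a legal order for each object on its own. This sketch ignores the real-time ordering requirement that the parenthetical note above sets aside, and the names and encoding are hypothetical.

```python
from itertools import permutations

def legal(ops):
    """True if a sequence of (kind, value) ops respects the list contract."""
    state = []
    for kind, value in ops:
        if kind == "append":
            state.append(value)
        elif state != value:
            return False
    return True

def per_object_orders(history):
    """For each object, find some legal order of its operations, or None."""
    by_object = {}
    for obj, kind, value in history:
        by_object.setdefault(obj, []).append((kind, value))
    return {
        obj: next((list(p) for p in permutations(ops) if legal(p)), None)
        for obj, ops in by_object.items()
    }

# The six operations from the counterexample, listed transaction by transaction.
history = [
    ("x", "append", 2), ("y", "read", []),              # T1
    ("y", "append", 4), ("y", "append", 5),
    ("y", "read", [4, 5]), ("x", "read", []),           # T2
]
for obj, order in per_object_orders(history).items():
    print(obj, order)
```

Each object gets a legal order when examined in isolation. The serializability violation only shows up when the operations are tied back to their transactions, which a per-object view cannot see.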

This mess we’re in

Most real software systems feel “messy” to the engineers who work on them. I’ve found that software engineers tend to fall into one of two camps on the nature of this messiness.

Camp 1: Problems with the design

One camp believes that the messiness is primarily related to sub-optimal design decisions. These design decisions might simply be poor decisions, or they might be because we aren’t able to spend enough time getting the design right.

My favorite example of this school of thought can be found in the text of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology:

The best way to cope with complexity is to avoid it. Programs that do a lot for us are inherently complex. But instead of surrendering to complexity, we must control it. We must keep our systems as simple as possible. Having two different kinds of drag-and-drop behavior is more complicated than having just one. And that is needless complexity.

We must keep our systems simple enough so we can understand them.
And the fundamental tool for doing this is the logic of mathematics.

Leslie Lamport, The Future of Computing: Logic or Biology

Camp 2: The world is messy

Engineers in the second camp believe that reality is just inherently messy, and that mess ends up being reflected in software systems that have to model the world. Rich Hickey describes this in what he calls “situated programs” (emphasis mine)

And they [situated programs] deal with real-world irregularity. This is the other thing I think that’s super-critical, you know, in this situated programming world. It’s never as elegant as you think, the real-world.

And I talked about that scheduling problem of, you know, those linear times, somebody who listens all day, and the somebody who just listens while they’re driving in the morning and the afternoon. Eight hours apart there’s one set of people and, then an hour later there’s another set of people, another set. You know, you have to think about all that time. You come up with this elegant notion of multi-dimensional time and you’d be like, “oh, I’m totally good…except on Tuesday”. Why? Well, in the U.S. on certain kinds of genres of radio, there’s a thing called “two for Tuesday”. Right? So you built this scheduling system, and the main purpose of the system is to never play the same song twice in a row, or even pretty near when you played it last. And not even play the same artist near when you played the artist, or else somebody’s going to say, “all you do is play Elton John, I hate this station”.

But on Tuesday, it’s a gimmick. “Two for Tuesday” means, every spot where we play a song, we’re going to play two songs by that artist. Violating every precious elegant rule you put in the system. And I’ve never had a real-world system that didn’t have these kinds of irregularities. And where they weren’t important.

Rich Hickey, Effective Programs: 10 Years of Clojure

It will come as no surprise to readers of this blog that I fall into the second camp. I do think that sub-optimal design decisions also contribute to messiness in software systems, but I think those are inevitable because unexpected changes and time pressures are inescapable. This is the mess we’re in.

Asking the right “why” questions

In the [Cognitive Systems Engineering] terminology, it is more important to understand what a joint cognitive system (JCS) does and why it does it, than to explain how it does it. [emphasis in the original]

Erik Hollnagel & David D. Woods, Joint Cognitive Systems: Foundations of Cognitive Systems Engineering, p22

In my previous post, I linked to a famous essay by John Allspaw: The Infinite Hows (or, the Dangers Of The Five Whys). The main thrust of Allspaw’s essay can be summed up in this five word excerpt:

“Why?” is the wrong question.

As illustrated by the quote from Hollnagel & Woods at the top of this post, it turns out that cognitive systems engineering (CSE) is very big on answering “why” questions. Allspaw’s perspective on incident analysis is deeply influenced by research from cognitive systems engineering. So what’s going on here?

It turns out that the CSE folks are asking different kinds of “why” questions than the root cause analysis (RCA) folks. The RCA folks ask why did this incident happen? The CSE folks ask why did the system adopt the sorts of behaviors that contributed to the incident?

Those questions may sound similar, but they start from opposite assumptions. The RCA folks start with the assumption that there’s some sort of flaw in the system, a vulnerability that was previously unknown, and then base their analysis on identifying what that vulnerability was.

The CSE folks, on the other hand, start with the assumption that behaviors exhibited by the system developed through adaptation to existing constraints. The “why” question here is “why is this behavior adaptive? What purpose does it serve in the system?” Then they base the analysis on identifying attributes of the system such as constraints and goal conflicts that would explain why this behavior is adaptive.

This is one of the reasons why the CSE folks are so interested in incidents to begin with: because they can expose these kinds of constraints and conflicts that are part of the context of a system. It’s similar to how psychologists use optical illusions to study the heuristics that the human visual system employs: you look at the circumstances under which a system fails to get some insight into how it normally functions as well as it does.

“Why” questions can be useful! But you’ve got to ask the right ones.

The inevitable double bind

Here are three recent COVID-19 news stories:

The first two stories are about large organizations (the FDA, large banks) moving too slowly in order to comply with regulations. The third story is about the risks of the FDA moving too quickly.

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

In hindsight, it’s easy to identify who wasn’t quick enough and who wasn’t careful enough. But if you want to understand how agents make these decisions, you need to understand the multiple pressures that agents experience, because they are trading these off. You also need to understand what information they had available at the time, as well as their previous experiences. I thought this observation about the behavior of the banks was particularly insightful:

But it does tell a more general story about the big banks, that they have invested so much in at least the formalities of compliance that they have become worse than small banks at making loans to new customers.

Matt Levine

Reactions to previous incidents have unintended consequences for the future. The conclusion to draw here isn’t that “the banks are now overregulated”. Rather, it’s that double binds are unavoidable: we can’t eliminate them by adding or removing regulations. There’s no perfect knob setting where they don’t happen anymore.

Once we accept that double binds are inevitable, we can shift our focus away from just adjusting the knob and towards work that will prepare agents to make more effective decisions when they inevitably encounter the next double bind.

Rebrand: Surfing Complexity

You can’t stop the waves, but you can learn to surf.

Jon Kabat-Zinn

When I started this blog, my primary interests were around software engineering and software engineering research, and that’s what I mostly wrote about. Over time, I became more interested in complex systems that include software, sometimes referred to as socio-technical systems. That attracted me initially to chaos engineering, and, more recently, to learning from incidents and resilience engineering.

To reflect the more recent focus on complex systems, I decided to rebrand this blog Surfing Complexity. The term has two inspirations: the quote from Jon Kabat-Zinn at the top of this post, and the book title Surfing Uncertainty by Andy Clark. I also gave the blog a new domain name: surfingcomplexity.blog.

In my experience, software engineers recognize the challenge of complexity, but their primary strategy for addressing complexity is to try to reduce it (and, when they don’t have the resources to do so, to complain about it). By contrast, the resilience engineering community recognizes that complexity is inevitable in the adaptive universe, and seeks to understand what we can do to navigate complexity more effectively.

While I think that we should strive to reduce complexity where possible, I also believe that most strategies for increasing the robustness or safety of a system will ultimately lead to an increase in complexity. As an example, consider an anti-lock braking system in a modern car. It’s a safety feature, but it clearly increases the complexity of the automobile.

I really like Kabat-Zinn’s surfing metaphor, because it captures the idea that complexity is inevitable: getting rid of it isn’t an option. However, we can get better at dealing with it.

Rehabilitating “you can’t manage what you can’t measure”

There’s a management aphorism that goes “you can’t manage what you can’t measure”. It is … controversial. W. Edwards Deming, for example, famously derided it. But I think there are two ways to interpret this quote, and they have very different takeaways.

One way to read this is to treat the word measure as a synonym for quantify. When John Allspaw rails against aggregate metrics like mean time to resolve (MTTR), he is siding with Deming in criticizing the idea of relying solely on aggregate, quantitative metrics for gaining insight into your system.

But there’s another way to interpret this aphorism, and it depends on an alternate interpretation of the word measure. I think that observing any kind of signal is a type of measurement. For example, if you’re having a conversation with someone, and you notice something in their tone of voice or their facial expression, then you’re engaged in the process of measurement. It’s not quantitative, but it represents information you’ve collected that you didn’t have before.

By generalizing the concept of measurement, I would recast this aphorism as: what you aren’t aware of, you can’t take into account.

This may sound like a banal observation, but the subtext here is “… and there’s a lot you aren’t taking into account.” A lot of the work that is happening in your organization, your system, is largely invisible. And that work is keeping things up and running.

The concept that there’s invisible work happening that’s creating your availability is at the heart of the learning from incidents in software movement. And it isn’t obvious, even though we all experience it directly.

This invisible work is valuable in the sense that it’s contributing to keeping your system healthy. But the fact that it’s invisible is dangerous because it can’t be taken into account when decisions are made that change the system. For example, I’ve seen technological changes that have made it more difficult for the incident management team to diagnose what’s happening in the system. The teams who introduced those changes were not aware of how the folks on the incident management team were doing diagnostic work.

In particular, one of the dangers of an action-item-oriented approach to incident reviews is that you may end up introducing a change to the system that disrupts this invisible work.

Take the time to learn about the work that’s happening that nobody else sees. Because if you don’t see it, you may end up breaking it.

An old lesson about a fish

Back when I was in college [1], I was required to take several English courses. I still remember an English professor handing out an excerpt from the book ABC of Reading by Ezra Pound [2]:

No man is equipped for modern thinking until he has understood the anecdote of Agassiz and the fish:

A post-graduate student equipped with honours and diplomas went to Agassiz to receive the final and finishing touches. The great man offered him a small fish and told him to describe it.

Post-Graduate Student: ‘That’s only a sunfish.’

Agassiz: ‘I know that. Write a description of it.’

After a few minutes the student returned with the description of the Ichthus Heliodiplodokus, or whatever term is used to conceal the common sunfish from vulgar knowledge, family of Heliichtherinkus, etc., as found in textbooks of the subject.

Agassiz again told the student to describe the fish.

The student produced a four-page essay. Agassiz then told him to look at the fish. At the end of three weeks the fish was in an advanced state of decomposition, but the student knew something about it.

I remember my eighteen-year-old self hating this anecdote. It sounded like Agassiz just wasted the graduate student’s time, leaving him with nothing but a rotting fish for his troubles. As an eventual engineering major, I had no interest in the work of analyzing texts that was required in English courses. I thought such analysis was a waste of time.

It would take about two decades for the lesson of this anecdote to sink into my brain. The lesson I eventually took away from it is that there is real value in devoting significant effort to close study of an object. If you want to really understand something, a casual examination just won’t do.

To me, this is the primary message of the learning from incidents in software movement. Doing an incident investigation, like studying the fish, will take time. Completing an investigation may take weeks, even months. Keep in mind, though, that you aren’t really studying an incident at all: you’re studying your system through the lens of an incident. And, even though the organization will have long since moved on, once you’re done, you’ll know something about your system.

[1] Technically it was CEGEP, but nobody outside of Quebec knows what that is.

[2] Pound is likely retelling an anecdote originally told by either Nathaniel Shaler or Samuel Hubbard Scudder, both of whom were students of Agassiz.

There is no escape from the adaptive universe

If I had to pick just one idea from the field of resilience engineering that has influenced me the most, it would be David Woods’s notion of the adaptive universe. In his 2018 paper titled The theory of graceful extensibility: basic rules that govern adaptive systems, Woods describes the two assumptions [1] of the adaptive universe:

  1. Resources are always finite.
  2. Change is ongoing.

That’s it! Just two simple assertions, but so much flows from them.

At first glance, the assumptions sound banal. Nobody believes in infinite resources! Nobody believes that things will stop changing! Yet, when we design our systems, it’s remarkable how often we don’t take these into account.

The future is always going to involve changes to our system that we could not foresee at design time, and those changes are always going to be made in a context where we are limited in resources (e.g., time, headcount) and hence will have to make tradeoffs. Instead, we tell ourselves a story about how next time, we’re going to build it right. But, we aren’t, because the next time we’ll also be resource constrained, and so we’ll have to make some decisions for reasons of expediency. And the next time, the system will also change in ways we could never have predicted, invalidating our design assumptions.

Because we are forever trapped in the adaptive universe.

[1] If you watch Woods’s online resilience engineering short course, which precedes this paper, he mentions a third property: surprise is fundamental. But I think this property is a consequence of the first two assumptions rather than requiring an additional assumption, and I suspect that’s why he doesn’t mention it as an assumption in his 2018 paper.

There is no escape from Ashby’s Law

[V]ariety can destroy variety

W. Ross Ashby

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.

Hamlet (1.5.167-8)

In his book An Introduction to Cybernetics, published in 1956, the English psychiatrist W. Ross Ashby proposed the Law of Requisite Variety. His original formulation isn’t easy to extract into a blog post, but the Principia Cybernetica website has a pretty good definition:

The larger the variety of actions available to a control system, the larger the variety of perturbations it is able to compensate.

Like many concepts in systems thinking, the Law of Requisite Variety is quite abstract, which makes it hard to get a handle on. Here’s a concrete example I find useful for thinking about it.

Imagine you’re trying to balance a broomstick on your hand:

This is an inherently unstable system, and so you have to keep moving your hand around to keep the broomstick balanced, but you can do it. You’re acting as a control system to keep the broomstick up.

If you constrain the broomstick to have only one degree of freedom, you have what’s called the inverted pendulum problem, which is a classic control systems problem. Here’s a diagram:

From the Wikipedia Inverted pendulum article

The goal is to move the cart in order to keep the pendulum balanced. If you have sensor information that measures the tilt angle, θ, you can use that data to build a control system to push on the cart in order to keep the pendulum from falling over. Information about the tilt angle is part of the model that the control system has about the physical system it’s trying to control.

Now, imagine that the pendulum isn’t constrained to only one degree of freedom, but it now has two degrees of freedom: this is the situation when you’re balancing a broom on your hand. There are now two tilt angles to worry about: it can fall towards/away from your body, or it can fall left/right.

You can’t use the original inverted pendulum control system to solve this problem, because that only models one of the tilt angles. Imagine you can only move your hand forward and back, but not left or right. Because of this, the control system won’t be able to correct for the other angle: the pendulum will fall over.

The problem is that the new system can vary in ways that the control system wasn’t designed to handle: it can get into states that aren’t modeled by the original system.
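The broomstick argument can be illustrated with a toy simulation. This Python sketch uses a linearized pendulum model on each axis and a simple proportional-derivative controller. The model, the gains, the time step, and all names are assumptions chosen for illustration, not a faithful physical model.

```python
import math

G_OVER_L = 9.81      # gravity / length, assuming a 1 m pendulum
KP, KD = 30.0, 8.0   # hypothetical proportional and derivative gains
DT = 0.001           # integration time step in seconds

def simulate(controlled_axes, steps=3000, initial_tilt=0.05):
    """Simulate tilt on two axes; only axes in controlled_axes are corrected.

    Returns (final tilt per axis, whether the pendulum fell over).
    """
    # state maps axis -> (tilt angle in radians, angular velocity)
    state = {axis: (initial_tilt, 0.0) for axis in ("forward_back", "left_right")}
    for _ in range(steps):
        for axis in list(state):
            theta, omega = state[axis]
            # The controller can only push along axes it models.
            push = KP * theta + KD * omega if axis in controlled_axes else 0.0
            alpha = G_OVER_L * theta - push  # linearized: theta'' = (g/L)*theta - push
            state[axis] = (theta + omega * DT, omega + alpha * DT)
        if any(abs(theta) > math.pi / 2 for theta, _ in state.values()):
            return {axis: theta for axis, (theta, _) in state.items()}, True
    return {axis: theta for axis, (theta, _) in state.items()}, False

print(simulate({"forward_back", "left_right"})[1])  # prints False
print(simulate({"forward_back"})[1])                # prints True
```

With both axes controlled, the tilt decays toward zero. With only the forward/back axis controlled, the left/right tilt grows unchecked until the pendulum falls, even though the controller is doing its job perfectly on the axis it knows about.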

This is what the Law of Requisite Variety is about: if you want to build a control system, the control system needs to be able to model every possible state that the system being controlled can get into: the state space of the control system has to be at least as large as the state space of the physical system. If it isn’t, then the physical system can get into states that the control system won’t be able to deal with.
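For a more abstract view, here is a Python sketch loosely in the spirit of the table-based game Ashby uses to introduce the law: a disturbance picks a row, the regulator picks a response, and the pair determines an outcome. The outcome rule `(d - r) % num_disturbances` is a toy assumption chosen so that no single response maps two disturbances to the same outcome. The function name is hypothetical.

```python
from itertools import product

def fewest_outcomes(num_disturbances, num_responses):
    """Brute-force the smallest set of outcomes any regulator strategy can reach.

    A strategy assigns one response to each disturbance. The outcome of
    disturbance d with response r is (d - r) % num_disturbances (a toy rule).
    """
    best = num_disturbances
    for strategy in product(range(num_responses), repeat=num_disturbances):
        outcomes = {(d - r) % num_disturbances for d, r in enumerate(strategy)}
        best = min(best, len(outcomes))
    return best

for responses in (1, 2, 3, 6):
    print(responses, fewest_outcomes(6, responses))
```

With six possible disturbances, a regulator with one response is stuck with six distinct outcomes, and only a regulator with six responses can hold the outcome to a single value. In this toy setup the regulator needs as much variety as the disturbances to fully cancel them out.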

Bringing this into the software world: when we build infrastructure software, we’re invariably building control systems. These control systems can only handle the situations that they are designed for. We invariably run into trouble when the systems we build get into states that the designer never imagined happening. A fun example of this case is some pathological traffic pattern.

The fundamental problem with building software control systems is that we humans aren’t capable of imagining all possible states that the systems being controlled can get into. In particular, we can’t imagine the changes that people are going to make in the future that will create new states that we simply could not ever imagine needing to handle. And so, our control systems will invariably be inadequate, because they won’t be able to handle these situations. The variety of the world exceeds the variety our control systems are designed to handle.

Fortunately, we humans are capable of conceiving of a much wider variety of system states than the systems we build. That’s why, when our software-based control systems fail and the humans get paged in, the humans are eventually able to make sense of what state the system has gotten itself into and put things right.

Even we humans are not exempt from Ashby’s Law. But we can revise our (mental) models of the system in ways that our software-based control systems cannot, and that’s why we can deal effectively with incidents. Because of how we can update our models, we can adapt where software cannot.

The downsides of expertise

I’m a strong advocate of the value of expertise to a software organization. I’d even go so far as to say that expertise is a panacea.

Despite the value of expertise, there are two significant obstacles that keep organizations from leveraging expertise as effectively as possible.

Expertise is expensive to acquire

Expertise is expensive for an organization to acquire. Becoming an expert requires experience, which takes time and effort. An organization can hire for some forms of expertise, but no organization can hire someone who is already an expert in the org’s socio-technical system. And a lot of the value for an organization is having expertise in the behaviors of the local system.

You can transfer expertise from one person to another, but that also takes time and effort, and you need to put mechanisms in place to support this. Apprenticeship and coaching are two traditional methods of expertise transfer, but they aren’t typically present in software organizations. I’m an advocate of learning from incidents as a medium for skill transfer, but that requires its own expertise for doing incident investigation in a way that supports skill transfer.

Alas, you can’t transfer expertise from a person to a tool, as John Allspaw notes, so we can’t take a shortcut by acquiring sophisticated tooling. AI researchers tried building such expert systems in the 1980s, but these efforts failed.

Concentrated expertise is dangerous

Organizations tend to foster local experts: a small number of individuals who have a lot of expertise with aspects of the local system. These people are enormously valuable to organizations (they’re often very helpful during incidents), but they represent single points of failure. If these individuals happen to be out of the office during a critical incident, or if they leave the company, it can be very costly to the organization. My former colleague Nora Jones calls this the islands of knowledge problem.

What’s worse, high concentration of expertise can become a positive feedback loop. If there’s a local expert, then other individuals may use the expert as a crutch, relying on the expert to solve the harder problems and never putting in the effort to develop their own expertise.

To avoid this problem, we need to develop the expertise in more people within the organization, which, as mentioned earlier, is expensive.

I continue to believe that it’s worth it.