Modeling B-trees in TLA+

I’ve been reading Alex Petrov’s Database Internals to learn more about how databases are implemented. One of the topics covered in the book is a data structure known as the B-tree. Relational databases like Postgres, MySQL, and SQLite use B-trees as the data structure for storing data on disk.

I was reading along with the book as Petrov explained how B-trees work, and nodding my head. But there’s a difference between feeling like you understand something (what the philosopher C. Thi Nguyen calls clarity) and actually understanding it. So I decided to model B-trees using TLA+ as an exercise in understanding it better.

TLA+ is traditionally used for modeling concurrent systems: Leslie Lamport originally developed it to help him reason about the correctness of concurrent algorithms. However, TLA+ works just fine for sequential algorithms as well, and I’m going to use it here to model sequential operations on B-trees.

What should it do? Modeling a key-value store

Before modeling the B-tree itself, I wanted to create a description of its input/output behavior. Here I’m going to use the B-tree to implement a key-value store, where keys must be unique and the key type must be sortable.

The key-value store will have four operations:

  1. Get a value by key (key must be present)
  2. Insert a new key-value pair (key must not be present)
  3. Update the value of an existing pair (key must be present)
  4. Delete a key-value pair by key. (It is safe to delete a key that is not present.)

In traditional programming languages, you can define interfaces with method signatures that specify the types of the inputs and outputs. For example, here’s how we can specify an interface for the key-value store in Java.

interface KVStore<K extends Comparable<K>,V> {
 V get(K key) throws MissingKeyException;
 void insert(K key, V val) throws DuplicateKeyException;
 void update(K key, V val) throws MissingKeyException;
 void delete(K key);
}

However, Java interfaces are structural, not behavioral: they constrain the shape of the inputs and outputs. What we want is to fully specify what the outputs are based on the history of all previous calls to the interface. Let’s do that in TLA+.

Note that TLA+ has no built-in concept of a function call. If you want to model a function call, you have to decide how to model it yourself, using TLA+’s basic building block: variables.

I’m going to model function calls using three variables: op, args, and ret:

  • op – the name of the operation (function) to execute (e.g., get, insert, update, delete)
  • args – a sequence (list) of arguments to pass to the function
  • ret – the return value of the function

We’ll model a function call by setting the op and args variables. We’ll also set the ret variable to a null-like value (I like to call this NIL).

Modeling an insertion request

For example, here’s how we’ll model making an “insert” call (I’m using ellipsis … to indicate that this isn’t the full definition), using a definition called InsertReq (the “Req” is short for request).

InsertReq(key, val) ==
    /\ op' = "insert"
    /\ args' = <<key, val>>
    /\ ret' = NIL
    /\ ...

Here’s a graphical depiction of InsertReq, when passed the argument 5 for the key and X for the value. (Assume X is a constant defined elsewhere.)

The diagram above shows two different states. In the first state, op, args, and ret are all set to NIL. In the second state, the values of op and args have changed. In TLA+, a definition like InsertReq is called an action.

Modeling a response: tracking stored keys and values

We’re going to define an action that completes the insert operation, which I’ll call InsertResp. Before we do that, we need to cover some additional variables. The op, args, and ret variables model the inputs and outputs, but there is also some internal state we need to track.

The insert function only succeeds when the key isn’t already present. This means we need to track which keys have been stored. We’re also going to need to track which values have been stored, since the get operation will need to retrieve those values.

I’m going to use a TLA+ function to model which keys have been inserted into the store, and their corresponding values. I’m going to call that function dict, for dictionary, since that’s effectively what it is.

I’m going to model this so that the function is defined for all possible keys: if a key is not in the store, it will map to a special constant called MISSING.

We need to define the set of all possible keys and values as constants, as well as our special constant MISSING, and our other special constant, NIL.

This is also a good time to talk about how we need to initialize all of our variables. We use a definition called Init to define the initial values of all of the variables. We initialize op, args, and ret to NIL, and we initialize dict so that all of the keys map to the special value MISSING.

CONSTANTS Keys, Vals, MISSING, NIL
ASSUME MISSING \notin Vals

Init ==
    /\ op = NIL
    /\ args = NIL
    /\ ret = NIL
    /\ dict = [k \in Keys |-> MISSING]
    /\ ...

We can now define the InsertResp action, which updates the dict variable and the return variable.

I’ve also defined a couple of helpers called Present and Absent to check whether a key is present/absent in the store.

Present(key) == dict[key] \in Vals
Absent(key) == dict[key] = MISSING

InsertResp == LET key == args[1]
                  val == args[2] IN
       /\ op = "insert"
       /\ dict' = IF Absent(key)
                  THEN [dict EXCEPT ![key] = val]
                  ELSE dict
       /\ ret' = IF Absent(key) THEN "ok" ELSE "error"
       /\ ...

The TLA+ syntax around functions is a bit odd:

[dict EXCEPT ![key] = val]

The above means “a function that is identical to dict, except that it maps key to val”.
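
For example, here’s a small, hypothetical illustration of EXCEPT (not part of the kvstore spec, and assuming the Naturals module is in scope):

f == [k \in 1..3 |-> "old"]
g == [f EXCEPT ![2] = "new"]
\* g maps 1 |-> "old", 2 |-> "new", and 3 |-> "old"; f itself is unchanged.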

Here’s a diagram of a complete insert operation. In the diagram, I represent a function as a set of ordered pairs, and I only explicitly show the pair that changes.

What action can come next?

As shown in the previous section, the TLA+ model I’ve written requires two state transitions to implement a function call. In addition, my model is sequential, which means that I’m not allowing a new function call to start until the previous one has completed. This means I need to track whether or not a function call is currently in progress.

I use a variable called state, which I set to “ready” when there are no function calls currently in progress. This means that we need to restrict actions that initiate a call, like InsertReq, to when the state is ready:

InsertReq(key, val) ==
    /\ state = "ready"
    /\ op' = "insert"
    /\ args' = <<key, val>>
    /\ ret' = NIL
    /\ state' = "working"
    /\ ...

Similarly, we don’t want a response action like InsertResp to fire unless the corresponding request has fired first, so we need to ensure both the op and the state match what we expect:

Present(key) == dict[key] \in Vals
Absent(key) == dict[key] = MISSING

InsertResp == LET key == args[1]
                  val == args[2] IN
       /\ op = "insert"
       /\ state = "working"
       /\ dict' = IF Absent(key)
                  THEN [dict EXCEPT ![key] = val]
                  ELSE dict
       /\ ret' = IF Absent(key) THEN "ok" ELSE "error"
       /\ state' = "ready"
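
The other operations follow the same request/response pattern. For example, here’s a sketch of what the get actions might look like (this is my reconstruction following the pattern above, not the exact definitions from kvstore.tla):

\* Sketch only: initiate a get of the value associated with key
GetReq(key) ==
    /\ state = "ready"
    /\ op' = "get"
    /\ args' = <<key>>
    /\ ret' = NIL
    /\ state' = "working"
    /\ UNCHANGED dict

\* Sketch only: complete the get, returning the stored value
\* (or an error value if the key happens to be absent)
GetResp == LET key == args[1] IN
    /\ op = "get"
    /\ state = "working"
    /\ ret' = IF Present(key) THEN dict[key] ELSE "error"
    /\ state' = "ready"
    /\ UNCHANGED <<op, args, dict>>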

You can find the complete behavioral spec for my key-value store, kvstore.tla, on GitHub.

Modeling a B-tree

A B-tree is similar to a binary search tree, except that each non-leaf (inner) node has many children, not just two. In addition, the values are only stored in the leaves, not in the inner nodes. (This is technically a B+ tree, but everyone just calls it a B-tree).

Here’s a diagram that shows part of a B-tree.

There are two types of nodes, inner nodes and leaf nodes. The inner nodes hold keys and pointers to other nodes. The leaf nodes hold keys and values.

For an inner node, each key k points to a tree that contains keys that are strictly less than k. In the example above, the key 57 in the root points to a tree that contains keys less than 57, the key 116 points to a tree with keys between 57 (inclusive) and 116 (exclusive), and so on.

Each inner node also has an extra pointer, which points to a tree that contains keys greater than or equal to the largest key in the node. In the example above, the root’s extra pointer points to a tree that contains keys greater than or equal to 853.

The leaf nodes contain key-value pairs. One leaf node is shown in the example above, where key 19 corresponds to value A, key 21 corresponds to value B, etc.

Modeling the nodes and relations

If I were implementing a B-tree in a traditional programming language, I’d represent the nodes as structs. Now, while TLA+ does have structs, I didn’t use them for modeling a B-tree. Instead, following Hillel Wayne’s decompose functions of structs tip, I used TLA+ functions.

(I actually started with trying to model it with structs, then I switched to functions, and then later found Hillel Wayne’s advice.)

I also chose to model the nodes and keys as natural numbers, so I could easily control the size of the model by specifying maximum values for keys and nodes. I didn’t really care about changing the size of the set of values as much, so I just modeled them as a set of model values.

In TLA+, that looks like this:

CONSTANTS Vals, MaxKey, MaxNode, ...

Keys == 1..MaxKey
Nodes == 1..MaxNode

Here are the functions I defined:

  • isLeaf[node] – true if a node is a leaf
  • keysOf[node] – the set of keys associated with a node (works for both inner and leaf nodes)
  • childOf[node, key] – given a node and a key, return the child node it points to (or NIL if it doesn’t point to one)
  • lastOf[node] – the node that an inner node’s extra (last) pointer points to
  • valOf[node, key] – the value associated with a key inside a leaf node

I also had a separate variable for tracking the root, unimaginatively called root.
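
To give a feel for how these functions fit together, here’s a sketch of how a lookup could walk the tree from the root down to a leaf. (The actual FindLeafNode operator in btree.tla may be written differently; the Min helper is my own, and this assumes the Naturals module is in scope.)

Min(S) == CHOOSE x \in S : \A y \in S : x <= y

\* Sketch: descend from node to the leaf that should contain key
RECURSIVE FindLeafSketch(_, _)
FindLeafSketch(node, key) ==
    IF isLeaf[node]
    THEN node
    ELSE LET larger == {k \in keysOf[node] : key < k}
         IN IF larger = {}
            \* key is >= every key in this node: follow the extra (last) pointer
            THEN FindLeafSketch(lastOf[node], key)
            \* otherwise, follow the child of the smallest key greater than key
            ELSE FindLeafSketch(childOf[node, Min(larger)], key)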

When modeling in TLA+, it’s helpful to define a type invariant. Here’s the relevant part of the type invariant for these functions.

TypeOk == /\ root \in Nodes
          /\ isLeaf \in [Nodes -> BOOLEAN]
          /\ keysOf \in [Nodes -> SUBSET Keys]
          /\ childOf \in [Nodes \X Keys -> Nodes \union {NIL}]
          /\ lastOf \in [Nodes -> Nodes \union {NIL}]
          /\ valOf \in [Nodes \X Keys -> Vals \union {NIL}]
          /\ ...

Modeling inserts

When inserting an element into a B-tree, you need to find which leaf it should go into. However, nodes are allocated a fixed amount of space, and as you add more data to a node, the amount of occupied space, known as occupancy, increases. Once the occupancy reaches a limit (max occupancy), the node has to be split into two nodes, with half of the data copied into the second node.

When a node splits, its parent node needs a new entry to point to the new node. This increases the occupancy of the parent node. If the parent is at max occupancy, it needs to split as well. Splitting can potentially propagate all of the way up to the root. If the root splits, then a new root node needs to be created on top of it.

This means there are a lot of different cases we need to handle on insert:

  • Is the target leaf at maximum occupancy?
  • If so, is its parent at maximum occupancy? (And on, and on)
  • Is the pointer from its parent associated with a key, or is it the extra (last) pointer?
  • After the node splits, should the key/value pair be inserted into the original node, or the new one?

In our kvstore model, the execution of an insertion was modeled in a single step. For modeling the insertion behavior in the btree model, I used multiple steps. To accomplish this, I defined a state variable that can take on the following values during insertion:

  • FIND_LEAF_TO_ADD – identify which leaf the key/value pair should be inserted into
  • WHICH_TO_SPLIT – identify which nodes require splitting
  • SPLIT_LEAF – split a leaf node
  • SPLIT_INNER – split an inner node
  • SPLIT_ROOT_LEAF – split the root (when it is a leaf node)
  • SPLIT_ROOT_INNER – split the root (when it is an inner node)

I also defined two new variables:

  • focus – stores the target leaf to add the data to
  • toSplit – a sequence (list) of the nodes in the chain from the leaf upwards that are at max occupancy and so require splitting.

I’ll show two of these actions: the ones associated with the FIND_LEAF_TO_ADD and WHICH_TO_SPLIT states.

FindLeafToAdd ==
    LET key == args[1]
        leaf == FindLeafNode(root, key)
    IN /\ state = FIND_LEAF_TO_ADD
       /\ focus' = leaf
       /\ toSplit' = IF AtMaxOccupancy(leaf) THEN <<leaf>> ELSE <<>>
       /\ state' = IF AtMaxOccupancy(leaf) THEN WHICH_TO_SPLIT ELSE ADD_TO_LEAF
       /\ UNCHANGED <<root, isLeaf, keysOf, childOf, lastOf, valOf, args, op, ret>>

WhichToSplit ==
    LET  node == Head(toSplit)
         parent == ParentOf(node)
         splitParent == AtMaxOccupancy(parent)
         noMoreSplits == ~splitParent  \* if the parent doesn't need splitting, we don't need to consider more nodes for splitting
    IN /\ state = WHICH_TO_SPLIT
       /\ toSplit' =
           CASE node = root   -> toSplit
             [] splitParent   -> <<parent>> \o toSplit
             [] OTHER         -> toSplit
       /\ state' =
            CASE node # root /\ noMoreSplits /\ isLeaf[node]  -> SPLIT_LEAF
              [] node # root /\ noMoreSplits /\ ~isLeaf[node] -> SPLIT_INNER
              [] node = root /\ isLeaf[node]                  -> SPLIT_ROOT_LEAF
              [] node = root /\ ~isLeaf[node]                 -> SPLIT_ROOT_INNER
              [] OTHER                                        -> WHICH_TO_SPLIT
       /\ UNCHANGED <<root, isLeaf, keysOf, childOf, lastOf, valOf, op, args, ret, focus>>

You can see that the WhichToSplit action starts to get a bit hairy because of the different cases.

Allocating new nodes

When implementing a B-tree, we need to dynamically allocate a new node when an existing node gets split, and when the root splits and we need a new root.

The way I modeled this was through the concept of a free node. I modeled the set of Nodes as a constant, and I treated some nodes as part of the existing tree, and other nodes as free nodes. When I needed a new node, I used one of the free ones.

I could have defined an isFree function, but instead I considered a node to be free if it is a leaf (which all nodes are initialized to in Init) without any keys (since all leaves that are part of the tree contain at least one key).

\* We model a "free" (not yet part of the tree) node as a leaf with no keys
IsFree(node) == isLeaf[node] /\ keysOf[node] = {}
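
When a split (or a new root) needs a fresh node, the model can then just pick any free one, along these lines (the AllocateNode name is my own; the actual actions in btree.tla may express this differently):

\* Pick an arbitrary free node to use as the new sibling or new root
AllocateNode == CHOOSE n \in Nodes : IsFree(n)

Note that CHOOSE is deterministic: for a given set of free nodes it always picks the same one, which is fine here, since all we need is some node that isn’t already part of the tree.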

Sanity checking with invariants

I defined a few invariants to check that the btree specification was behaving as expected. These caught some errors in my model as I went along. I also had an invariant that checks if there are any free nodes left. If not, it meant I needed to increase the number of nodes in the model relative to the number of keys.

\*
\* Invariants
\*
Inners == {n \in Nodes: ~isLeaf[n]}
Leaves == {n \in Nodes: isLeaf[n]}

InnersMustHaveLast == \A n \in Inners : lastOf[n] # NIL
KeyOrderPreserved == \A n \in Inners : (\A k \in keysOf[n] : (\A kc \in keysOf[childOf[n, k]]: kc < k))
LeavesCantHaveLast == \A n \in Leaves : lastOf[n] = NIL
KeysInLeavesAreUnique ==
    \A n1, n2 \in Leaves : ((keysOf[n1] \intersect keysOf[n2]) # {}) => n1=n2
FreeNodesRemain == \E n \in Nodes : IsFree(n)

Partially complete model (no deletions)

You can find the full model in the btree.tla file in the GitHub repo. Note that I didn’t get around to modeling deletions. I was happy with just get/insert/update, and I figured that it would be about as much work as modeling inserts, which was quite a bit.

Does our model match the behavioral spec?

Recall that we started by defining a behavioral specification for a key-value store. We can check that the B-tree implements this behavioral specification by defining a refinement mapping.

My btree model uses many of the same variables as the kvstore model, but there are two exceptions:

  • the kvstore model has a dict variable (a function that maps keys to values), which the btree model doesn’t have.
  • Both models have a variable named state, but in the kvstore model this variable only takes the values “ready” and “working”, whereas in the btree model it takes on other values, because the btree model has finer-grained states.

This means we need to define mappings from the variables in the btree model to the variables in the kvstore model. Here’s how I did it:

Mapping == INSTANCE kvstore WITH
  dict <- [key \in Keys |-> 
            IF \E leaf \in Leaves : key \in keysOf[leaf] 
            THEN LET leaf == CHOOSE leaf \in Leaves : key \in keysOf[leaf] 
                 IN valOf[leaf, key] ELSE MISSING],
  state <- IF state = READY THEN "ready" ELSE "working"

Refinement == Mapping!Spec

For the other variables (op, args, ret), the default mapping works just fine.

We can then use the model checker (TLC) to check the Refinement property. I’m using the TLA+ for VS Code plugin, which drives the model checker using config files. I added this to my btree.cfg to tell it to check the “Refinement” temporal property defined above.

PROPERTY
Refinement

By doing this, the model checker will verify that:

  • every valid initial state for the btree model is also a valid initial state for the kvstore model
  • every valid state transition for the btree model is also a valid state transition for the kvstore model.

And, indeed, the model checker succeeds!

Hey, why did the model checker succeed when you didn’t implement delete?

You’ll note that the original kvstore model handles a delete operation, but my btree model doesn’t. The reason that checking the refinement property didn’t pick this up is that it’s valid behavior under the definition of the kvstore model for it to never actually call delete.

In the formal methods jargon, the btree spec does not violate any of the safety properties of the kvstore spec (i.e., the btree spec never makes a state transition that would be disallowed by the kvstore spec). But to check that btree implements the delete function, we need what’s called a liveness property.

We can define a liveness property by altering the kvstore model to require that delete operations eventually be performed, by specifying fairness.

Here’s the general form of a TLA+ specification, where you define an initial-state predicate (Init) and a next-state predicate (Next). It looks like this.

Spec == Init /\ [][Next]_vars 

For the kvstore model, I added an additional fairness clause.

Spec == Init /\ [][Next]_vars /\ WF_op(\E k \in Keys: DeleteReq(k))

Informally, that extra clause says, “if the model ever reaches a state where it can run a delete operation, it should eventually perform the operation”. (More formally, it says: if the action \E k \in Keys: DeleteReq(k) becomes enabled and stays enabled forever, then a DeleteReq step that changes the op variable must eventually occur.)

Effectively, it’s now part of the kvstore specification that it perform delete operations. When I then run the model checker, I get the error: “Temporal properties were violated.”

Choosing the grain of abstraction

One of the nice things about TLA+ is that you can choose how granular you want your model to be. For example, in my model, I assumed that both inner and leaf nodes had occupancy determined by the number of elements. In a real B-tree, that isn’t true. The data stored in the inner nodes is a list of (key, pointer) pairs, which are fixed size. But the values stored in the leaves are variable size (for example, consider variable-sized database records like varchars and blobs).

Managing the storage is more complex within the leaves than in the inner nodes, and Petrov goes into quite a bit of detail on the structure of a B-tree leaf in his book. I chose not to model these details in my btree spec: The behavior of adding and deleting the data within a leaf node is complex enough that, if I were to model the internal behavior of leaves, I’d use a separate model that was specific to leaf behavior.

Formal modeling as a tool of thought

I found this a useful exercise for understanding how B-trees work. It was also helpful to practice my TLA+ skills. TLA+ allows you to abstract away all of the details that you consider irrelevant. However, for whatever remains, you need to specify things exhaustively: precisely how every variable changes, in every possible transition. This can be tedious, but it also forces you to think through each case.

You can find the TLA+ files at github.com/lorin/btree-tla.

The problem with invariants is that they change over time

 Cliff L. Biffle blogged a great write-up of a debugging odyssey at Oxide with the title Who killed the network switch? Here’s the bit that jumped out at me:

At the time that code was written, it was correct, but it embodied the assumption that any loaned memory would fit into one region.

That assumption became obsolete the moment that Matt implemented task packing, but we didn’t notice. This code, which was still simple and easy to read, was now also wrong.

This type of assumption is an example of an invariant, a property of the system that is supposed to be guaranteed to not change over time. Invariants play an important role in formal methods (for example, see the section Writing an invariant in Hillel Wayne’s Learn TLA+ site).

Now, consider the following:

  • Our systems change over time. In particular, we will always make modifications to support new functionality that we could not have foreseen earlier in the lifecycle of the system.
  • Our code often rests on a number of invariants, properties that are currently true of our system and that we assume will always be true.
  • These invariants are implicit: the assumptions themselves are not explicitly represented in the source code. That means there’s no easy way to, say, mechanically extract them via static analysis.
  • A change that violates an assumed invariant can happen arbitrarily far away from the code that depends on that invariant to function properly.

What this means is that these kinds of failure modes are inevitable. If you’ve been in this business long enough, you’ve almost certainly run into an incident where one of the contributors was an implicit invariant that was violated by a new change. If your system lives long enough, it’s going to change. And one of those changes is eventually going to invalidate an assumption that somebody made long ago, which was a reasonable assumption to make at the time.

Implicit invariants are, by definition, impossible to enforce explicitly. They are time bombs. And they are everywhere.

On productivity metrics and management consultants

The management consulting firm McKinsey & Company recently posted a blog post titled Yes, you can measure software developer productivity. The post prompted a lot of responses, such as Kent Beck and Gergely Orosz’s Measuring developer productivity? A response to McKinsey, Dan North’s The Worst Programmer I Know, and John Cutler’s The Ultimate Guide to Developer Counter-Productivity.

Now, I’m an avowed advocate of qualitative approaches to studying software development, but I started out my academic research career on the quantitative side, doing research into developer productivity metrics. And so I started to read the McKinsey post with the intention of writing a response, on why qualitative approaches are better for gaining insight into productivity issues. And I hope to write that post soon. But something jumped out at me that changed what I wanted to write about today. It was this line in particular (emphasis mine):

For example, one company that had previously completed a successful agile transformation learned that its developers, instead of coding, were spending too much time on low-value-added tasks such as provisioning infrastructure, running manual unit tests, and managing test data. Armed with that insight, it launched a series of new tools and automation projects to help with those tasks across the software development life cycle.

I realized that I missed the whole point of this post. The goal isn’t to gain insight, it’s to justify funding a new program inside an organization.

In order to effect change in an organization, you need political capital, even if you’re an executive. That’s because making change in an organization is hard, programs are expensive and don’t bear fruit for a long time, and so you need to get buy-in in order to make things happen.

McKinsey is a management consulting firm. One of the services that management consulting firms provide is that they will sell you political capital. They provide a report generated by external experts that their customers can use as leverage within their organizations to justify change programs.

As Lee Clarke describes in his book Mission Improbable: Using Fantasy Documents to Tame Disaster, technical reports written by experts have rhetorical, symbolic power, even if the empirical foundations of the reports are weak. Clarke’s book focuses on the unverified nature of disaster recovery documents, but the same holds true for reports based on software productivity metrics.

If you want to institute a change to a software development organization, and you don’t have the political capital to support it, then building a metrics program that will justify your project is a pretty good strategy if you can pull that off: if you can define metrics that will support the outcome that you want, and you can get the metrics program in place, then you can use it as ammunition for the new plan. (“We’re spending too much time on toil, we should build out a system to automate X”).

Of course, this sounds extremely cynical. You’re creating a metrics program where you know in advance what the metrics are going to show, with the purpose of justifying a new program you’ve already thought of? You’re claiming that you want to study a problem when you already have a proposed solution in the wings! But this is just how organizations work.

And, so, it makes perfect sense that McKinsey & Company would write a blog post like this. They are, effectively, a political-capital-as-a-service (PCaaS?) company. Helping executives justify programs inside of companies is what they do for a living. But they can’t simply state explicitly how the magic trick actually works, because then it won’t work anymore.

The danger is when the earnest folks, the ones who are seeking genuine insight into the nature of software productivity issues in their organization, read a post like this. Those are the ones I want to talk to about the value of a qualitative approach for gaining insight.

Why LFI is a tough sell

There are two approaches to doing post-incident analysis:

  • the (traditional) root cause analysis (RCA) perspective
  • the (more recent) learning from incidents (LFI) perspective

In the RCA perspective, the occurrence of an incident has demonstrated that there is a vulnerability that caused the incident to happen, and the goal of the analysis is to identify and eliminate the vulnerability.

In the LFI perspective, an incident presents the organization with an opportunity to learn about the system. The goal is to learn as much as possible with the time that the organization is willing to devote to post-incident work.

The RCA approach has the advantage of being intuitively appealing. The LFI approach, by contrast, has three strikes against it:

  1. LFI requires more time and effort than RCA
  2. LFI requires more skill than RCA
  3. It’s not obvious what advantages LFI provides over RCA.

I think the value of the LFI approach is based on assumptions that people don’t really think about, because these assumptions are not articulated explicitly.

In this post, I’m going to highlight two of them.

Nobody knows how the system really works

The LFI approach makes the following assumption: No individual in the organization will ever have an accurate mental model about how the entire system works. To put it simply:

  • It’s the stuff we don’t know that bites us
  • There’s always stuff we don’t know

By “system” here, I mean the socio-technical system, which includes both the software and what it does, and the humans who do the work to develop and operate the system.

You’ll see the topic of incorrect mental models discussed in the safety literature in various ways. For example, David Woods uses the term miscalibration to describe incorrect mental models, and Diane Vaughan writes about structural secrecy, which is a mechanism that leads to incorrect mental models.

But incorrect mental models are not something we talk much about explicitly in the software world. The RCA approach implicitly assumes there’s only a single thing that we didn’t know: the root cause of the incident. Once we find that, we’re done.

To believe that the LFI approach is worth doing, you need to believe that there is a whole bunch of things about the system that people don’t know, not just a single vulnerability. And there are some things that, say, Alice knows that Bob doesn’t, and that Alice doesn’t know that Bob doesn’t know.

Better system understanding leads to better decision making in the future

The payoff for RCA is clear: the elimination of a known vulnerability. But the payoff for LFI is a lot fuzzier: if the people in the organization know more about the system, they are going to make better decisions in the future.

The problem with articulating the value is that we don’t know when these future decisions will be made. For example, the decision might happen when responding to the next incident (e.g., now I know how to use that observability tool because I learned from how someone else used it effectively in the last incident). Or the decision might happen during the design phase of a future software project (e.g., I know to shard my services by request type because I’ve seen what can go wrong when “light” and “heavy” requests are serviced by the same cluster) or during the coding phase (e.g., I know to explicitly set a reasonable timeout because Java’s default timeout is way too high).

The LFI approach assumes that understanding the system better will advance the expertise of the engineers in the organization, and that better expertise means better decision making.

On the one hand, organizations recognize that expertise leads to better decision making: it’s why they are willing to hire senior engineers even though junior engineers are cheaper. On the other hand, hiring seems to be the only context where this is explicitly recognized. “This activity will advance the expertise of our staff, and hence will lead to better future outcomes, so it’s worth investing in” is the kind of mentality that is required to justify work like the LFI approach.

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure

All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.

Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”

Goal conflicts

Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.

If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”

Workarounds

How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)

Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.

If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools that they have to use?”

Automation surprises

Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).

If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”


Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.

Cache invalidation really is one of the hardest problems in computer science

My colleagues recently wrote a great post on the Netflix tech blog about a tough performance issue they wrestled with. They ultimately diagnosed the problem as false sharing, which is a performance problem that involves caching.

I’m going to take that post and write a simplified version of part of it here, as an exercise to help me understand what happened. After all, the best way to understand something is to try to explain it to someone else.

But note that the topic I’m writing about here is outside of my personal area of expertise, so caveat lector!

The problem: two bands of CPU performance

Here’s a graph from that post that illustrates the problem. It shows CPU utilization for different virtual machine instances (nodes) inside of a cluster. Note that all of the nodes are configured identically, including running the same application logic and taking the same traffic.

Note that there are two “bands”, a low band at around 15-20% CPU utilization, and a high band that varies a lot, from about 25%-90%.

Caching and multiple cores

Computer programs keep the data that they need in main memory. The problem with main memory is that accessing it is slow in computer time. According to this site, a CPU instruction cycle is about 400ps, and accessing main memory (DRAM access) is 50-100ns, which means it takes ~ 125 – 250 cycles. To improve performance, CPUs keep some of the memory in a faster, local cache.

There’s a tradeoff between the size of the cache and its speed, and so computer architects use a hierarchical cache design where they have multiple caches of different sizes and speeds. It was an interaction pattern with the fastest on-core cache (the L1 cache) that led to the problem described here, so that’s the cache we’ll focus on in this post.

If you’re a computer engineer designing a multi-core system where each core has on-core cache, your system has to implement a solution for the problem known as cache coherency.

Cache coherency

Imagine a multi-threaded program where each thread is running on a different core:

  • thread T1 runs on CPU 1
  • thread T2 runs on CPU 2

There’s a variable used by the program, which we’ll call x.

Let’s also assume that both threads have previously read x, so the memory associated with x is loaded in the caches of both. So the caches look like this:

Now imagine thread T1 modifies x, and then T2 reads x.


T1             T2
--             --
x = x + 1
              if(x==0) {
              // shouldn't execute this!
              }
         

The problem is that T2’s local cache has become stale, and so it reads a value that is no longer valid.

The term cache coherency refers to the problem of ensuring that local caches in a multi-core (or, more generally, distributed) system stay in sync.

This problem is solved by a hardware device called a cache controller. The cache controller can detect when values in a cache have been modified on one core, and whether another core has cached the same data. In this case, the cache controller invalidates the stale cache line. In the example above, the cache controller would invalidate the cache line in CPU 2’s cache. When T2 went to read the variable x, it would have to read the data from main memory into the core.

Cache coherency ensures that the behavior is correct, but every time a cache line is invalidated, the same memory has to be retrieved from main memory again, paying the performance penalty of a main-memory read.

The diagram above shows that the cache contains both the data and the addresses in main memory where the data comes from: we only need to invalidate cache entries that correspond to the same range of memory.

Data gets brought into cache in chunks

Let’s say a program needs to read data from main memory. For example, let’s say it needs to read the variable named x. Let’s assume x is implemented as a 32-bit (4 byte) integer. When the CPU reads from main memory, the memory that holds the variable x will be brought into the cache.

But the CPU won’t just read the variable x into cache. It will read a contiguous chunk of memory that includes the variable x into cache. On x86 systems, the size of this chunk is 64 bytes. This means that accessing the 4 bytes that encodes the variable x actually ends up bringing 64 bytes along for the ride.

These chunks of memory stored in the cache are referred to as cache lines.

False sharing

We now almost have enough context to explain the failure mode. Here’s a C++ code snippet from the OpenJDK repository (from src/hotspot/share/oops/klass.hpp):

class Klass : public Metadata {
  ...
  // Cache of last observed secondary supertype
  Klass*      _secondary_super_cache;
  // Array of all secondary supertypes
  Array<Klass*>* _secondary_supers;

This declares two pointer variables inside of the Klass class: _secondary_super_cache, and _secondary_supers. Because these two variables are declared one after the other, they will get laid out next to each other in memory.

The two variables are adjacent in main memory.

The _secondary_super_cache is, itself, a cache. It’s a very small cache, one that holds a single value. It’s used in a code path for dynamically checking if a particular Java class is a subtype of another class. This code path isn’t commonly used, but it does happen for programs that dynamically create classes at runtime.

Now imagine the following scenario:

  1. There are two threads: T1 on CPU 1, T2 on CPU 2
  2. T1 wants to write the _secondary_super_cache variable and already has the memory associated with the _secondary_super_cache variable loaded in its L1 cache
  3. T2 wants to read from the _secondary_supers variable and already has the memory associated with the _secondary_supers variable loaded in its L1 cache.

When T1 (CPU 1) writes to _secondary_super_cache, if CPU 2 has the same block of memory loaded in its cache, then the cache controller will invalidate that cache line in CPU 2.

But if that cache line contained the _secondary_supers variable, then CPU 2 will have to reload that data from memory to do its read, which is slow.

ssc refers to _secondary_super_cache, ss refers to _secondary_supers

This phenomenon, where the cache controller invalidates cached non-stale data that a core needed to access, which just so happens to be on the same cache line as stale data, is called false sharing.

What’s the probability of false sharing in this scenario?

In this case, the two variables are both pointers. On this particular CPU architecture, pointers are 64-bits, or 8 bytes. The L1 cache line size is 64 bytes. That means a cache line can store 8 pointers. Or, put another away, a pointer can occupy one of 8 positions in the cache line.

There’s only one scenario where the two variables don’t end up on the same cache line: when _secondary_super_cache occupies position 8, and _secondary_supers occupies position 1. In all of the other scenarios, the two variables will occupy the same cache line, and hence will be vulnerable to false sharing.

1 / 8 = 12.5%, and that’s roughly the fraction of nodes that were observed in the low band in this scenario.

And now I recommend you take another look at the original blog post, which has a lot more details, including how they solved this problem, as well as a new problem that emerged once they fixed this one.

There is no “Three Mile Island” event coming for software

In Critical Digital Services: An Under-Studied Safety-Critical Domain, John Allspaw asks:

Critical digital services has yet to experience its “Three-Mile Island” event. Is such an accident necessary for the domain to take human performance seriously? Or can it translate what other domains have learned and make productive use of those lessons to inform how work is done and risk is anticipated for the future?

I don’t think the software world will ever experience such an event.

The effect of TMI

The Three Mile Island accident (TMI) is notable, not because of the immediate impact on human lives, but because of the profound effect it had on the field of safety science.

Before TMI, the prevailing theories of accidents held that they were caused by issues like mechanical failures (e.g., bridge collapse, boiler explosion), unsafe operator practices, and mixing up physical controls (e.g., the switch that lowers the landing gear looks similar to the switch that lowers the flaps).

But TMI was different. It’s not that the operators were doing the wrong things: they did the right things based on their understanding of what was happening. But that understanding, which was based on the information they were getting from their instruments, didn’t match reality. As a result, the actions that they took contributed to the incident, even though they did what they were supposed to do. (For more on this, I recommend watching Richard Cook’s excellent lecture: It all started at TMI, 1979).

TMI led to a kind of Cambrian explosion of research into human error and its role in accidents. This is the beginning of the era where you see work from researchers such as Charles Perrow, Jens Rasmussen, James Reason, Don Norman, David Woods, and Erik Hollnagel.

Why there won’t be a software TMI

TMI was significant because it was an event that could not be explained using existing theories. I don’t think any such event will happen in a software system, because I think that every complex software system failure can be “explained”, even if the resulting explanation is lousy. No matter what the software failure looks like, someone will always be able to identify a “root cause”, and propose a solution (more automation, better procedures). I don’t think a complex software failure is capable of creating TMI style cognitive dissonance in our industry: we’re, unfortunately, too good at explaining away failures without making any changes to our priors.

We’ll continue to have Therac-25s, Knight Capitals, Air France 447s, 737 Maxs, 911 outages, Rogers outages, and Tesla autopilot deaths. Some of them will cause enormous loss of human life, and will result in legislative responses. But no such accident will compel the software industry to, as Allspaw puts it, take human performance seriously.

Our only hope is that the software industry eventually learns the lessons that safety science learned from the original TMI.

Up and down the abstraction hierarchy

As operators, when the system we operate is working properly, we use a functional description of the system to reason about its behavior.

Here’s an example, taken from my work on a delivery system. If somebody asks me, “Hey, Lorin, how do I configure my deployment so that a canary runs before it deploys to production?”, then I would tell them, “In your delivery config, add a canary constraint to the list of constraints associated with your production environment, and the delivery system will launch a canary and ensure it passes before promoting new versions to production.”

This type of description is functional; it’s the sort of verbiage you’d see in a functional spec. On the other hand, if an alert fires because the environment check rate has dropped precipitously, the first question I’m going to ask is, “did something deploy a code change?” I’m not thinking about function anymore, but I’m thinking of the lowest level of abstraction.

In the mid nineteen-seventies, the safety researcher Jens Rasmussen studied how technicians debugged electronic devices, and in the mid-eighties he proposed a cognitive model of how operators reason about a system when troubleshooting, in a paper titled The role of hierarchical knowledge representation in decisionmaking and system management. He called this model the abstraction hierarchy.

Rasmussen calls this model a means-ends hierarchy, where the “ends” are at the top (the function: what you want the system to do), and the “means” are at the bottom (how the system is physically realized). We describe the proper function of the system top-down, and when we successfully diagnose a problem with the system, we explain the problem bottom-up.

The abstraction hierarchy, explained with an example

The five levels of the abstraction hierarchy are:

  1. functional purpose
  2. abstract functions
  3. general functions
  4. physical functions
  5. physical form

To explain these, I’ll use the example of a car.

The functional purpose of the car is to get you from one place to another. But to make things simpler, let’s zoom in on the accelerator. The functional purpose of the accelerator is to make the car go faster.

The abstract functions include transferring power from the car’s power source to the wheels, as well as transferring information from the accelerator to that system about how much power should be delivered. You can think of abstract functions as being functions required to achieve the functional purpose.

The generalized functions are the generic functional building blocks you use to implement the abstract functions. In the case of the car, you need a power source, a mechanism for transforming the stored energy into mechanical energy, and a mechanism for transferring the mechanical energy to the wheels.

The physical functions capture how the generalized function is physically implemented. In an electric vehicle, your mechanism for transforming stored energy to mechanical energy is an electric motor; in a traditional car, it’s an internal combustion engine.

The physical form captures the construction details of how the physical function is implemented. For example, if it’s an electric vehicle that uses an electric motor, the physical form includes details such as where the motor is located in the car, what its dimensions are, and what materials it is made out of.

Applying the abstraction hierarchy to software

Although Rasmussen had physical systems in mind when he designed the hierarchy (his focus was on process control, and he worked at a lab that focused on nuclear power plants), I think the model can map onto software systems as well.

I’ll use the deployment system that I work on, Managed Delivery, as an example.

The functional purpose is to promote software releases through deployment environments, as specified by the service owner (e.g., first deploy to test environment, then run smoke tests, then deploy to staging, wait for manual judgment, then run a canary, etc.)

Here are some examples of abstract functions in our system.

  • There is an “environment check” control loop that evaluates whether each pending version of code is eligible for promotion to the next environment by checking its constraints.
  • There is a subsystem that listens for “new build” events and stores them in our database.
  • There is a “resource check” control loop that evaluates whether the currently deployed version matches the most recent eligible version.

For generalized functions, here are some larger scale building blocks we use:

  • a queue to consume build events generated by the CI system
  • a relational database to track the state of known versions
  • a workflow management system for executing the control loops

For the physical functions that realize the generalized functions:

  • SQS as our queue
  • MySQL Aurora as our relational database
  • Temporal as our workflow management system

For physical form, I would map these to:

  • source code representation (files and directory structure)
  • binary representation (e.g., container image, Debian package)
  • deployment representation (e.g., organization into clusters, geographical regions)

Consider: you don’t care about how your database is implemented, until you’re getting some sort of operational problem that involves the database, and then you really have to care about how it’s implemented to diagnose the problem.

Why is this useful?

If Rasmussen’s model is correct, then we should build operator interfaces that take the abstraction hierarchy into account. Rasmussen called this approach ecological interface design (EID), where the abstraction hierarchy is explicitly represented in the user interface, to enable operators to more easily navigate the hierarchy as they do their troubleshooting work.

I have yet to see an operator interface that does this well in my domain. One of the challenges is that you can’t rely solely on off-the-shelf observability tooling, because you need to have a model of the functional purpose and the abstract functions to build those models explicitly into your interface. This means that what we really need are toolkits so that organizations can build custom interfaces that can capture those top levels well. In addition, we’re generally lousy at building interfaces that traverse different levels: at best we have links from one system to another. I think the “single pane of glass” marketing suggests that people have some basic understanding of the problem (moving between different systems is jarring), but they haven’t actually figured out how to effectively move between levels in the same system.

Bad Religion: A review of Work Pray Code

When I worked as a professor at the University of Nebraska–Lincoln, after I had been there for a few months, the chair of the computer science department asked me during a conversation, “have you found a church community yet?” I had not. I had, however, found a synagogue. The choice wasn’t difficult: there were only two. Nobody asked me a question like that after I moved to San Jose, which describes itself as the heart of Silicon Valley.

Why Silicon Valley is so non-religious is the question that sociologist Carolyn Chen seeks to answer here. As a tenured faculty member at UC Berkeley, Chen is a Bay Area resident herself. Like so many of us here, she’s a transplant: she grew up in Pennsylvania and Southern California, and first moved to the area in 2013 to do research on Asian religions in secular spaces.

Chen soon changed the focus of her research from Asian religions to the work culture of tech companies. She observes that people tend to become less religious when they move to the area, and are less engaged in their local communities. Tech work is totalizing, absorbing employees’ entire lives. Tech companies care for many of the physical needs of their employees in a way that companies in other sectors do not. Tech companies provide meditation/mindfulness programs (the companies use these terms interchangeably) to help their employees stay productive, but it is a neutered version of meditation, stripped of its religious, Buddhist roots. Tech companies push up the cost of living, and provide private substitutes for public infrastructure, like shuttle buses.

Chen tries to weave these threads together into a narrative about how work substitutes for religion in the lives of tech workers in Silicon Valley. But the pieces just don’t fit together. Instead, they feel shoehorned in to support her thesis. And that’s a shame, because, as a Silicon Valley tech worker, many of the observations themselves ring true to my personal experience. Unlike Nebraska, Silicon Valley really is a very secular place, so much so that it was a plot point in an episode of HBO’s Silicon Valley. As someone who sends my children to religious school, I’m clearly in the minority at work. My employer provides amenities like free meals and shuttles. They even provide meditation rooms, access to guided meditations provided by the Mental Health Employee Resource Group, and subscriptions to the Headspace meditation app. The sky-high cost of living in Silicon Valley is a real problem for the area.

But Chen isn’t able to make the case that her thesis is the best explanation for this grab bag of observations. And her ultimate conclusion, that tech companies behave more and more like cults, just doesn’t match my own experiences working at a large tech company in Silicon Valley.

Most frustratingly, Chen doesn’t ever seem to ask the question, “are there other domains where some of these observations also hold?” So much of her description of the secular and insular nature of Silicon Valley tech workers applies equally well to academics, the culture that Chen herself is immersed in!

Take this excerpt from Chen:

Workplaces are like big and powerful magnets that attract the energy of individuals away from weaker magnets such as families, religious congregations, neighborhoods, and civic associations—institutions that we typically associate with “life” in the “work-life” binary. The magnets don’t “rob” or “extract”—words that we use to describe labor exploitation. Instead they attract the filings, monopolizing human energy by exerting an attractive rather than extractive force. By creating workplaces that meet all of life’s needs, tech companies attract the energy and devotion people would otherwise devote to other social institutions, ones that, traditionally and historically, have been sources of life fulfillment.

Work Pray Code, p197

Compare this to an excerpt from a very different book: Robert Sommer’s sardonic 1963 book Expertland (sadly, now out of print), which describes itself as “an unrestricted inside view of the world of scientists, professors, consultants, journals, and foundations, with particular attention to the quaint customs, distinctive dilemmas, and perilous prospects”.

Experts know very few real people. Except for several childhood friends or close relatives, the expert does not know anybody who drives a truck, runs a grocery store, or is vice-president of the local Chamber of Commerce. His only connection with these people is in some kind of service relationship; they are not his friends, colleagues, or associates. The expert feel completely out of place at Lion’s or Fish and Game meeting. If he is compelled to attend such gatherings, he immediately gravitates to any other citizen of Expertland who is present… He has no roots, no firm allegiances, and nothing to gain or lose in local elections… Because he doesn’t vote in local elections, join service clubs, or own the house he lives in, outsiders often feel that the expert is not a good citizen.

Expertland pp 2-3

Chen acknowledges that work is taking over the lives of all high-skilled professionals, not just tech workers. But I found work-life balance to be much worse in academia than at a Silicon Valley tech company! To borrow a phrase from the New Testament, And why beholdest thou the mote that is in thy brother’s eye, but considerest not the beam that is in thine own eye?

We value possession of experience, but not its acquisition

Imagine you’re being interviewed for a software engineering position, and the interviewer asks you: “Can you provide me with a list of the work items that you would do if you were hired here?” This is how the action item approach to incident retrospectives feels to me.

We don’t hire people based on their ability to come up with a set of work items. We’re hiring them for their judgment, their ability to make good engineering decisions and tradeoffs based on the problems that they will encounter at the company. In the interview process, we try to assess their expertise, which we assume they have developed based on their previous work experience.

Incidents provide us with excellent learning opportunities because they confront us with surprises. If we examine an incident in detail, we can learn something about our system behavior that we didn’t know before.

Yet, while we recognize the value of experienced candidates when we do hiring, we don’t seem to recognize the value of increasing the experience of our current employees. Incidents are a visceral type of experience, and reflecting on these sorts of experiences is what increases our expertise. But you have to reflect on them to maximize the value, and you have to share this information out to the organization so that it isn’t just the incident responders that can benefit from the experience.

To me, learning from incidents is about increasing the expertise of an organization by reflecting on and sharing out the experiences of surprising operational events. Action items are a dime a dozen. What I care about is improving the organization’s ability to engineer software.