Your lying virtual eyes

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup:

In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes how that we software operators don’t interact directly with the system. Instead, we interact through representations. In particular, we view representations of internal state of the system, and we manipulate these representations in order to effect changes, to control the system. Cook used the term line of representation to describe the split between the world of the technical (software) system and the world of the people who work with the technical system. The people are above the line of representation, and the technical system is below the line.

Above the line of representation are the people, organizations, and processes that shape, direct, and restore the technical artifacts that lie below that line.People who work above the line routinely describe what is below the line using concrete, realistic language.

Yet, remarkably, nothing below the line can be seen or acted upon directly. The displays, keyboards, and mice that constitute the line of representation are the only tangible evidence that anything at all lies below the line. All understandings of what lies below the line are constructed in the sense proposed by Bruno Latour and Steve Woolgar. What we “know”—what we can know—about what lies below the line depends on inferences made from representations that appear on the screens and displays.

In short, we can never actually see or change the system directly, all of our interactions mediated through software interfaces.

René Magritte would have appreciated Cook’s article

In this post, I want to talk about how this fact can manifest as incidents, and that our solutions rarely consider this problem. Let’s start off, as we so often do in the safety world, with the Three Mile Island accident.

Three Mile Island and the indicator light

I assume the reader has some familiarity with the partial meltdown that occurred at the Three Mile Island nuclear plant back in 1979. As it happens, there’s a great series of lectures by Cook on accidents. The topic of his first lecture is about how Three Mile Island changed the way safety specialists thought about the nature of accidents.

Here I want to focus on just one aspect of this incident: a particular indicator light in the Three Mile Island control room. During this incident, there was a type of pressure relief valve called a pilot-operated relief valve (PORV) that was stuck open. However, the indicator light for the state of this valve was off, which the operators interpreted (incorrectly, alas) as the valve being closed. Here I’ll quote the wikipedia article:

A light on a control panel, installed after the PORV had stuck open during startup testing, came on when the PORV opened. When that light—labeled Light on – RC-RV2 open —went out, the operators believed that the valve was closed. In fact, the light when on only indicated that the PORV pilot valve’s solenoid was powered, not the actual status of the PORV. While the main relief valve was stuck open, the operators believed the unlighted lamp meant the valve was shut. As a result, they did not correctly diagnose the problem for several hours.

What I found notable was the article’s comment about lack of operator training to handle this specific scenario, a common trope in incident analysis.

The operators had not been trained to understand the ambiguous nature of the PORV indicator and to look for alternative confirmation that the main relief valve was closed. A downstream temperature indicator, the sensor for which was located in the tail pipe between the pilot-operated relief valve and the pressurizer relief tank, could have hinted at a stuck valve had operators noticed its higher-than-normal reading. It was not, however, part of the “safety grade” suite of indicators designed to be used after an incident, and personnel had not been trained to use it. Its location behind the seven-foot-high instrument panel also meant that it was effectively out of sight.

Now, consider what happens if the agent acting on these sensors is an automated control system instead of a human operator.

Sensors, automation, and accidents: cases from aviation

In the aviation world, we have a combination of automation and human operators (pilots) who work together in real-time. The assumption is that if something goes wrong with the automation, the human can quickly take over and deal with the problem. But automation can make things too difficult for a human to be able to compensate for, and automation can be particularly vulnerable to sensor problems, as we can see in the following accidents:

Bombardier Learjet 60 accident, 2008

On September 19, 2008, in Columbia, South Carolina, a Bombardier Learjet 60 overran the runway during a rejected takeoff. As a consequence, four people aboard the plane, including the captain and first officer, were killed. In this case, the sensor issues were due to damage to electronics in the wheel well area after underinflated tires on the landing gear exploded.

The pilots reversed thrust to slow down the plane. However, the tires on the plane were under-inflated, and they exploded. As a result of the tire explosion, sensors in the wheel well area of the plane were damaged.

The thrust reverse system relies on sensor data to determine whether reversing thrust is a safe operation. Because of the sensor damage, the system determined that it was not safe to reverse thrust, and instead increased forward thrust. From the NTSB report:

In this situation, the EECs would transition from the reverse thrust power schedule to the
forward thrust power schedule during about a 2-second transition through idle power. During the entire sequence, the thrust reverser levers in the cockpit would remain in the reverse thrust idle position (as selected by the pilot) while the engines produced forward thrust. Because both the thrust reverser levers and the forward thrust levers share common RVDTs (one for the left engine and one for the right engine), the EECs, which receive TLA information from the RVDTs, would signal the engines to produce a level of forward thrust that generally corresponds with the level of reverse thrust commanded; that is, a pilot commanding full reverse thrust (for maximum deceleration of the airplane) would instead receive high levels of forward thrust (accelerating the airplane) according to the forward thrust power schedule

(My initial source for this was John Thomas’s slides.)

Air France 447, 2009

On June 1, 2009, Air France 447 crashed, killing all passengers and crew. The plane was an Airbus A330-200. In this accident, the sensor problem is believed to be caused by ice crystals that accumulated inside of pitot tube sensors, creating a blockage which lead to erroneous readings. Here’s a quote from an excellent Vanity Fair article on the crash:

Just after 11:10 P.M., as a result of the blockage, all three of the cockpit’s airspeed indications failed, dropping to impossibly low values. Also as a result of the blockage, the indications of altitude blipped down by an unimportant 360 feet. Neither pilot had time to notice these readings before the autopilot, reacting to the loss of valid airspeed data, disengaged from the control system and sounded the first of many alarms—an electronic “cavalry charge.” For similar reasons, the automatic throttles shifted modes, locking onto the current thrust, and the fly-by-wire control system, which needs airspeed data to function at full capacity, reconfigured itself from Normal Law into a reduced regime called Alternate Law, which eliminated stall protection and changed the nature of roll control so that in this one sense the A330 now handled like a conventional airplane. All of this was necessary, minimal, and a logical response by the machine.

This is what the safety researcher David Woods refers to as bumpy transfer of control, where the humans must suddenly and unexpectedly take over control of an automated system, which can lead to disastrous consequences.

Boeing 737 MAX 8 (2018, 2019)

On October 29, 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff, killing everyone on board. Five months later, on March 10, 2019, Ethiopian Airlines Flight 302 crashed six minutes after takeoff, also killing everyone on board. Both planes were Boeing 737 MAX 8. In both cases, the sensor problem was related to the angle-of-attack (AOA) sensor.

Lion Air Flight 610 investigation report:

The replacement AOA sensor that was installed on the accident aircraft had
been mis-calibrated during an earlier repair. This mis-calibration was not
detected during the repair.

Ethiopian Airline Flight 302 investigation report:

Shortly after liftoff, the left Angle of Attack sensor recorded value became erroneous and the left stick shaker activated and remained active until near the end of the recording.

An automation subsystem in the 737 MAX called Maneuvering Characteristics Augmentation System (MCAS) automatically pushed the nose down in response to the AOA sensor data.

What should we take away from these?

Here I’ve given examples from aviation, but sensor-automation problems are not specific to that domain. Here are a few of my own takeaways.

We designers can’t assume sensor data will be correct

The kinds of safety automation subsystems we build in tech are pretty much always closed-loop control systems. When designing such systems in the tech world, how often have you heard someone ask, “what happens if there’s a problem with the sensor data that the system is reacting to?”

This goes back to the line of representation problem: that no agent ever gets access to the true state of the system, it only gets access to some sort of representation. The irony here is that it doesn’t just apply to humans (above the line) making sense of signals, it also applies to technical system components (below the line!) making sense of signals from other technical components.

Designing a system that is safe in the face of sensor problems is hard

Again, from the NTSB report of the Learjet 60 crash:

Learjet engineering personnel indicated that the uncommanded stowage of the thrust reversers in the event of any system loss or malfunction is part of a fail-safe design that ensures that a system anomaly cannot result in a thrust reverser deployment in flight, which could adversely affect the airplane’s controllability. The design is intended to reduce the pilot’s emergency procedures workload and prevent potential mistakes that could exacerbate an abnormal situation.

The thrust reverser system behavior was designed by aerospace engineers to increase safety, and ended up making things worse! Good luck imagining all of these sorts of scenarios when you design your systems to increase safety.

Even humans struggle in the face of sensor problems

People are better equipped to handle sensor problems than automation, because we don’t seem to be able to build automation that can handle all of the possible kinds of sensor problems that we might throw at a problem.

But even for humans, sensor problems are difficult. While we’ll eventually figure out what’s going on, we’ll still struggle in the face of conflicting signals, as anyone who has responded to an incident can tell you. And in high-tempo situations, where we need to respond quickly enough or something terrible will happen (like in the Air France 447 case), we simply might not be able to respond quickly enough.

Instead of focusing on building the perfect fail-safe system to prevent this next time, I wish we’d spend more time thinking about, “how can we help the human figure out what the heck is happening when the input signals don’t seem to make sense”.

Second-class interactions are a first-class risk

Below is a screenshot of Vizceral, a tool that was built by a former teammate of mine at Netflix. It provides a visualization of the interactions between the various microservices.

Vizceral uses moving dots to depict how requests are currently flowing through the Netflix microservice architecture. Vizceral is able to do its thing because of the platform tooling, which provides support for generating a visualization like this by exporting a standard set of inter-process communication (IPC) metrics.

What you don’t see depicted here are the interactions between those microservices and the telemetry platform that ingest these metrics. There’s also logging and tracing data, and those get shipped off-box via different channels, but none of those channels show up in this diagram.

In fact, this visualization doesn’t represent interactions with any of the platform services. You won’t see bubbles that represent the compute platform or the CI/CD platform represented in a diagram like this, even though those platform services all interact with these application services in important ways.

I call the first category of interactions, the ones between the application services, as first-class, and the second category, the ones where the interactions involve platform services, as second-class. It’s those second-class interactions that I want to say more about.

These second-class interactions tend to have a large blast radius, because successful platforms by their nature have a large blast radius. There’s a reason why there’s so much havoc out in the world when AWS’s us-east-1 region has a problem: because so many services out there are using us-east-1 as a platform. Similarly, if you have a successful platform within your organization, then by definition it’s going to see a lot of use, which means that if it experiences a problem, it can do a lot of damage.

These platforms are generally more reliable than the applications that run atop them, because they have to be: platforms naturally have higher reliability requirements than the applications that run atop them. They have these requirements because they have a large blast radius. A flaky platform is a platform that contributes to multiple high-severity outages, and systems that contribute to multiple high-severity outages are the systems were reliability work gets prioritized.

And a reliable system is a system whose details you aren’t aware of, because you don’t need to be. If my car is very reliable, then I’m not going to build an accurate mental model of how my car works, because I don’t need to: it just works. In her book Human-Machine Reconfigurations: Plans and Situated Actions, the anthropologist Lucy Suchman used the term representation to describe the activity of explicitly constructing a mental model of how a piece of technology works, and she noted that this type of cognitive work only happens when we run into trouble. As Suchman puts it:

[R]epresentation occurs when otherwise transparent activity becomes in some way problematic

Hence the irony: these second-class interactions tend not to be represented in our system models when we talk about reliability, because they are generally not problematic.

And so we are lulled into a false sense of security. We don’t think about how the plumbing works, because the plumbing just works. Until the plumbing breaks. And then we’re in big trouble.

Book Review: Trust in Numbers

Trust in Numbers: The Pursuit of Objectivity in Science and Public Life by Theodore Porter, Distinguished Professor Emeritus of History, UCLA.

There are two general approaches to decision-making. One way is to make a judgment call. Informally, you could call this “trusting your gut”. Formally, you could describe this as a subjective, implicit process. The other way is to use an explicit approach that relies on objective, quantitative data, for example, doing a return-on-investment (ROI) calculation on a proposed project to decide whether to undertake the project. We use the term rigorous to describe these type of approaches, and we generally regard them as superior.

Here, Porter argues that quantitative, rigorous decision-making in a field is not a sign of its maturity, but rather its political weakness. In fields where technical professionals enjoy a significant amount of trust, these professionals do decision-making using personal judgment. While professionals will use quantitative data as input, their decisions are ultimately based on their own subjective impressions. (For example, see Julie Gainsburg’s notion of skeptical reverence in The Mathematical Disposition of Structural Engineers). In Porter’s account, we witnessed an increase of rigorous decision-making approaches in the twentieth century because of a lack of trust in certain professional fields, not because the quantitative approaches yielded better results.

It’s only in fields where the public does not grant deference to professionals that they are compelled to use explicit, objective processes to make the decisions. They are forced to show their work in a public way because they aren’t trusted. In some cases, a weak field adopts rigor to strengthen itself in the eyes of the public, such as experimental psychology’s adoption of experimental rigor (in particular, ESP research). Most of the case studies in the book come from areas where a field was compelled to adopt objective approaches because there was explicit political pressure and the field did not have sufficient power to resist.

In some cases, professionals did have the political clout to push back. An early chapter of the book discusses a problem that the British parliament wrestled with in the late nineteenth century: unreliable insurance companies that would happily collect premiums but then would eventually fail and would hence be unable to pay out when their customers submitted claims. A parliamentary committee formed and heard testimony from actuaries about how the government could determine whether an insurance company was sound. The experienced actuaries from reputable companies argued that it was not possible to define an objective procedure for assessing a company. They insisted that “precision is not attainable through actuarial methods. A sound company depends on judgment and discretion.” They were concerned that a mechanical, rule-based approach wouldn’t work:

Uniform rules of calculation, imposed by the state, might yield “uniform errors.” Charles Ansell, testifying before another select committee a decade earlier, argued similarly, then expressed his fear that the office of government actuary would fall to “some gentlemen of high mathematical talents, recently removed from one of our Universities, but without any experience whatever, though of great mathematical reputation.” This “would not qualify him in any way whatever for expressing a sound opinion on a practical point like that of the premiums in a life assurance.”

Trust in Numbers, pp108-109

Porter tells a similar story about American accountants. To stave off having standardized rules imposed on them, the American Institute of Accountants defined standards for its members, but these were controversial. One accountant, Walter Wilcox, argued in 1941 that “Cost is not a simple fact, but is a very elusive concept… Like other aspects of accounting, costs give a false impression of accuracy.” Similarly, when it came to government-funded projects, the political pressure was simply too strong to defer to government civil engineers, such as the French civil engineers who had to help decide which rail projects should be funded, or the U.S. Army Corps of Engineers who had to help make similar decisions about waterway projects such as dams and reservoirs. In the U.S., they settled on a cost-benefit analysis process, where the return on investment had to exceed 1.0 in order to justify a project. But, unsurprisingly, there were conflicts over how benefits were quantified, as well as over how to classify costs. While the output may have been a number, and the process was ostensibly objective, because it needed to be, ultimately these numbers were negotiable and assessments changed as a function of political factors.

In education, teachers were opposed to standardized testing, but did not have the power to overcome it. On the other hands, doctors were able to retain the use of their personal judgment for diagnosing patients. However, the regulators had sufficient power that they were able to enforce the use of objective measures for evaluating drugs, and hence were able to oversee some aspect of medical practice.

This tug of war between rigorous, mechanical objectivity and élite professional autonomy continues to this day. Professionals say “This requires private knowledge; trust us”. Sometimes, the public says “We don’t trust you anymore. Make the knowledge public!”, and the professionals have no choice but to relent. On the subject of whether we are actually better off when we trade away judgment for rigor, Porter is skeptical. I agree.

Consistency

“Welcome aboard to BigCo!”

“Thanks! I’m excited to be here. This is my first tech job, even if it is just an internship.”

“We’re going to start you off with some automated testing. You’re familiar with queues, right?”

“The data structure? Sure thing. First in, first out.”

“Great! We need some help validating that our queueing module is always working properly. We have a bunch of test scenarios written, and we want need to someone to check that the observed behavior of the queue is correct.”

“So, for input, do I get something like a history of interactions with the queue? Like this?”

q.add("A") -> OK
q.add("B") -> OK
q.pop() -> "A"
q.add("C") -> OK
q.pop() -> "B"
q.pop() -> "C"

“Exactly! That’s a nice example of a correct history for a queue. Can you write a program that takes a history like that as input and returns true if it’s a valid history?”

“Sure thing.”

“Excellent. We’ll also need your help generating new test scenarios.”

A few days later

“I think I found a scenario where the queue is behaving incorrectly when it’s called by a multithreaded application. I got a behavior that looks like this:”

q.add("A") -> OK
q.add("B") -> OK
q.add("C") -> OK
q.pop() -> "A"
q.pop() -> "C"
q.pop() -> "B"

“Hmmm. That’s definitely incorrect behavior. Can you show me the code you used to generate the behavior?”

“Sure thing. I add the elements to the queue in one thread, and then I spawn a bunch of new threads and dequeue in the new threads. I’m using the Python bindings to call the queue. My program looks like this.”

from bigco import Queue
from threading import Thread

def pop_and_print(q):
val = q.pop()
print(val)

q = Queue()
q.add("a")
q.add("b")
q.add("c")

Thread(target=pop_and_print, args=[q]).run()
Thread(target=pop_and_print, args=[q]).run()
Thread(target=pop_and_print, args=[q]).run()

“And the output looked like this:”

A
C
B

“Well, that’s certainly not the order I expect the output to be printed in, but how do you know the problem is that the queue is actually behaving correctly? It might be that the values were dequeued in the correct order, but because of the way the threads are scheduled, the print statements were simply executed in a different order than you expect.”

“Hmmm. I guess you’re right: just looking at the order of the printed output doesn’t give me enough information to tell if the queue is behaving correctly or not. Let me try printing out the thread ids and the timestamps.”

[id0] [t=1] before pop
[id0] [t=2] after pop
[id0] [t=3] output: A
[id1] [t=4] before pop
[id2] [t=5] before pop
[id2] [t=6] after pop
[id2] [t=7] output: C
[id1] [t=8] after pop
[id1] [t=9] output: B

“Oh, I see what happened! The operations of thread 1 and thread 2 were interleaved! I didn’t think about what might happen in that case. It must have been something like this:”

[id0]                  [id1]                  [id2]
q.pop()->"A"
print("A")
                       q.pop()->"B"
                                              q.pop()->"C"
                                              print("C")
                       print("B")

“Well, it looks like the behavior is still correct, the items got dequeued in the expected order, it’s just that they got printed out in a different order.”

The next day

“After thinking through some more multithreaded scenarios, I ran into a weird situation that I didn’t expect. It’s possible that the “pop” operations overlap in time across the two different threads. For example, “pop” might start on thread 1, and then in the middle of the pop operation, the operating system schedules thread 2, and it starts in the middle.”


[id0]             [id1]                  [id2]
q.pop(): start
q.pop(): end
print("A")
                  q.pop(): start
                  |                      q.pop(): start
                  q.pop(): end           |
                                         q.pop(): end
                                         print("C")
                  print("B")

“Let’s think about this. If id1 and id2 overlap in time like this, what do you think the correct output should be? ‘ABC’ or ‘ACB’?”

“I have no idea. I guess we can’t say anything!”

“So, if the output was ‘ABB’, you’d consider that valid?”

“Wait, no… It can’t be anything. It seems like either ‘ABC’ or ‘ACB’ should be valid, but not “ABB”.

“How about ‘BCA’? Would that be valid here?”

“No, I don’t think so. There’s no overlap between the first pop operation and the others, so it feels like the pop in id0 should return “A”.

“Right, that makes sense. So, in a concurrent world, we have potentially overlapping operations, and that program you wrote that checks queue behaviors doesn’t have any notion of overlap in it. So we need to be able to translate these potentially overlapping histories into the kind of sequential history your program can handle. Based on this conversation, we can use two rules:

1. If two operations don’t overlap (like the pop in id0 and the pop in id1) in time, then we use the time ordering (id0 happened before id1).

2. If two operations do overlap in time, then either ordering is valid.

“So, that means that when I check whether a multithreaded behavior is valid, I need to actually know the time overlap of the operations, and then generate multiple possible sequential behaviors, and check to see if the behavior that I witnesses corresponds to one of those?”

“Yes, exactly. This is a consistency model called linearizability. If our queue has linearizable consistency, that means that for any behavior you witness, you can define a linearization, an equivalent sequential behavior. Here’s an example.”

[id0]             [id1]                  [id2]
q.add("a")
q.add("b")
q.add("c")

q.pop(): start
q.pop()->"A"
                  q.pop(): start
                  |                      q.pop(): start
                  |                      q.pop()->"C"
                  q.pop()->"B"            

“The question is: can we generate a linearization based on the two rules above? We can! Because the “id1” and “id2” overlap, we can generate a linearization where the “id1″ operation happens first. One way to think about it is to identify a point in time between the start and end of the operation and pretend that’s when the operation really happens. I’ll mark these points in time with an ‘x’ in the diagram.

[id0]             [id1]                  [id2]
q.add("a")
q.add("b")
q.add("c")

q.pop(): start
x
q.pop()->"A"
                  q.pop(): start
                                         q.pop(): start
                  x
                                         x
                                         q.pop()->"C"
                  q.pop()->"B"            

“Now we can rewrite this as a linear history.”

q.add("a")
q.add("b")
q.add("c")
q.pop()->"A"
q.pop()->"B"                                     
q.pop()->"C"

Going distributed

“We’re expanding our market. We’re building on our queue technology to build a distributed queue. We’re also providing a new operation: “get”. When you call “get” on a distributed queue, you get the entire contents of the queue, in queue order.”

“Oh, so a valid history would be something like this?”

q.add("A") 
q.add("B")
q.get() -> ["A","B"]
q.add("C")
q.get() -> [A","B","C"]

“Exactly! One use case we’re targeting is using our queue for implementing online chat, so the contents of a queue might look like this:”

["Alice: How are you doing?",
 "Bob: I'm fine, Alice. How are you?",
"Alice: I'm doing well, thank you."]

CAPd

“OK, I did some testing with the distributed queue. ran into a problem with the distributed queue. Look at this history, it’s definitely wrong. Note that the ids here are process ids, not thread ids, because we’re running on different machines.


[id0]                         [id1]
q.add("Alice: Hello"): start
q.add(...) -> OK
                              q.add("Bob: "Hi"): start
                              q.add(...)->OK
                              q.get(): start
                              q.get()-> ["Bob: Hi"]

“When process 1 called ‘get’, it didn’t see the “Alice: Hello” entry, and that operation completed before the ‘get’ started! This history isn’t linearizable!”

“You’re right, our distributed queue isn’t linearizable. Note that we could modify this history to make it linearizable if process 0’s add operation did not complete until after the get:

[id0]                         [id1]
q.add("Alice: Hello"): start

                              q.add("Bob: "Hi"): start
                              q.add(...) -> OK
                              q.get(): start
                              q.get()-> ["Bob: Hi"]
q.add(...) -> OK

“Now we can produce a valid linearization from the history”

q.add("Bob: "Hi")
q.get()->["Bob: Hi"]
q.add("Alice: Hello")

“But look what we had to do: we had to delay the completion of that add operation. This is the lesson of the CAP theorem: if you want your distributed object to have linearizable consistency, then some operations might take an arbitrarily long time to complete. With our queue, we decided to prefer availability, so that all operations are guaranteed to complete within a certain period of time. Unfortunately, once we give up on linearizability, things can get pretty weird. Let’s see how many different types of weird things you can find.”

Monotonic reads

“Here’s a weird one. The ‘Hi’ message disappeared in the second read!”

[id0]              [id1]                  [id2]
                   q.add("A: Hello")
                                         q.add("B: Hi")
q.get()->["A: Hello", "B: Hi"]
q.get()->["A: Hello"]

“Yep, this violates a property called monotonic reads. Once process 0 has seen the effect of the add(“B: Hi”) operation, we expect that it will always see it in the future. This is an example of a session property. If the two gets happened on two different processes, this would not violate the monotonic reads property. For example, the following history doesn’t violate monotonic reads, even though the operations and ordering are the same. That’s because one of the gets is in process 0, and the other is in process 1, and the monotonic reads property only applies to reads within the same process.

[id0]              [id1]                  [id2]
                   q.add("A: Hello")
                                         q.add("B: Hi")
q.get()->["A: Hello", "B: Hi"]
                   q.get()->["A: Hello"]

“All right, let’s say we can guarantee monotonic reads. What other kinds of weirdness happen?”

Read your writes

[id0]
q.add("A: Hello")
q.get() -> []

Read your writes is one of the more intuitive consistency properties. If a process writes data, and then does a read, it should be able to see the effective of the write. Here we did a write, but we didn’t see it.”

Writes follow reads

[id0]
q.get() -> []
q.get() -> ["A: Hello"]
q.add("A: Hello")

“Here’s a case where read-your-writes isn’t violated (in fact, we don’t do any reads after the write), but something very strange has happened. We saw the effect of our write before we actually did the write! This violates the writes follow reads property. This also called session causality, and you can see why: when it was violated, we saw the effect before the cause!”

Monotonic writes

[id0]                      [id1]
q.add("A: Hi there!")
q.add("A: How are you?")
                          q.get() -> ["A: How are you?"]

“Hey, process 1 saw the ‘How are you?’ but not the ‘Hi there!’, even though they both came from process 0.”

“Yep. It’s weird that process 1 saw the second write from process 0, but it didn’t see the first write. This violates the monotonic writes property. Note that if the two writes were from different processes, this would not violate the property. For example, this would be fine:

[id0]                      [id1]
q.add("A: Hi there!")
                          q.add("A: How are you?")
                          q.get() -> ["A: How are you?"]

Consistent prefix

[id0]              [id1]
q.add("A: Hello")
                   q.add("B: Hi")
                   q.get()->["B: Hi"]
                   q.get()->["A: Hello", "B: Hi"]

“From process 1’s perspective, it looks like the history of the chat log changed! Somehow, ‘A: Hello’ snuck in before ‘B: Hi’, even though process 1 had already seen ‘B: Hi’.”

“Yes, this violates a property called consistent prefix. Note that this is different from monotonic reads, which is not violated in this case. (Sadly, the Jepsen consistency page doesn’t have an entry for consistent prefix).

Reasoning about correctness in a distributed world

One way to think about what it means for a data structure implementation to be correct is to:

  1. Define what it means for a particular execution history to be correct
  2. Check that every possible execution history for the implementation satisfies this correctness criteria.

Step 2 requires doing a proof, because in general there are too many possible execution histories for us to check exhaustively. But, even if we don’t actually go ahead and do the formal proof, it’s still useful to think through step 1: what it means for a particular execution history to be correct.

As we move from sequential data structures to concurrent (multithreaded) ones and then distributed ones, things get a bit more complicated.

Recall that for the concurrent case, in order to check that a particular execution history was correct, we had to see if we could come up with a linearization. We had to try and identify specific points in time when operations took effect to come up with a sequential version of the history that met our sequential correctness criteria.

In Principles of Eventual Consistency, Sebastian Burckhardt proposed a similar type of approach for validating the execution history of a distributed data structure. (This is the approach that Viotti & Vukolic extended. Kyle Kingsbury references Viotti and Vukolic on the Jepsen consistency models page that I’ve linked to several times here).

Execution histories as a set of events

To understand Burckhardt’s approach, we first have to understand how he models a distributed data structure execution history. He models an execution history as a set of events, where each event has associated with it:

  1. The operation (including arguments), e.g.:
    • get()
    • add(“Hi”)
  2. A return value, e.g.
    • [“Hi”, “Hello”]
    • OK

He also defines two relations on these events, returns-before and same-session.

Returns-before

The returns-before (rb) relation models time. If there are two events, e1, e2, and (e1,e2) is in rb, that means that the operation associated with e1 returned before the operation associated with e2 started.

Let’s take this example, where the two add operations overlap in time:

[id0]              [id1]                  [id2]
                   add("A: Hello"):start
                   |                      add("B: Hi"):start
                   |                      add("B: Hi"):end
                   add("A: Hello"):end

get()->["A: Hello", "B: Hi"]
                   get()->["A: Hello"]

I’ll use the following labeling for the events:

  • e1: add(“A: Hello”)
  • e2: add(“B: Hi”)
  • e3: get() -> [“A: Hello”, “B:Hi”]
  • e4: get() -> [“A: Hello”]

Here, rb={(e1,e3), (e1,e4),(e2,e3),(e2,e4),(e3,e4)}

Note that neither (e1,e2) nor (e2,e1) is in rb, because the two operations overlap in time. Neither one happens before the other.

Same-session

The same-session (ss) relation models the grouping of operations into processes. In the example above, there are three sessions (id0, id1, id2), and the same-session relation looks like this: ss={(e1,e1),(e1,e4),(e4,e1),(e4,e4),(e2,e2),(e3,e3)}. (Note: in this case, there are only two operations that are in the same session, e1 and e4

This is what the graph looks like with the returns-before (rb) and same-session (ss) relationship shown.

Explaining executions with visibility and arbitration

Here’s the idea behind Burckhardt’s approach. He defines consistency properties in terms of the returns-before (rb) relation, the same-session (ss) relation, and two other binary relations called visibility (vis) and arbitration (ar).

For example, an execution history satisfies read my writes if: (rb ∩ ss) ⊆ vis

In this scheme, an execution history is correct if we can come up with visibility and arbitration relations for the execution such that:

  1. All of the consistency properties we care about are satisfied by our visibility and arbitration relations.
  2. Our visibility and arbitration relations don’t violate any of our intuitions about causality.

You can think of coming up with visibility and arbitration relations for a history as coming up with an explanation for how the history makes sense. It’s a generalization of the process we used for linearizability where we picked a specific point in time where the operation took effect.

(1) tells us that we have to pick the right vis and ar (i.e., we have to pick a good explanation). (2) tells us that we don’t have complete freedom in picking vis and ar (i.e., our explanations have to make intuitive sense to human beings).

You can think of the visibility relation as capturing which write operations were visible to a read, and the arbitration relation as capturing how the data structure should reconcile conflicting writes.

Specifying behavior based on visibility and arbitration

Unfortunately, in a distributed world, we can no longer use the sequential specification for determining correct behavior. In the sequential world, writes are always totally ordered, but in the distributed world, we might have to deal with two different writes that aren’t ordered in a meaningful way.

For example, consider the following behavior:

    [id0]              [id1]                  [id2]
e1. add("A")
e2.                   add("B")
e3.                                          get()->???

What’s a valid value for ???. Let’s assume we’ve been told that: vis={(e1,e3),(e2,e3)}. This means that both writes are visible to process 3.

Based on our idea of how this data structure should work, e3 should either be: [“A”,”B”] or [“B”,”A”]. But the visibility relationship doesn’t provide enough information to tell us which one of these it was. We need some additional information to determine what the behavior should be.

This is where the arbitration relation comes in. This relation is always a total ordering. (For example, if ar specifies an ordering of e1->e2->e3, then the relation would be {(e1,e2),(e1,e3),(e2,e3)}. ).

If we define the behavior of our distributed queue such that the writes should happen in arbitration order, and we set ar=e1->e2->e3, then e3 would have to be get()->[“A”,”B”].

Let’s look at a few examples:

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["B","A"]
e4.                    get()->["B","A"]

The above history is valid, we can choose: vis={(e1,e3),(e2,e3),(e1,e4),(e2,e4)} and ar=e2->e1->e3->e4

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["A","B"]
e4.                    get()->["B","A"]

The above history is invalid, because there’s no arbitration and visibility relations we can come up with that can explain both e3 and e4.

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["A"]
e4.                    get()->["B","A"]

The above history is valid, because we can do: vis={(e1,e3),(e2,e4),(e3,e4))}, ar=e1->e2->e3->e4. Note that even though (e2,e3) is in ar, e2 is not visible to e3, and an operation only has to reflect the visible writes.

People don’t like it when you violate causality

Remember the example from “writes follow reads”?

[id0]
e1. q.get() -> []
e2. q.get() -> ["A: Hello"]
e3. q.add("A: Hello")

Note that we can come up with valid vis and ar relations for this history:

  • vis = {(e3,e2)}
  • ar = e1->e3->e2

But, despite the fact that we can come up with an explanation for this history, it doesn’t make sense to us, because e3 happened after e2. You can see why this is also referred to as session causality, because it violates our sense of causality: we read a write that happened in the future!

This is a great example of one of the differences between programming and formal modeling. It’s impossible to write a non-causal program (i.e., a program whose current output depends on future inputs). On the other hand, in formal modeling, we have no such restrictions, so we can always propose “impossible to actually happen in practice” behaviors to look at. So we often have to place additional constraints on the behaviors we generate with formal models to ensure that they’re actually realizable.

Sometimes we do encounter systems that record history in the wrong order, which makes the history look non-causal.

History is sometimes re-ordered in such a way that it looks like causality has been violated

Consistency as constraints on relations

The elegant thing about this relation-based model of execution histories is that the consistency models can be expressed in terms of them. Burckhardt conveniently defines two more relationships.

Session-order (so) is the ordering of events within each session, expressed as: so = rb ∩ ss

Happens-before (hb) is a causal ordering, in the sense of Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System paper. (e1,e2) is in hb if (e1,e2) is in so (i.e., e1 comes before e2 in the same session), or if (e1,e2) is in vis (i.e., e1 is visible to e2), or if there’s some transitive relationship (e.g., there’s some e3 such that (e1,e3) and (e3,e2) are in so or vis.

Therefore, happens-before is the transitive closure of so ∪ vis, which we write as: hb = (so ∪ vis)⁺ . We can define no circular causality as no cycles in the hb relation or, as Burckhardt writes it: NoCircularCausality = acyclic(hb)

If you made it all of the way here, I’d encourage you to check out Burckhardt’s Principles of Eventual Consistency book. You can get the PDF for free by clicking the “Publication” button the web page.

For want of a dollar

Back in August, The New York Times ran a profile of Morris Chang, the founder of TSMC.

It’s hard to overstate the role that this Taiwan-based semiconductor company plays in the industry. If you search for articles about it, you’ll see headlines like TSMC: The Most Important Tech Company You Never Heard Of and TSMC: how a Taiwanese chipmaker became a linchpin of the global economy.

What struck me in the NY Times article was this anecdote about Chang’s search for a job after he failed out of a Ph.D. program at MIT in 1955 (emphasis mine):

Two of the best offers arrived from Ford Motor Company and Sylvania, a lesser-known electronics firm. Ford offered Mr. Chang $479 a month for a job at its research and development center in Detroit. Though charmed by the company’s recruiters, Mr. Chang was surprised to find the offer was $1 less than the $480 a month that Sylvania offered.

When he called Ford to ask for a matching offer, the recruiter, who had previously been kind, turned hostile and told him he would not get a cent more. Mr. Chang took the engineering job with Sylvania. There, he learned about transistors, the microchip’s most basic component.

“That was the start of my semiconductor career,” he said. “In retrospect, it was a damn good thing.”

The course of history changed because an internal recruiter Ford refused to offer him an additional dollar a month ($11.46 in 2023 dollars) to match a competing offer!

This is the sort of thing that historians call contingency.

Missing the forest for the trees: the component substitution fallacy

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.

Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt of his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so, we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change, the more effort we’ll spend on validation and rollout of the change.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components then we won’t see the systemic issues. And it’s the systemic will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.