The Canva outage: another tale of saturation and resilience

Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post.

What happened at Canva

Trigger: deploying a new version of a page

The trigger for this incident was Canva deploying a new version of their editor page. It’s notable that there was nothing wrong with this new version. The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.

A stale traffic rule

Canva uses Cloudflare as their CDN. Being a CDN, Cloudflare has datacenters all over the world, which are interconnected by a private backbone. Now, I’m not a networking person, but my basic understanding of private backbones is that CDNs lease fibre-optic lines from telecom companies and use these leased lines to ensure that they have dedicated network connectivity and bandwidth between their sites.

Unfortunately for Canva, there was a previously unknown issue on Cloudflare’s side: Cloudflare wasn’t using their dedicated fibre-optic lines to route traffic between their Northern Virginia and Singapore datacenters. That traffic was instead, unintentionally, going over the public internet.

[A] stale rule in Cloudflare’s traffic management system [that] was sending user IPv6 traffic over public transit between Ashburn and Singapore instead of its default route over the private backbone.

Traffic between Northern Virginia (IAD) and Singapore (SIN) was incorrectly routed over the public network

The routes that this traffic took suffered from considerable packet loss. For Canva users in Asia, this meant that they experienced massive increases in latency when their web browsers attempted to fetch the javascript static assets from the CDN.

A stale rule like this is the kind of issue that the safety researcher James Reason calls a latent pathogen. It’s a problem that remains unnoticed until it emerges as a contributor to an incident.

High latency synchronizes the callers

Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000+ user requests for the JavaScript file waited on the same cache stream. This created a backlog of requests from users in Southeast Asia.

The first client attempts to fetch the new JavaScript files from the CDN, but the files aren’t there yet, so the CDN must fetch them from the origin. Because of the added latency, this takes a long time.

During this time, other clients connect and attempt to fetch the JavaScript from the CDN. But the CDN has not yet been populated with the files from the origin: that transfer is still in progress.

As Cloudflare notes in this blog post, subsequent clients that request a file that is still being populated in the cache would ordinarily have to wait until the file has been cached before they can download it. However, Cloudflare has implemented functionality called Concurrent Streaming Acceleration, which permits multiple clients to simultaneously download a file that is still in the process of being downloaded from the origin server.

The resulting behavior is that the CDN now behaves effectively as a barrier, with all of the clients slowly but simultaneously downloading the assets. With a traditional barrier, the processes that are waiting can proceed once all processes have reached the barrier. This isn’t quite the same, as the clients who are waiting can all proceed once the CDN completes downloading the asset from the origin.
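To make the barrier effect concrete, here is a minimal sketch in Python. It is a toy model of my own and not Cloudflare’s implementation: the function name `simulate`, the client count, and the delays are all made up for illustration, and it simplifies Concurrent Streaming Acceleration down to "every client finishes when the cache fill finishes".

```python
import asyncio


async def simulate(num_clients: int, origin_delay: float) -> list[float]:
    """Return each client's completion time, in seconds since the start."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    asset_ready = asyncio.Event()  # set when the cache fill from origin finishes

    async def fill_cache_from_origin() -> None:
        await asyncio.sleep(origin_delay)  # stands in for the slow, lossy link
        asset_ready.set()

    async def client(arrival_offset: float) -> float:
        await asyncio.sleep(arrival_offset)  # clients arrive spread out in time
        await asset_ready.wait()             # but all wait on the same cache fill
        return loop.time() - start

    fill = asyncio.create_task(fill_cache_from_origin())
    # Arrivals are spread evenly over the first half of the cache fill.
    arrivals = [i * origin_delay / (2 * num_clients) for i in range(num_clients)]
    finish_times = await asyncio.gather(*(client(t) for t in arrivals))
    await fill
    return list(finish_times)


finish_times = asyncio.run(simulate(num_clients=50, origin_delay=0.2))
spread = max(finish_times) - min(finish_times)
print(f"arrivals spread over ~0.1s, completions spread over {spread:.4f}s")
```

The clients arrive at different times, but they all complete at nearly the same instant. Whatever each client does next will therefore also happen at nearly the same instant.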

The transfer completes, the herd thunders

At 9:07 AM UTC, the asset fetch completed, and all 270,000+ pending requests were completed simultaneously.

20 minutes after Canva deployed the new Javascript assets to the origin server, the clients completed fetching them. The next action the clients take is to call Canva’s API service.

With the JavaScript file now accessible, client devices resumed loading the editor, including the previously blocked object panel. The object panel loaded simultaneously across all waiting devices, resulting in a thundering herd of 1.5 million requests per second to the API Gateway — 3x the typical peak load.

There’s one more issue that made this situation even worse: a known performance issue in the API gateway that was slated to be fixed.

A problematic call pattern to a library reduces service throughput

The API Gateways use an event loop model, where code running on event loop threads must not perform any blocking operations.

Two common threading models for request-response services are thread-per-request and async. For services that are I/O-bound (i.e., most of the time servicing each request is spent waiting for I/O operations to complete, typically networking operations), the async model has the potential to achieve better throughput. That’s because the concurrency of the thread-per-request model is limited by the number of operating-system threads. The async model services multiple requests per thread, and so it doesn’t suffer from the thread bottleneck. Canva’s API gateway implements the async model using the popular Netty library.

One of the drawbacks of the async model is the risk associated with the active thread getting blocked, because this can result in a significant performance penalty. The async model multiplexes multiple requests across an individual thread, and none of those requests can make progress when that thread is blocked. Programmers writing code in a service that uses the async model need to take care to minimize the number of blocking calls.
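Here is a small sketch of that penalty. Canva’s gateway is built on Netty, so this Python `asyncio` version is only an analogy, and the 10 ms wait is an invented stand-in for time spent waiting on a contended lock. Both variants wait the same amount of time per request; the only difference is whether the wait blocks the event loop thread.

```python
import asyncio
import time


async def handle_request(blocking: bool) -> None:
    if blocking:
        time.sleep(0.01)           # blocks the event loop thread (e.g. a lock wait)
    else:
        await asyncio.sleep(0.01)  # yields to the event loop while waiting


async def serve(num_requests: int, blocking: bool) -> float:
    """Handle num_requests concurrently and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handle_request(blocking) for _ in range(num_requests)))
    return time.perf_counter() - start


non_blocking_s = asyncio.run(serve(20, blocking=False))
blocking_s = asyncio.run(serve(20, blocking=True))
print(f"non-blocking: {non_blocking_s:.3f}s, blocking: {blocking_s:.3f}s")
```

In the non-blocking case the twenty waits overlap, so the total is close to a single wait. In the blocking case they run one after another on the same thread, so the total is roughly twenty times longer.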

Prior to this incident, we’d made changes to our telemetry library code, inadvertently introducing a performance regression. The change caused certain metrics to be re-registered each time a new value was recorded. This re-registration occurred under a lock within a third-party library.

In Canva’s case, the API gateway logic was making calls to a third-party telemetry library. They were calling the library in such a way that it took a lock, which is a blocking call. This reduced the effective throughput that the API gateway could handle.

Calls to the library led to excessive thread locking

Although the issue had already been identified and a fix had entered our release process the day of the incident, we’d underestimated the impact of the bug and didn’t expedite deploying the fix. This meant it wasn’t deployed before the incident occurred.

Ironically, they were aware of this problematic call pattern, and they were planning on deploying a fix the day of the incident(!).

As an aside, it’s worth noting the role of telemetry logic behavior in the recent OpenAI incident, and of the locking behavior of a tracing library in a complex performance issue that Netflix experienced. Observability giveth reliability, and observability taketh reliability away.

Canva is now in a situation where the API gateway is receiving much more traffic than it was provisioned to handle, and is also suffering from a performance regression that reduces its ability to handle traffic even more.

Now let’s look at how the system behaved under these conditions.

The load balancer turns into an overload balancer

Because the API Gateway tasks were failing to handle the requests in a timely manner, the load balancers started opening new connections to the already overloaded tasks, further increasing memory pressure.

A load balancer sits in front of a service and distributes the incoming requests across the units of compute. Canva runs atop ECS, so the individual units are called tasks, and the group is called a cluster (you can think of these as being equivalent to pods and replicasets in Kubernetes-land).

The load balancer will only send requests to a task that is healthy. If a task is unhealthy, then it stops being considered as a candidate target destination for the load balancer. This yields good results if the overall cluster is provisioned to handle the load: the traffic gets redirected away from the unhealthy tasks and onto the healthy ones.

Load balancer only sends traffic to the healthy tasks

But now consider the scenario where all of the tasks are operating close to capacity. As tasks go unhealthy, the load balancer will redistribute the load to the remaining “healthy” tasks, which increases the likelihood those tasks get pushed into an unhealthy state.

Redirecting traffic to the almost-overloaded healthy nodes will push them over

This is a classic example of a positive feedback loop: the more tasks go unhealthy, the more traffic the healthy tasks receive, and the more likely those tasks will go unhealthy as well.
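The feedback loop is easy to see in a toy model. This is a sketch under simplifying assumptions of my own, not a description of how any real load balancer works: total load is constant, it is spread evenly across healthy tasks, and any round in which the per-task load exceeds capacity costs the cluster one more task. The function name and numbers are made up.

```python
def simulate_cascade(
    total_load: float, num_tasks: int, capacity: float, initial_failures: int = 1
) -> list[int]:
    """Return the healthy task count after each round of rebalancing."""
    healthy = num_tasks - initial_failures
    history = [num_tasks, healthy]
    # While the remaining tasks are over capacity, another one goes unhealthy
    # and its share of the load is spread over the survivors.
    while healthy > 0 and total_load / healthy > capacity:
        healthy -= 1
        history.append(healthy)
    return history


# Plenty of headroom: losing one task is absorbed by the other nine.
print(simulate_cascade(total_load=800, num_tasks=10, capacity=100))
# Close to capacity: losing one task takes the whole cluster down.
print(simulate_cascade(total_load=950, num_tasks=10, capacity=100))
```

With 800 units of load, the nine survivors each take about 89 and the cluster is stable. With 950, the nine survivors each take about 106, which is over capacity, and every subsequent failure makes the per-task load worse.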

Autoscaling can’t keep pace

So, now the system is saturated, and the load balancer is effectively making things worse. Instead of shedding load, it’s concentrating load on the tasks that aren’t overloaded yet.

Now, this is the cloud, and the cloud is elastic, and we have a wonderful automation system called the autoscaler that can help us in situations of overload by automating provisioning new capacity.

Only, there’s a problem here, and that’s that the autoscaler simply can’t scale up fast enough. And the reason it can’t scale up fast enough is because of another automation system that’s intended to help in times of overload: Linux’s OOM killer.

The growth of off-heap memory caused the Linux Out Of Memory Killer to terminate all of the running containers in the first 2 minutes, causing a cascading failure across all API Gateway tasks. This outpaced our autoscaling capability, ultimately leading to all requests to canva.com failing.

Operating systems need access to free memory in order to function properly. When all of the memory is consumed by running processes, the operating system runs into trouble. To guard against this, Linux has a feature called the OOM killer which will automatically terminate a process when the operating system is running too low on memory. This frees up memory, enabling the OS to keep functioning.

So, you have the autoscaler which is adding new tasks, and the OOM killer which is quickly destroying existing tasks that have become overloaded.

It’s notable that Humphries uses the term outpaced. This sort of scenario is a common failure mode in complex system failures, where the system gets into a state where it can’t keep up. This phenomenon is called decompensation. Here’s resilience engineering pioneer David Woods describing decompensation on John Willis’s Profound Podcast:

And lag is really saturation in time. That’s what we call decompensation, right? I can’t keep pace, right? Events are moving forward faster. Trouble is building and compounding faster than I, than the team, than the response system can decide on and deploy actions to affect. So I can’t keep pace. – David Woods

Adapting the system to bring it back up

At this point, the API gateway cluster is completely overwhelmed. From the timeline:

9:07 AM UTC – Network issue resolved, but the backlog of queued requests result in a spike of 1.5 million requests per second to the API gateway.

9:08 AM UTC – API Gateway tasks begin failing due to memory exhaustion, leading to a full collapse.

When your system is suffering from overload, there are basically two strategies:

increase the capacity
reduce the load

Wisely, the Canva engineers pursued both strategies in parallel.

Max capacity, but it still isn’t enough

Montgomery Scott, my nominee for patron saint of resilience engineering

We attempted to work around this issue by significantly increasing the desired task count manually. Unfortunately, it didn’t mitigate the issue of tasks being quickly terminated.

The engineers tried to increase capacity manually, but even with the manual scaling, the load was too much: the OOM killer was taking the tasks down too quickly for the system to get back to a healthy state.

Load shedding, human operator edition

The engineers had to improvise a load shedding solution in the moment. The approach they took was to block traffic at the CDN layer, using Cloudflare.

At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.

It’s worth noting here that while Cloudflare contributed to this incident with the stale rule, the fact that they could dynamically configure Cloudflare firewall rules meant that Cloudflare also contributed to the mitigation of this incident.
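Here is a toy model of why adding capacity alone didn’t work and blocking traffic did. The numbers and the function name are invented, and the model is deliberately crude: on every tick some new tasks start, and if the per-task load exceeds what a task can handle, I assume every running task runs out of memory and is killed.

```python
def tasks_over_time(
    load: float, capacity_per_task: float, launches_per_tick: int, ticks: int
) -> list[int]:
    """Return the number of healthy tasks at the end of each tick."""
    healthy = 0
    history = []
    for _ in range(ticks):
        healthy += launches_per_tick  # autoscaler (or a human) adds tasks
        if healthy and load / healthy > capacity_per_task:
            healthy = 0               # overloaded tasks are OOM-killed
        history.append(healthy)
    return history


# Autoscaling under full load: new tasks are overwhelmed as soon as they start.
print(tasks_over_time(load=1500, capacity_per_task=10, launches_per_tick=20, ticks=5))
# Scaling five times faster still isn't enough to get ahead of the load.
print(tasks_over_time(load=1500, capacity_per_task=10, launches_per_tick=100, ticks=5))
# With traffic blocked upstream, healthy tasks finally accumulate.
print(tasks_over_time(load=0, capacity_per_task=10, launches_per_tick=20, ticks=5))
```

In this model the cluster needs 150 tasks to carry the load, and it can never get there while the load is present, because every partial fleet is destroyed before the next batch arrives. Removing the load is what lets capacity build up.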

Ramping the traffic back up

Here they turned off all of their traffic to give their system a chance to go back to healthy. But a healthy system under zero load behaves differently from a healthy system under typical load. If you just go back from zero to typical, there’s a risk that you push the system back into an unhealthy state. (One common problem is that autoscaling will have scaled down multiple services because there’s no load.)

Once the number of healthy API Gateway tasks stabilized to a level we were comfortable with, we incrementally restored traffic to canva.com. Starting with Australian users under strict rate limits, we gradually increased the traffic flow to ensure stability before scaling further.

The Canva engineers had the good judgment to ramp up the traffic incrementally rather than turn it back on all at once. They started restoring at 9:45 AM UTC, and were back to taking full traffic at 10:04 AM.
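The shape of an incremental ramp can be sketched in a few lines. This is my own illustration rather than Canva’s procedure: the step sizes are invented, and `stays_healthy` is a hypothetical probe standing in for "admit this fraction of normal traffic, wait, and check that the tasks are still healthy".

```python
from typing import Callable, Sequence


def ramp_traffic(
    steps: Sequence[float], stays_healthy: Callable[[float], bool]
) -> float:
    """Walk up through traffic fractions, holding at the last one that stayed healthy."""
    admitted = 0.0
    for fraction in steps:
        # Hypothetical probe: admit `fraction` of normal traffic and observe.
        if not stays_healthy(fraction):
            break
        admitted = fraction
    return admitted


steps = [0.05, 0.10, 0.25, 0.50, 1.00]
# Toy health check: the fleet can currently absorb 60% of normal load.
print(ramp_traffic(steps, lambda fraction: fraction <= 0.60))
```

In this toy run the ramp holds at 50% rather than jumping to full traffic. The sketch assumes that a failed probe can be backed out cleanly, which a real system may not allow; that is one reason to keep the early steps small.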

Some general observations

All functional requirements met

I always like to call out situations where, from a functional point of view, everything was actually working fine. In this case, even though there was a stale rule in the Cloudflare traffic management system, and there was a performance regression in the API gateway, everything was working correctly from a functional perspective: packets were still being routed between Singapore and Northern Virginia, and the API gateway was still returning the proper responses for individual requests before it got overloaded.

Rather, these two issues were both performance problems. Performance problems are much harder to spot, and the worst are the ones that you don’t notice until you’re under heavy load.

The irony is that, as an organization gets better at catching functional bugs before they hit production, more and more of the production incidents they face will be related to these more difficult-to-detect-early performance issues.

Automated systems made the problem worse

There were a number of automated systems in play whose behavior made this incident more difficult to deal with.

The Concurrent Streaming Acceleration functionality synchronized the requests from the clients. The OOM killer reduced the time it took for a task to be seen as unhealthy by the load balancer, and the load balancer in turn increased the rate at which tasks went unhealthy.

None of these systems were designed to handle this sort of situation, so they could not automatically change their behavior.

The human operators changed the way the system behaved

It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. They were able to leverage an existing resource, Cloudflare’s firewall functionality, to accomplish this. Based on the description of the action items, I suspect they had never used Cloudflare’s firewall to do this type of load shedding before. But it worked! They successfully adapted the system behavior.

We’re building a detailed internal runbook to make sure we can granularly reroute, block, and then progressively scale up traffic. We’ll use this runbook to quickly mitigate any similar incidents in the future.

This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.

As I’ve written about previously, Woods talks about the idea of a competence envelope. The competence envelope is sort of a conceptual space of the types of inputs that your system can handle. Incidents occur when your system is pushed to operate outside of its competence envelope, such as when it gets more load than it is provisioned to handle:

The competence envelope is a good way to think about the difference between robustness and resilience. You can think of robustness as describing the competence envelope itself: a more robust system may have a larger competence envelope, because it is designed to handle a broader range of problems.

However, every system has a finite competence envelope. The difference between a resilient and a brittle system is how that system behaves when it is pushed just outside of its competence envelope.

Incidents happen when the system is pushed outside of its competence envelope

A resilient system can change the way it behaves when pushed outside of the competence envelope due to an incident in order to extend the competence envelope so that it can handle the incident. That’s why we say it has adaptive capacity. On the other hand, a brittle system is one that cannot adapt effectively when it exceeds its competence envelope. A system can be very robust, but also brittle: it may be able to handle a very wide range of problems, but when it faces a scenario it wasn’t designed to handle, it can fall over hard.

The sort of adaptation that resilience demands requires human operators: our automation simply doesn’t have a sophisticated enough model of the world to be able to handle situations like the one that Canva found itself in.

In general, action items after an incident focus on expanding the competence envelope: making changes to the system to handle the scenario that just happened. Improving adaptive capacity involves a different kind of work than improving system robustness.

We need to build in the ability to reconfigure our systems in advance, without knowing exactly what sorts of changes we’ll need to make. The Canva engineers had some powerful operational knobs at their disposal through the Cloudflare firewall configuration. This allowed them to make changes. The more powerful and generic these sorts of dynamic configuration features are, the more room for maneuver we have. Of course, dynamic configuration is also dangerous, and is itself a contributor to incidents. Too often we focus solely on the dangers of such functionality in creating incidents, without seeing its ability to help us reconfigure the system to mitigate incidents.

Finally, these sorts of operator interfaces are of no use if the responders aren’t familiar with them. Ultimately, the more your responders know about the system, the better position they’ll be in to implement these adaptations. Changing an unhealthy system is dangerous: no matter how bad things are, you can always accidentally make things worse. The more knowledge about the system you can bring to bear during an incident, the better position you’ll be in to adapt your system to extend that competence envelope.

Your lying virtual eyes

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup

In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, we interact through representations. In particular, we view representations of internal state of the system, and we manipulate these representations in order to effect changes, to control the system. Cook used the term line of representation to describe the split between the world of the technical (software) system and the world of the people who work with the technical system. The people are above the line of representation, and the technical system is below the line.

Above the line of representation are the people, organizations, and processes that shape, direct, and restore the technical artifacts that lie below that line. People who work above the line routinely describe what is below the line using concrete, realistic language.

Yet, remarkably, nothing below the line can be seen or acted upon directly. The displays, keyboards, and mice that constitute the line of representation are the only tangible evidence that anything at all lies below the line. All understandings of what lies below the line are constructed in the sense proposed by Bruno Latour and Steve Woolgar. What we “know”—what we can know—about what lies below the line depends on inferences made from representations that appear on the screens and displays.

In short, we can never actually see or change the system directly: all of our interactions are mediated through software interfaces.

René Magritte would have appreciated Cook’s article

In this post, I want to talk about how this fact can manifest as incidents, and that our solutions rarely consider this problem. Let’s start off, as we so often do in the safety world, with the Three Mile Island accident.

Three Mile Island and the indicator light

I assume the reader has some familiarity with the partial meltdown that occurred at the Three Mile Island nuclear plant back in 1979. As it happens, there’s a great series of lectures by Cook on accidents. The topic of his first lecture is about how Three Mile Island changed the way safety specialists thought about the nature of accidents.

Here I want to focus on just one aspect of this incident: a particular indicator light in the Three Mile Island control room. During this incident, there was a type of pressure relief valve called a pilot-operated relief valve (PORV) that was stuck open. However, the indicator light for the state of this valve was off, which the operators interpreted (incorrectly, alas) as the valve being closed. Here I’ll quote the wikipedia article:

A light on a control panel, installed after the PORV had stuck open during startup testing, came on when the PORV opened. When that light—labeled Light on – RC-RV2 open —went out, the operators believed that the valve was closed. In fact, the light when on only indicated that the PORV pilot valve’s solenoid was powered, not the actual status of the PORV. While the main relief valve was stuck open, the operators believed the unlighted lamp meant the valve was shut. As a result, they did not correctly diagnose the problem for several hours.
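The gap between the indicator and the valve can be written down in a few lines. This is only an illustrative model of the situation described in the quote, with class and field names of my own choosing: the light is wired to the commanded state, while the thing the operators care about is the physical state.

```python
from dataclasses import dataclass


@dataclass
class ReliefValve:
    solenoid_powered: bool  # what the control system commanded
    stuck_open: bool        # physical fault, invisible to the indicator

    @property
    def indicator_lit(self) -> bool:
        # The light reports the command sent to the valve...
        return self.solenoid_powered

    @property
    def actually_open(self) -> bool:
        # ...not whether the valve is physically open.
        return self.solenoid_powered or self.stuck_open


valve = ReliefValve(solenoid_powered=False, stuck_open=True)
print(f"indicator lit: {valve.indicator_lit}, actually open: {valve.actually_open}")
```

In the stuck-open case the indicator is dark while the valve is open. An operator who can only read `indicator_lit` has no way to tell this state apart from a properly closed valve.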

What I found notable was the article’s comment about lack of operator training to handle this specific scenario, a common trope in incident analysis.

The operators had not been trained to understand the ambiguous nature of the PORV indicator and to look for alternative confirmation that the main relief valve was closed. A downstream temperature indicator, the sensor for which was located in the tail pipe between the pilot-operated relief valve and the pressurizer relief tank, could have hinted at a stuck valve had operators noticed its higher-than-normal reading. It was not, however, part of the “safety grade” suite of indicators designed to be used after an incident, and personnel had not been trained to use it. Its location behind the seven-foot-high instrument panel also meant that it was effectively out of sight.

Now, consider what happens if the agent acting on these sensors is an automated control system instead of a human operator.

Sensors, automation, and accidents: cases from aviation

In the aviation world, we have a combination of automation and human operators (pilots) who work together in real-time. The assumption is that if something goes wrong with the automation, the human can quickly take over and deal with the problem. But automation can make things too difficult for a human to be able to compensate for, and automation can be particularly vulnerable to sensor problems, as we can see in the following accidents:

Bombardier Learjet 60 accident, 2008

On September 19, 2008, in Columbia, South Carolina, a Bombardier Learjet 60 overran the runway during a rejected takeoff. As a consequence, four people aboard the plane, including the captain and first officer, were killed. In this case, the sensor issues were due to damage to electronics in the wheel well area after underinflated tires on the landing gear exploded.

The pilots reversed thrust to slow down the plane. However, the tires on the plane were under-inflated, and they exploded. As a result of the tire explosion, sensors in the wheel well area of the plane were damaged.

The thrust reverse system relies on sensor data to determine whether reversing thrust is a safe operation. Because of the sensor damage, the system determined that it was not safe to reverse thrust, and instead increased forward thrust. From the NTSB report:

In this situation, the EECs would transition from the reverse thrust power schedule to the forward thrust power schedule during about a 2-second transition through idle power. During the entire sequence, the thrust reverser levers in the cockpit would remain in the reverse thrust idle position (as selected by the pilot) while the engines produced forward thrust. Because both the thrust reverser levers and the forward thrust levers share common RVDTs (one for the left engine and one for the right engine), the EECs, which receive TLA information from the RVDTs, would signal the engines to produce a level of forward thrust that generally corresponds with the level of reverse thrust commanded; that is, a pilot commanding full reverse thrust (for maximum deceleration of the airplane) would instead receive high levels of forward thrust (accelerating the airplane) according to the forward thrust power schedule

(My initial source for this was John Thomas’s slides.)

Air France 447, 2009

On June 1, 2009, Air France 447 crashed, killing all passengers and crew. The plane was an Airbus A330-200. In this accident, the sensor problem is believed to be caused by ice crystals that accumulated inside of pitot tube sensors, creating a blockage which led to erroneous readings. Here’s a quote from an excellent Vanity Fair article on the crash:

Just after 11:10 P.M., as a result of the blockage, all three of the cockpit’s airspeed indications failed, dropping to impossibly low values. Also as a result of the blockage, the indications of altitude blipped down by an unimportant 360 feet. Neither pilot had time to notice these readings before the autopilot, reacting to the loss of valid airspeed data, disengaged from the control system and sounded the first of many alarms—an electronic “cavalry charge.” For similar reasons, the automatic throttles shifted modes, locking onto the current thrust, and the fly-by-wire control system, which needs airspeed data to function at full capacity, reconfigured itself from Normal Law into a reduced regime called Alternate Law, which eliminated stall protection and changed the nature of roll control so that in this one sense the A330 now handled like a conventional airplane. All of this was necessary, minimal, and a logical response by the machine.

This is what the safety researcher David Woods refers to as bumpy transfer of control, where the humans must suddenly and unexpectedly take over control of an automated system, which can lead to disastrous consequences.

Boeing 737 MAX 8 (2018, 2019)

On October 29, 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff, killing everyone on board. Five months later, on March 10, 2019, Ethiopian Airlines Flight 302 crashed six minutes after takeoff, also killing everyone on board. Both planes were Boeing 737 MAX 8. In both cases, the sensor problem was related to the angle-of-attack (AOA) sensor.

Lion Air Flight 610 investigation report:

The replacement AOA sensor that was installed on the accident aircraft had been mis-calibrated during an earlier repair. This mis-calibration was not detected during the repair.

Ethiopian Airlines Flight 302 investigation report:

Shortly after liftoff, the left Angle of Attack sensor recorded value became erroneous and the left stick shaker activated and remained active until near the end of the recording.

An automation subsystem in the 737 MAX called Maneuvering Characteristics Augmentation System (MCAS) automatically pushed the nose down in response to the AOA sensor data.

What should we take away from these?

Here I’ve given examples from aviation, but sensor-automation problems are not specific to that domain. Here are a few of my own takeaways.

We designers can’t assume sensor data will be correct

The kinds of safety automation subsystems we build in tech are pretty much always closed-loop control systems. When designing such systems in the tech world, how often have you heard someone ask, “what happens if there’s a problem with the sensor data that the system is reacting to?”

This goes back to the line of representation problem: that no agent ever gets access to the true state of the system, it only gets access to some sort of representation. The irony here is that it doesn’t just apply to humans (above the line) making sense of signals, it also applies to technical system components (below the line!) making sense of signals from other technical components.
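One common response is to have the automation cross-check redundant sensors before acting. Here is a minimal sketch of that idea; the function name, readings, and threshold are made up, and it is not drawn from any of the systems discussed above.

```python
from statistics import median
from typing import Optional, Sequence


def fuse_readings(
    readings: Sequence[float], max_disagreement: float
) -> Optional[float]:
    """Return the median reading, or None if the sensors disagree too much to trust."""
    if not readings:
        return None
    if max(readings) - min(readings) > max_disagreement:
        return None  # say "I don't know" rather than act on suspect data
    return median(readings)


print(fuse_readings([4.9, 5.1, 5.0], max_disagreement=1.0))  # sensors agree
print(fuse_readings([5.0, 22.5], max_disagreement=1.0))      # sensors disagree
```

Note what this does and doesn’t buy you. It catches one sensor drifting away from the others, but it is no help when every sensor fails the same way, as when all three airspeed indications dropped on Air France 447. And returning `None` just pushes the question to whoever has to take over, which is the same line of representation problem one level up.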

Designing a system that is safe in the face of sensor problems is hard

Again, from the NTSB report of the Learjet 60 crash:

Learjet engineering personnel indicated that the uncommanded stowage of the thrust reversers in the event of any system loss or malfunction is part of a fail-safe design that ensures that a system anomaly cannot result in a thrust reverser deployment in flight, which could adversely affect the airplane’s controllability. The design is intended to reduce the pilot’s emergency procedures workload and prevent potential mistakes that could exacerbate an abnormal situation.

The thrust reverser system behavior was designed by aerospace engineers to increase safety, and ended up making things worse! Good luck imagining all of these sorts of scenarios when you design your systems to increase safety.

Even humans struggle in the face of sensor problems

People are better equipped to handle sensor problems than automation, because we don’t seem to be able to build automation that can handle all of the possible kinds of sensor problems that might be thrown at it.

But even for humans, sensor problems are difficult. While we’ll eventually figure out what’s going on, we’ll still struggle in the face of conflicting signals, as anyone who has responded to an incident can tell you. And in high-tempo situations, where we need to respond quickly enough or something terrible will happen (like in the Air France 447 case), we simply might not be able to respond quickly enough.

Instead of focusing on building the perfect fail-safe system to prevent this next time, I wish we’d spend more time thinking about, “how can we help the human figure out what the heck is happening when the input signals don’t seem to make sense”.

Second-class interactions are a first-class risk

Below is a screenshot of Vizceral, a tool that was built by a former teammate of mine at Netflix. It provides a visualization of the interactions between the various microservices.

Vizceral uses moving dots to depict how requests are currently flowing through the Netflix microservice architecture. Vizceral is able to do its thing because of the platform tooling, which provides support for generating a visualization like this by exporting a standard set of inter-process communication (IPC) metrics.

What you don’t see depicted here are the interactions between those microservices and the telemetry platform that ingests these metrics. There’s also logging and tracing data, and those get shipped off-box via different channels, but none of those channels show up in this diagram.

In fact, this visualization doesn’t represent interactions with any of the platform services. You won’t see bubbles that represent the compute platform or the CI/CD platform represented in a diagram like this, even though those platform services all interact with these application services in important ways.

I call the first category of interactions, the ones between the application services, first-class, and the second category, the ones that involve platform services, second-class. It’s those second-class interactions that I want to say more about.

These second-class interactions tend to have a large blast radius, because successful platforms by their nature have a large blast radius. There’s a reason why there’s so much havoc out in the world when AWS’s us-east-1 region has a problem: because so many services out there are using us-east-1 as a platform. Similarly, if you have a successful platform within your organization, then by definition it’s going to see a lot of use, which means that if it experiences a problem, it can do a lot of damage.

These platforms are generally more reliable than the applications that run atop them, because they have to be: platforms naturally have higher reliability requirements than the applications that run atop them. They have these requirements because they have a large blast radius. A flaky platform is a platform that contributes to multiple high-severity outages, and systems that contribute to multiple high-severity outages are the systems where reliability work gets prioritized.

And a reliable system is a system whose details you aren’t aware of, because you don’t need to be. If my car is very reliable, then I’m not going to build an accurate mental model of how my car works, because I don’t need to: it just works. In her book Human-Machine Reconfigurations: Plans and Situated Actions, the anthropologist Lucy Suchman used the term representation to describe the activity of explicitly constructing a mental model of how a piece of technology works, and she noted that this type of cognitive work only happens when we run into trouble. As Suchman puts it:

[R]epresentation occurs when otherwise transparent activity becomes in some way problematic

Hence the irony: these second-class interactions tend not to be represented in our system models when we talk about reliability, because they are generally not problematic.

And so we are lulled into a false sense of security. We don’t think about how the plumbing works, because the plumbing just works. Until the plumbing breaks. And then we’re in big trouble.

Book Review: Trust in Numbers

Trust in Numbers: The Pursuit of Objectivity in Science and Public Life by Theodore Porter, Distinguished Professor Emeritus of History, UCLA.

There are two general approaches to decision-making. One way is to make a judgment call. Informally, you could call this “trusting your gut”. Formally, you could describe this as a subjective, implicit process. The other way is to use an explicit approach that relies on objective, quantitative data, for example, doing a return-on-investment (ROI) calculation on a proposed project to decide whether to undertake the project. We use the term rigorous to describe these types of approaches, and we generally regard them as superior.

Here, Porter argues that quantitative, rigorous decision-making in a field is not a sign of its maturity, but rather its political weakness. In fields where technical professionals enjoy a significant amount of trust, these professionals do decision-making using personal judgment. While professionals will use quantitative data as input, their decisions are ultimately based on their own subjective impressions. (For example, see Julie Gainsburg’s notion of skeptical reverence in The Mathematical Disposition of Structural Engineers). In Porter’s account, we witnessed an increase of rigorous decision-making approaches in the twentieth century because of a lack of trust in certain professional fields, not because the quantitative approaches yielded better results.

It’s only in fields where the public does not grant deference to professionals that they are compelled to use explicit, objective processes to make the decisions. They are forced to show their work in a public way because they aren’t trusted. In some cases, a weak field adopts rigor to strengthen itself in the eyes of the public, such as experimental psychology’s adoption of experimental rigor (in particular, ESP research). Most of the case studies in the book come from areas where a field was compelled to adopt objective approaches because there was explicit political pressure and the field did not have sufficient power to resist.

In some cases, professionals did have the political clout to push back. An early chapter of the book discusses a problem that the British parliament wrestled with in the late nineteenth century: unreliable insurance companies that would happily collect premiums but then would eventually fail and would hence be unable to pay out when their customers submitted claims. A parliamentary committee formed and heard testimony from actuaries about how the government could determine whether an insurance company was sound. The experienced actuaries from reputable companies argued that it was not possible to define an objective procedure for assessing a company. They insisted that “precision is not attainable through actuarial methods. A sound company depends on judgment and discretion.” They were concerned that a mechanical, rule-based approach wouldn’t work:

Uniform rules of calculation, imposed by the state, might yield “uniform errors.” Charles Ansell, testifying before another select committee a decade earlier, argued similarly, then expressed his fear that the office of government actuary would fall to “some gentlemen of high mathematical talents, recently removed from one of our Universities, but without any experience whatever, though of great mathematical reputation.” This “would not qualify him in any way whatever for expressing a sound opinion on a practical point like that of the premiums in a life assurance.”
Trust in Numbers, pp108-109

Porter tells a similar story about American accountants. To stave off having standardized rules imposed on them, the American Institute of Accountants defined standards for its members, but these were controversial. One accountant, Walter Wilcox, argued in 1941 that “Cost is not a simple fact, but is a very elusive concept… Like other aspects of accounting, costs give a false impression of accuracy.” Similarly, when it came to government-funded projects, the political pressure was simply too strong to defer to government civil engineers, such as the French civil engineers who had to help decide which rail projects should be funded, or the U.S. Army Corps of Engineers who had to help make similar decisions about waterway projects such as dams and reservoirs. In the U.S., they settled on a cost-benefit analysis process, where the return on investment had to exceed 1.0 in order to justify a project. But, unsurprisingly, there were conflicts over how benefits were quantified, as well as over how to classify costs. While the output may have been a number, and the process was ostensibly objective, because it needed to be, ultimately these numbers were negotiable and assessments changed as a function of political factors.

In education, teachers were opposed to standardized testing, but did not have the power to overcome it. On the other hand, doctors were able to retain the use of their personal judgment for diagnosing patients. However, the regulators had sufficient power that they were able to enforce the use of objective measures for evaluating drugs, and hence were able to oversee some aspect of medical practice.

This tug of war between rigorous, mechanical objectivity and élite professional autonomy continues to this day. Professionals say “This requires private knowledge; trust us”. Sometimes, the public says “We don’t trust you anymore. Make the knowledge public!”, and the professionals have no choice but to relent. On the subject of whether we are actually better off when we trade away judgment for rigor, Porter is skeptical. I agree.

Consistency

“Welcome aboard to BigCo!”

“Thanks! I’m excited to be here. This is my first tech job, even if it is just an internship.”

“We’re going to start you off with some automated testing. You’re familiar with queues, right?”

“The data structure? Sure thing. First in, first out.”

“Great! We need some help validating that our queueing module is always working properly. We have a bunch of test scenarios written, and we need someone to check that the observed behavior of the queue is correct.”

“So, for input, do I get something like a history of interactions with the queue? Like this?”

q.add("A") -> OK
q.add("B") -> OK
q.pop() -> "A"
q.add("C") -> OK
q.pop() -> "B"
q.pop() -> "C"

“Exactly! That’s a nice example of a correct history for a queue. Can you write a program that takes a history like that as input and returns true if it’s a valid history?”

“Sure thing.”
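As an aside, here is a minimal sketch of what such a checker might look like. It assumes a history is given as a list of (operation, value) tuples, where the value is the argument for an add and the returned value for a pop. The tuple format and the name is_valid_history are my own illustrative choices, not part of any real BigCo API.

```python
from collections import deque

def is_valid_history(history):
    """Return True if a sequential history is consistent with FIFO queue behavior.

    Each entry is ("add", value) or ("pop", returned_value).
    """
    model = deque()  # reference model of what the queue should contain
    for op, value in history:
        if op == "add":
            model.append(value)
        elif op == "pop":
            # A pop must return the oldest element still in the queue.
            if not model or model.popleft() != value:
                return False
        else:
            raise ValueError(f"unknown operation: {op}")
    return True

# The correct history shown earlier in the conversation.
good = [("add", "A"), ("add", "B"), ("pop", "A"),
        ("add", "C"), ("pop", "B"), ("pop", "C")]
# A hypothetical history where elements come out in the wrong order.
bad = [("add", "A"), ("add", "B"), ("add", "C"),
       ("pop", "A"), ("pop", "C"), ("pop", "B")]

print(is_valid_history(good), is_valid_history(bad))
```

The checker simply replays the history against a reference queue, which only works because the history is sequential: every operation finishes before the next one starts.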

“Excellent. We’ll also need your help generating new test scenarios.”

A few days later

“I think I found a scenario where the queue is behaving incorrectly when it’s called by a multithreaded application. I got a behavior that looks like this:”

q.add("A") -> OK
q.add("B") -> OK
q.add("C") -> OK
q.pop() -> "A"
q.pop() -> "C"
q.pop() -> "B"

“Hmmm. That’s definitely incorrect behavior. Can you show me the code you used to generate the behavior?”

“Sure thing. I add the elements to the queue in one thread, and then I spawn a bunch of new threads and dequeue in the new threads. I’m using the Python bindings to call the queue. My program looks like this.”

from bigco import Queue
from threading import Thread

def pop_and_print(q):
    val = q.pop()
    print(val)

q = Queue()
q.add("A")
q.add("B")
q.add("C")

# start() runs the target in a new thread; run() would call it in the current thread
Thread(target=pop_and_print, args=[q]).start()
Thread(target=pop_and_print, args=[q]).start()
Thread(target=pop_and_print, args=[q]).start()

“And the output looked like this:”

A
C
B

“Well, that’s certainly not the order I expect the output to be printed in, but how do you know the problem is that the queue is actually behaving incorrectly? It might be that the values were dequeued in the correct order, but because of the way the threads are scheduled, the print statements were simply executed in a different order than you expect.”

“Hmmm. I guess you’re right: just looking at the order of the printed output doesn’t give me enough information to tell if the queue is behaving correctly or not. Let me try printing out the thread ids and the timestamps.”

[id0] [t=1] before pop
[id0] [t=2] after pop
[id0] [t=3] output: A
[id1] [t=4] before pop
[id2] [t=5] before pop
[id2] [t=6] after pop
[id2] [t=7] output: C
[id1] [t=8] after pop
[id1] [t=9] output: B

“Oh, I see what happened! The operations of thread 1 and thread 2 were interleaved! I didn’t think about what might happen in that case. It must have been something like this:”

[id0]                  [id1]                  [id2]
q.pop()->"A"
print("A")
                       q.pop()->"B"
                                              q.pop()->"C"
                                              print("C")
                       print("B")

“Well, it looks like the behavior is still correct, the items got dequeued in the expected order, it’s just that they got printed out in a different order.”

The next day

“After thinking through some more multithreaded scenarios, I ran into a weird situation that I didn’t expect. It’s possible that the “pop” operations overlap in time across two different threads. For example, “pop” might start on thread 1, and then, in the middle of that operation, the operating system schedules thread 2, which starts its own pop before the first one has finished.”


[id0]             [id1]                  [id2]
q.pop(): start
q.pop(): end
print("A")
                  q.pop(): start
                  |                      q.pop(): start
                  q.pop(): end           |
                                         q.pop(): end
                                         print("C")
                  print("B")

“Let’s think about this. If id1 and id2 overlap in time like this, what do you think the correct output should be? ‘ABC’ or ‘ACB’?”

“I have no idea. I guess we can’t say anything!”

“So, if the output was ‘ABB’, you’d consider that valid?”

“Wait, no… It can’t be anything. It seems like either ‘ABC’ or ‘ACB’ should be valid, but not ‘ABB’.”

“How about ‘BCA’? Would that be valid here?”

“No, I don’t think so. There’s no overlap between the first pop operation and the others, so it feels like the pop in id0 should return ‘A’.”

“Right, that makes sense. So, in a concurrent world, we have potentially overlapping operations, and that program you wrote that checks queue behaviors doesn’t have any notion of overlap in it. So we need to be able to translate these potentially overlapping histories into the kind of sequential history your program can handle. Based on this conversation, we can use two rules:

1. If two operations don’t overlap (like the pop in id0 and the pop in id1) in time, then we use the time ordering (id0 happened before id1).

2. If two operations do overlap in time, then either ordering is valid.

“So, that means that when I check whether a multithreaded behavior is valid, I need to actually know the time overlap of the operations, and then generate multiple possible sequential behaviors, and check to see if the behavior that I witnessed corresponds to one of those?”

“Yes, exactly. This is a consistency model called linearizability. If our queue has linearizable consistency, that means that for any behavior you witness, you can define a linearization, an equivalent sequential behavior. Here’s an example.”

[id0]             [id1]                  [id2]
q.add("A")
q.add("B")
q.add("C")

q.pop(): start
q.pop()->"A"
                  q.pop(): start
                  |                      q.pop(): start
                  |                      q.pop()->"C"
                  q.pop()->"B"

“The question is: can we generate a linearization based on the two rules above? We can! Because the “id1” and “id2” operations overlap, we can generate a linearization where the “id1” operation happens first. One way to think about it is to identify a point in time between the start and end of the operation and pretend that’s when the operation really happens. I’ll mark these points in time with an ‘x’ in the diagram.

[id0]             [id1]                  [id2]
q.add("A")
q.add("B")
q.add("C")

q.pop(): start
x
q.pop()->"A"
                  q.pop(): start
                                         q.pop(): start
                  x
                                         x
                                         q.pop()->"C"
                  q.pop()->"B"

“Now we can rewrite this as a linear history.”

q.add("A")
q.add("B")
q.add("C")
q.pop()->"A"
q.pop()->"B"
q.pop()->"C"
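The two rules can be turned into a brute-force checker. This is a minimal sketch under my own assumptions: each operation is a (kind, value, start, end) tuple with hypothetical integer timestamps, and the function names are illustrative. It tries every ordering of the operations, discards the ones that contradict real-time order, and asks whether any survivor is a valid sequential queue history.

```python
from collections import deque
from itertools import permutations

def is_sequentially_valid(order):
    """Replay (kind, value, start, end) ops, in the given order, against a FIFO model."""
    model = deque()
    for kind, value, _start, _end in order:
        if kind == "add":
            model.append(value)
        elif not model or model.popleft() != value:
            return False
    return True

def respects_real_time(order):
    """Rule 1: an op that returned before another started must not be placed after it."""
    return all(
        not (later[3] < earlier[2])  # later.end < earlier.start would be a violation
        for i, earlier in enumerate(order)
        for later in order[i + 1:]
    )

def is_linearizable(ops):
    """Rule 2, by brute force: overlapping ops may go either way, so try every ordering."""
    return any(
        respects_real_time(order) and is_sequentially_valid(order)
        for order in permutations(ops)
    )

# The history from the diagram: the pops returning "B" and "C" overlap in time.
overlapping = [("add", "A", 0, 1), ("add", "B", 2, 3), ("add", "C", 4, 5),
               ("pop", "A", 6, 7), ("pop", "B", 8, 11), ("pop", "C", 9, 10)]
# Same return values, but no overlap: "C" is returned strictly before "B".
sequential = [("add", "A", 0, 1), ("add", "B", 2, 3), ("add", "C", 4, 5),
              ("pop", "A", 6, 7), ("pop", "C", 8, 9), ("pop", "B", 10, 11)]

print(is_linearizable(overlapping), is_linearizable(sequential))
```

Trying all permutations is factorial in the number of operations, so this only works for tiny histories. I chose it because it mirrors the two rules directly, not because it is how a real checker would be built.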

Going distributed

“We’re expanding our market. We’re building on our queue technology to build a distributed queue. We’re also providing a new operation: “get”. When you call “get” on a distributed queue, you get the entire contents of the queue, in queue order.”

“Oh, so a valid history would be something like this?”

q.add("A") 
q.add("B")
q.get() -> ["A","B"]
q.add("C")
q.get() -> ["A","B","C"]

“Exactly! One use case we’re targeting is using our queue for implementing online chat, so the contents of a queue might look like this:”

["Alice: How are you doing?",
 "Bob: I'm fine, Alice. How are you?",
 "Alice: I'm doing well, thank you."]

CAPd

“OK, I did some testing with the distributed queue and ran into a problem. Look at this history: it’s definitely wrong. Note that the ids here are process ids, not thread ids, because we’re running on different machines.


[id0]                         [id1]
q.add("Alice: Hello"): start
q.add(...) -> OK
                              q.add("Bob: Hi"): start
                              q.add(...)->OK
                              q.get(): start
                              q.get()-> ["Bob: Hi"]

“When process 1 called ‘get’, it didn’t see the “Alice: Hello” entry, and that operation completed before the ‘get’ started! This history isn’t linearizable!”

“You’re right, our distributed queue isn’t linearizable. Note that we could modify this history to make it linearizable if process 0’s add operation did not complete until after the get:

[id0]                         [id1]
q.add("Alice: Hello"): start

                              q.add("Bob: Hi"): start
                              q.add(...) -> OK
                              q.get(): start
                              q.get()-> ["Bob: Hi"]
q.add(...) -> OK

“Now we can produce a valid linearization from the history:”

q.add("Bob: Hi")
q.get()->["Bob: Hi"]
q.add("Alice: Hello")

“But look what we had to do: we had to delay the completion of that add operation. This is the lesson of the CAP theorem: if you want your distributed object to have linearizable consistency, then some operations might take an arbitrarily long time to complete. With our queue, we decided to prefer availability, so that all operations are guaranteed to complete within a certain period of time. Unfortunately, once we give up on linearizability, things can get pretty weird. Let’s see how many different types of weird things you can find.”

Monotonic reads

“Here’s a weird one. The ‘Hi’ message disappeared in the second read!”

[id0]              [id1]                  [id2]
                   q.add("A: Hello")
                                         q.add("B: Hi")
q.get()->["A: Hello", "B: Hi"]
q.get()->["A: Hello"]

“Yep, this violates a property called monotonic reads. Once process 0 has seen the effect of the add(“B: Hi”) operation, we expect that it will always see it in the future. This is an example of a session property. If the two gets happened on two different processes, this would not violate the monotonic reads property. For example, the following history doesn’t violate monotonic reads, even though the operations and ordering are the same. That’s because one of the gets is in process 0, and the other is in process 1, and the monotonic reads property only applies to reads within the same process.

[id0]              [id1]                  [id2]
                   q.add("A: Hello")
                                         q.add("B: Hi")
q.get()->["A: Hello", "B: Hi"]
                   q.get()->["A: Hello"]

“All right, let’s say we can guarantee monotonic reads. What other kinds of weirdness happen?”
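As an aside, the monotonic reads property is easy to check mechanically for this chat-log queue. The sketch below is mine and rests on two assumptions: messages are unique, and nothing is ever removed from the queue (we only add and get). It takes the results of the get calls from a single session, in order.

```python
def violates_monotonic_reads(reads):
    """reads: the get() results observed by ONE session, in session order."""
    seen = set()
    for result in reads:
        if not seen <= set(result):
            # Something this session saw earlier has vanished.
            return True
        seen |= set(result)
    return False

# Process 0 in the first history: "B: Hi" disappears on the second read.
print(violates_monotonic_reads([["A: Hello", "B: Hi"], ["A: Hello"]]))
# A session whose view only grows is fine.
print(violates_monotonic_reads([["A: Hello"], ["A: Hello", "B: Hi"]]))
```

Because the check looks at one session at a time, the second history above passes: each process there performs only one read.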

Read your writes

[id0]
q.add("A: Hello")
q.get() -> []

“Read your writes is one of the more intuitive consistency properties. If a process writes data, and then does a read, it should be able to see the effect of the write. Here we did a write, but we didn’t see it.”
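A sketch of a per-session check for this property, again assuming unique messages and a queue that only supports add and get. The (kind, payload) encoding of operations is my own illustrative choice.

```python
def violates_read_your_writes(session_ops):
    """session_ops: one session's operations in order.

    Each entry is ("add", message) or ("get", list_of_messages_returned).
    """
    written = set()
    for kind, payload in session_ops:
        if kind == "add":
            written.add(payload)
        elif kind == "get" and not written <= set(payload):
            # The read is missing something this session already wrote.
            return True
    return False

print(violates_read_your_writes([("add", "A: Hello"), ("get", [])]))
print(violates_read_your_writes([("add", "A: Hello"), ("get", ["A: Hello"])]))
```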

Writes follow reads

[id0]
q.get() -> []
q.get() -> ["A: Hello"]
q.add("A: Hello")

“Here’s a case where read-your-writes isn’t violated (in fact, we don’t do any reads after the write), but something very strange has happened. We saw the effect of our write before we actually did the write! This violates the writes follow reads property. This is also called session causality, and you can see why: when it was violated, we saw the effect before the cause!”

Monotonic writes

[id0]                      [id1]
q.add("A: Hi there!")
q.add("A: How are you?")
                           q.get() -> ["A: How are you?"]

“Hey, process 1 saw the ‘How are you?’ but not the ‘Hi there!’, even though they both came from process 0.”

“Yep. It’s weird that process 1 saw the second write from process 0, but it didn’t see the first write. This violates the monotonic writes property. Note that if the two writes were from different processes, this would not violate the property. For example, this would be fine:

[id0]                      [id1]
q.add("A: Hi there!")
                           q.add("A: How are you?")
                           q.get() -> ["A: How are you?"]

Consistent prefix

[id0]              [id1]
q.add("A: Hello")
                   q.add("B: Hi")
                   q.get()->["B: Hi"]
                   q.get()->["A: Hello", "B: Hi"]

“From process 1’s perspective, it looks like the history of the chat log changed! Somehow, ‘A: Hello’ snuck in before ‘B: Hi’, even though process 1 had already seen ‘B: Hi’.”

“Yes, this violates a property called consistent prefix. Note that this is different from monotonic reads, which is not violated in this case. (Sadly, the Jepsen consistency page doesn’t have an entry for consistent prefix).”
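One simplified way to read consistent prefix for this queue: every get should return a prefix of a single agreed-upon sequence of messages. Under that reading (which is my own simplification, and assumes nothing is ever removed from the queue), any two reads must be prefix-compatible, and a check can be sketched like this:

```python
def reads_share_consistent_prefix(reads):
    """reads: get() results from any sessions; order of the outer list does not matter."""
    ordered = sorted(reads, key=len)
    # If each read is a prefix of the next-longer one, all reads are prefixes
    # of the longest read.
    return all(
        longer[:len(shorter)] == shorter
        for shorter, longer in zip(ordered, ordered[1:])
    )

# The history above: "A: Hello" sneaks in ahead of "B: Hi".
print(reads_share_consistent_prefix([["B: Hi"], ["A: Hello", "B: Hi"]]))
# A log that only grows at the end is fine.
print(reads_share_consistent_prefix([["A: Hello"], ["A: Hello", "B: Hi"]]))
```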

Reasoning about correctness in a distributed world

One way to think about what it means for a data structure implementation to be correct is to:

1. Define what it means for a particular execution history to be correct.
2. Check that every possible execution history for the implementation satisfies this correctness criterion.

Step 2 requires doing a proof, because in general there are too many possible execution histories for us to check exhaustively. But, even if we don’t actually go ahead and do the formal proof, it’s still useful to think through step 1: what it means for a particular execution history to be correct.

As we move from sequential data structures to concurrent (multithreaded) ones and then distributed ones, things get a bit more complicated.

Recall that for the concurrent case, in order to check that a particular execution history was correct, we had to see if we could come up with a linearization. We had to try and identify specific points in time when operations took effect to come up with a sequential version of the history that met our sequential correctness criteria.

In Principles of Eventual Consistency, Sebastian Burckhardt proposed a similar type of approach for validating the execution history of a distributed data structure. (This is the approach that Viotti & Vukolic extended. Kyle Kingsbury references Viotti and Vukolic on the Jepsen consistency models page that I’ve linked to several times here).

Execution histories as a set of events

To understand Burckhardt’s approach, we first have to understand how he models a distributed data structure execution history. He models an execution history as a set of events, where each event has associated with it:

The operation (including arguments), e.g.:
- get()
- add(“Hi”)
A return value, e.g.
- [“Hi”, “Hello”]
- OK

He also defines two relations on these events, returns-before and same-session.

Returns-before

The returns-before (rb) relation models time. If there are two events, e1, e2, and (e1,e2) is in rb, that means that the operation associated with e1 returned before the operation associated with e2 started.

Let’s take this example, where the two add operations overlap in time:

[id0]              [id1]                  [id2]
                   add("A: Hello"):start
                   |                      add("B: Hi"):start
                   |                      add("B: Hi"):end
                   add("A: Hello"):end

 get()->["A: Hello", "B: Hi"]
                   get()->["A: Hello"]

I’ll use the following labeling for the events:

e1: add(“A: Hello”)
e2: add(“B: Hi”)
e3: get() -> [“A: Hello”, “B: Hi”]
e4: get() -> [“A: Hello”]

Here, rb={(e1,e3), (e1,e4),(e2,e3),(e2,e4),(e3,e4)}

Note that neither (e1,e2) nor (e2,e1) is in rb, because the two operations overlap in time. Neither one happens before the other.
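To make the definition concrete, here is a small sketch that derives rb from start and end times. The timestamps are hypothetical values I picked to match the shape of the diagram; only their relative order matters.

```python
# Hypothetical (start, end) timestamps consistent with the diagram.
intervals = {
    "e1": (1, 4),  # add("A: Hello"), overlaps with e2
    "e2": (2, 3),  # add("B: Hi")
    "e3": (5, 6),  # get() -> ["A: Hello", "B: Hi"]
    "e4": (7, 8),  # get() -> ["A: Hello"]
}

def returns_before(intervals):
    """(a, b) is in rb when a returned strictly before b started."""
    return {
        (a, b)
        for a, (_, a_end) in intervals.items()
        for b, (b_start, _) in intervals.items()
        if a_end < b_start
    }

rb = returns_before(intervals)
print(sorted(rb))
```

The overlapping adds e1 and e2 produce no pair in either direction, which is exactly why rb is only a partial order.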

Same-session

The same-session (ss) relation models the grouping of operations into processes. In the example above, there are three sessions (id0, id1, id2), and the same-session relation looks like this: ss={(e1,e1),(e1,e4),(e4,e1),(e4,e4),(e2,e2),(e3,e3)}. (Note: in this case, there are only two operations that are in the same session, e1 and e4.)

This is what the graph looks like with the returns-before (rb) and same-session (ss) relationship shown.

Explaining executions with visibility and arbitration

Here’s the idea behind Burckhardt’s approach. He defines consistency properties in terms of the returns-before (rb) relation, the same-session (ss) relation, and two other binary relations called visibility (vis) and arbitration (ar).

For example, an execution history satisfies read my writes if: (rb ∩ ss) ⊆ vis
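With relations represented as Python sets of pairs, that condition is a one-line subset check. The sketch below reuses the four events from the returns-before example. The session_of mapping and the two candidate vis relations are my own illustrative inputs.

```python
# Which session each event belongs to, per the earlier diagram.
session_of = {"e1": "id1", "e2": "id2", "e3": "id0", "e4": "id1"}

rb = {("e1", "e3"), ("e1", "e4"), ("e2", "e3"), ("e2", "e4"), ("e3", "e4")}

# Same-session: every pair of events (including an event with itself) in one session.
ss = {(a, b) for a in session_of for b in session_of
      if session_of[a] == session_of[b]}

def read_my_writes(rb, ss, vis):
    """(rb ∩ ss) ⊆ vis: earlier operations in my own session must be visible to me."""
    return (rb & ss) <= vis

vis_ok = {("e1", "e3"), ("e2", "e3"), ("e1", "e4")}
vis_bad = {("e1", "e3"), ("e2", "e3")}  # e1 is not visible to e4

print(read_my_writes(rb, ss, vis_ok), read_my_writes(rb, ss, vis_bad))
```

Here rb ∩ ss contains only (e1,e4), so the property boils down to whether e4 can see the add from its own session.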

In this scheme, an execution history is correct if we can come up with visibility and arbitration relations for the execution such that:

1. All of the consistency properties we care about are satisfied by our visibility and arbitration relations.
2. Our visibility and arbitration relations don’t violate any of our intuitions about causality.

You can think of coming up with visibility and arbitration relations for a history as coming up with an explanation for how the history makes sense. It’s a generalization of the process we used for linearizability where we picked a specific point in time where the operation took effect.

(1) tells us that we have to pick the right vis and ar (i.e., we have to pick a good explanation). (2) tells us that we don’t have complete freedom in picking vis and ar (i.e., our explanations have to make intuitive sense to human beings).

You can think of the visibility relation as capturing which write operations were visible to a read, and the arbitration relation as capturing how the data structure should reconcile conflicting writes.

Specifying behavior based on visibility and arbitration

Unfortunately, in a distributed world, we can no longer use the sequential specification for determining correct behavior. In the sequential world, writes are always totally ordered, but in the distributed world, we might have to deal with two different writes that aren’t ordered in a meaningful way.

For example, consider the following behavior:

    [id0]              [id1]                  [id2]
e1. add("A")
e2.                   add("B")
e3.                                          get()->???

What’s a valid value for ??? here? Let’s assume we’ve been told that: vis={(e1,e3),(e2,e3)}. This means that both writes are visible to e3, the get in process 2.

Based on our idea of how this data structure should work, e3 should either be: [“A”,”B”] or [“B”,”A”]. But the visibility relationship doesn’t provide enough information to tell us which one of these it was. We need some additional information to determine what the behavior should be.

This is where the arbitration relation comes in. This relation is always a total ordering. (For example, if ar specifies an ordering of e1->e2->e3, then the relation would be {(e1,e2),(e1,e3),(e2,e3)}.)

If we define the behavior of our distributed queue such that the writes should happen in arbitration order, and we set ar=e1->e2->e3, then e3 would have to be get()->[“A”,”B”].

Let’s look at a few examples:

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["B","A"]
e4.                    get()->["B","A"]

The above history is valid: we can choose vis={(e1,e3),(e2,e3),(e1,e4),(e2,e4)} and ar=e2->e1->e3->e4.

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["A","B"]
e4.                    get()->["B","A"]

The above history is invalid, because there are no arbitration and visibility relations we can come up with that can explain both e3 and e4.

    [id0]              [id1]
e1. add("A")
e2.                    add("B")
e3. get()->["A"]
e4.                    get()->["B","A"]

The above history is valid, because we can choose: vis={(e1,e3),(e1,e4),(e2,e4)}, ar=e2->e1->e3->e4. Note that even though (e2,e3) is in ar, e2 is not visible to e3, and an operation only has to reflect the visible writes.
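Looking for an explanation can itself be mechanized for small histories. The sketch below is a simplification of my own: it arbitrates only the writes (the reads are left out of the ordering), assumes every written value is unique, and ignores the causality constraints discussed next. For each candidate arbitration order, it checks whether every get result can be produced by taking some subset of the writes in that order.

```python
from itertools import permutations

def explain(writes, reads):
    """Search for an arbitration order of the writes that explains every read.

    writes: {event: value added}; reads: {event: list returned by get()}.
    Returns (ar, vis) where vis maps each read to its visible writes, or None.
    """
    for ar in permutations(writes):
        vis = {}
        for read, observed in reads.items():
            # The only writes that can be visible are those whose values were observed.
            visible = [w for w in ar if writes[w] in observed]
            if [writes[w] for w in visible] != observed:
                break  # this arbitration order cannot produce this read
            vis[read] = visible
        else:
            return ar, vis
    return None

writes = {"e1": "A", "e2": "B"}

# Both reads agree on ["B","A"]: explainable with e2 arbitrated before e1.
print(explain(writes, {"e3": ["B", "A"], "e4": ["B", "A"]}))
# The reads disagree on the order: no single arbitration order works.
print(explain(writes, {"e3": ["A", "B"], "e4": ["B", "A"]}))
```

The first call finds an arbitration order with e2 ahead of e1, and the second returns None, matching the valid and invalid verdicts for the first two histories.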

People don’t like it when you violate causality

Remember the example from “writes follow reads”?

[id0]
e1. q.get() -> []
e2. q.get() -> ["A: Hello"]
e3. q.add("A: Hello")

Note that we can come up with valid vis and ar relations for this history:

vis = {(e3,e2)}
ar = e1->e3->e2

But, despite the fact that we can come up with an explanation for this history, it doesn’t make sense to us, because e3 happened after e2. You can see why this is also referred to as session causality, because it violates our sense of causality: we read a write that happened in the future!

This is a great example of one of the differences between programming and formal modeling. It’s impossible to write a non-causal program (i.e., a program whose current output depends on future inputs). On the other hand, in formal modeling, we have no such restrictions, so we can always propose “impossible to actually happen in practice” behaviors to look at. So we often have to place additional constraints on the behaviors we generate with formal models to ensure that they’re actually realizable.

Sometimes we do encounter systems that record history in the wrong order, which makes the history look non-causal.

History is sometimes re-ordered in such a way that it looks like causality has been violated

Consistency as constraints on relations

The elegant thing about this relation-based model of execution histories is that the consistency models can be expressed in terms of them. Burckhardt conveniently defines two more relationships.

Session-order (so) is the ordering of events within each session, expressed as: so = rb ∩ ss

Happens-before (hb) is a causal ordering, in the sense of Lamport’s Time, Clocks, and the Ordering of Events in a Distributed System paper. (e1,e2) is in hb if (e1,e2) is in so (i.e., e1 comes before e2 in the same session), or if (e1,e2) is in vis (i.e., e1 is visible to e2), or if there’s some transitive relationship (e.g., there’s some e3 such that (e1,e3) and (e3,e2) are in so or vis).

Therefore, happens-before is the transitive closure of so ∪ vis, which we write as: hb = (so ∪ vis)⁺ . We can define no circular causality as no cycles in the hb relation or, as Burckhardt writes it: NoCircularCausality = acyclic(hb)
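Here is a small sketch of that check, applied to the “writes follow reads” history from earlier. Relations are sets of pairs, and the helper names are my own. A relation has a cycle exactly when its transitive closure relates some event to itself.

```python
def transitive_closure(relation):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(relation)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

def no_circular_causality(so, vis):
    """acyclic(hb), where hb is the transitive closure of so ∪ vis."""
    hb = transitive_closure(so | vis)
    return all(a != b for (a, b) in hb)

# Session order for the single-session history e1 -> e2 -> e3.
so = {("e1", "e2"), ("e1", "e3"), ("e2", "e3")}
# The "explanation" that made the add (e3) visible to the earlier get (e2).
vis = {("e3", "e2")}

print(no_circular_causality(so, vis))    # e2 -> e3 -> e2 is a cycle
print(no_circular_causality(so, set()))  # session order alone has no cycle
```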

If you made it all of the way here, I’d encourage you to check out Burckhardt’s Principles of Eventual Consistency book. You can get the PDF for free by clicking the “Publication” button on the web page.

For want of a dollar

Back in August, The New York Times ran a profile of Morris Chang, the founder of TSMC.

It’s hard to overstate the role that this Taiwan-based semiconductor company plays in the industry. If you search for articles about it, you’ll see headlines like TSMC: The Most Important Tech Company You Never Heard Of and TSMC: how a Taiwanese chipmaker became a linchpin of the global economy.

What struck me in the NY Times article was this anecdote about Chang’s search for a job after he failed out of a Ph.D. program at MIT in 1955 (emphasis mine):

Two of the best offers arrived from Ford Motor Company and Sylvania, a lesser-known electronics firm. Ford offered Mr. Chang $479 a month for a job at its research and development center in Detroit. Though charmed by the company’s recruiters, Mr. Chang was surprised to find the offer was $1 less than the $480 a month that Sylvania offered.

When he called Ford to ask for a matching offer, the recruiter, who had previously been kind, turned hostile and told him he would not get a cent more. Mr. Chang took the engineering job with Sylvania. There, he learned about transistors, the microchip’s most basic component.

“That was the start of my semiconductor career,” he said. “In retrospect, it was a damn good thing.”

The course of history changed because an internal recruiter at Ford refused to offer him an additional dollar a month ($11.46 in 2023 dollars) to match a competing offer!

This is the sort of thing that historians call contingency.

Oddly influenced podcast

Brian Marick recently interviewed me about resilience engineering on his Oddly Influenced podcast. I’m pretty happy with how it turned out.

My SREcon 23 talk is up

The talk I gave at SREcon 23 Americas is now available for your viewing pleasure:

See also: an earlier post about my presentation style, which used this talk as an example.

Missing the forest for the trees: the component substitution fallacy

Here’s a brief excerpt from a talk by David Woods on what he calls the component substitution fallacy (emphasis mine):

claim of root cause is ex. of component substitution fallacy. All incidents that threaten failure reveal component weaknesses due to finite resources & tradeoffs -> easy to miss the critical systemic/emergent factors see min 25 https://t.co/OsYy2U8fsA
— David Woods (@ddwoods2) January 16, 2023

Everybody is continuing to commit the component substitution fallacy.

Now, remember, everything has finite resources, and you have to make trade-offs. You’re under resource pressure, you’re under profitability pressure, you’re under schedule pressure. Those are real, they never go to zero.

So, as you develop things, you make trade offs, you prioritize some things over other things. What that means is that when a problem happens, it will reveal component or subsystem weaknesses. The trade offs and assumptions and resource decisions you made guarantee there are component weaknesses. We can’t afford to perfect all components.

Yes, improving them is great and that can be a lesson afterwards, but if you substitute component weaknesses for the systems-level understanding of what was driving the event … at a more fundamental level of understanding, you’re missing the real lessons.

Seeing component weaknesses is a nice way to block seeing the system properties, especially because this justifies a minimal response and avoids any struggle that systemic changes require.
Woods on Shock and Resilience (25:04 mark)

Whenever an incident happens, we’re always able to point to different components in our system and say “there was the problem!” There was a microservice that didn’t handle a certain type of error gracefully, or there was bad data that had somehow gotten past our validation checks, or a particular cluster was under-resourced because it hadn’t been configured properly, and so on.

These are real issues that manifested as an outage, and they are worth spending the time to identify and follow up on. But these problems in isolation never tell the whole story of how the incident actually happened. As Woods explains in the excerpt of his talk above, because of the constraints we work under, we simply don’t have the time to harden the software we work on to the point where these problems don’t happen anymore. It’s just too expensive. And so, we make tradeoffs, we make judgments about where to best spend our time as we build, test, and roll out our stuff. The riskier we perceive a change, the more effort we’ll spend on validation and rollout of the change.

And so, if we focus only on issues with individual components, there’s so much we miss about the nature of failure in our systems. We miss looking at the unexpected interactions between the components that enabled the failure to happen. We miss how the organization’s prioritization decisions enabled the incident in the first place. We also don’t ask questions like “if we are going to do follow-up work to fix the component problems revealed by this incident, what are the things that we won’t be doing because we’re prioritizing this instead?” or “what new types of unexpected interactions might we be creating by making these changes?” Not to mention incident-handling questions like “how did we figure out something was wrong here?”

In the wake of an incident, if we focus only on the weaknesses of individual components, then we won’t see the systemic issues. And it’s the systemic issues that will continue to bite us long after we’ve implemented all of those follow-up action items. We’ll never see the forest for the trees.

My talk from the LFI conference

My talk from the recent Learning from Incidents in Software conference is now up.

Unfortunately, the first few minutes were lost due to technical issues. You’ll just have to take my word for it that the missing part of my talk was truly astounding, a veritable tour de force.

My slides are also available for download.