CrowdStrike has released their final (sigh) External Root Cause Analysis doc. The writeup contains some more data on the specific failure mode. I’m not going to summarize it here, mostly because I don’t think I’d add any value in doing so: my knowledge of this system is no better than that of anyone else reading the report. I must admit, though, that I couldn’t help thinking of number eleven of Alan Perlis’s epigrams on programming.
If you have a procedure with ten parameters, you probably missed some.
What I wanted to do instead with this blog is call out the last two of the “findings and mitigations” in the doc:
Template Instance validation should expand to include testing within the Content Interpreter
Template Instances should have staged deployment
This echoes the chorus of responses I heard online in the aftermath of the outage. “Why didn’t they test these configs before deployment? How could they not stage their deploys?”
And this is my biggest disappointment with this writeup: it doesn’t provide us with insight into how the system got to this point.
Here are the types of questions I like to ask to try to get at this.
Had a rapid response content update ever triggered a crash before in the history of the company? If not, why do you think this type of failure (crash related to rapid response content) has never bitten the company before? If so, what happened last time?
Was there something novel about the IPC template type? (e.g., was this the first time the reading of one field was controlled by the value of another?)
How is generation of the test template instances typically done? Was the test template instance generation here a typical case or an exceptional one? If exceptional, what was different? If typical, how come it has never led to problems before?
Before the incident, had customers ever asked for the ability to do staged rollouts? If so, how was that ask prioritized relative to other work?
Was there any planned work to improve reliability before the incident happened? What type of work was planned? How far along was it? How did you prioritize this work?
I know I’m a broken record here, but I’ll say it again. Systems reach the current state that they’re in because, in the past, people within the system made rational decisions based on the information they had at the time, and the constraints that they were operating under. The only way to understand how incidents happen is to try and reconstruct the path that the system took to get here, and that means trying, as best you can, to recreate the context that people were operating under when they made those decisions.
In particular, availability work tends to go to the areas where there was previously evidence of problems. That tends to be where I try to pick at things. Did we see problems in this area before? If we never had problems in this area before, what was different this time?
If we did see problems in the past, and those problems weren’t addressed, then that leads to a different set of questions. There are always more problems than resources, which means that orgs have to figure out what they’re going to prioritize (say “quarterly planning” to any software engineer and watch the light fade from their eyes). How does prioritization happen at the org?
It’s too much to hope for a public writeup to ever give that sort of insight, but I was hoping for something more about the story of “How we got here” in their final writeup. Unfortunately, it looks like this is all we get.
The last post I wrote mentioned in passing how Java’s ReentrantLock class is implemented as a modified version of something called a CLH lock. I’d never heard of a CLH lock before, and so I thought it would be fun to learn more about it by trying to model it. My model here is based on the implementation described in section 3.1 of the paper Building FIFO and Priority-Queueing Spin Locks from Atomic Swap by Travis Craig.
This type of lock is a first-in-first-out (FIFO) spin lock. FIFO means that processes acquire the lock in the order in which they requested it, also known as first-come-first-served. A spin lock is a type of lock where the thread runs in a loop, continually checking the state of a variable to determine whether it can take the lock. (Note that Java’s ReentrantLock is not a spin lock, despite being based on CLH).
Also note that I’m going to use the terms process and thread interchangeably in this post. The literature generally uses the term process, but I prefer thread because of the shared-memory nature of the lock.
Visualizing the state of a CLH lock
I thought I’d start off by showing you what a CLH data structure looks like under lock contention. Below is a visualization of the state of a CLH lock when:
There are three processes (P1, P2, P3)
P1 has the lock
P2 and P3 are waiting on the lock
The order that the lock was requested in is: P1, P2, P3
The paper uses the term requests to refer to the shared variables that hold the granted/pending state of the lock; those are the boxes with the sharp corners. The head of the line is at the left, and the end of the line is at the right, with a tail pointer pointing to the final request in line.
Each process owns two pointers: watch and myreq. These pointers each point to a request variable.
If I am a process, then my watch pointer points to a request that’s owned by the process that’s ahead of me in line. The myreq pointer points to the request that I own, which is the request that the process behind me in line will wait on. Over time, you can imagine the granted value propagating to the right as the processes release the lock.
This structure behaves like a queue, where P1 is at the head and P3 is at the tail. But it isn’t a conventional linked list: the pointers don’t form a chain.
Requesting the lock
When a new process requests the lock, it:
points its watch pointer to the current tail
creates a new pending request object and points to it with its myreq pointer
updates the tail to point to this new object
Releasing the lock
To release the lock, a process sets the value of its myreq request to granted. (It also takes ownership of the watch request, setting myreq to that request, for future use).
Modeling desired high-level behavior
Before we build our CLH model, let’s think about how we’ll verify if our model is correct. One way to do this is to start off with a description of the desired behavior of our lock without regard to the particular implementation details of CLH.
What we want out of a CLH lock is a lock that preserves the FIFO ordering of the requests. I defined a model I called FifoLock that has just the high-level behavior, namely:
it implements mutual exclusion (sketched just after this list)
threads are granted the lock in first-come-first-served order
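Mutual exclusion, for example, can be expressed as an invariant along these lines. This is just a sketch: the Threads set, the HasLock helper, and the "has-lock" state name are illustrative rather than the exact names in the model.

HasLock(t) == state[t] = "has-lock"

\* At most one thread holds the lock at any time
MutualExclusion ==
    \A t1, t2 \in Threads : (HasLock(t1) /\ HasLock(t2)) => t1 = t2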
Execution path
Here’s the state transition diagram for each thread in my model. The arrows are annotated with the names of the TLA+ sub-actions in the model.
FIFO behavior
To model the FIFO behavior, I used a TLA+ sequence to implement a queue. The Request action appends the thread id to a sequence, and the Acquire action only grants the lock to the thread at the head of the sequence:
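Here’s a minimal sketch of what those two actions might look like, assuming the standard Sequences module and illustrative thread-state names:

Request(t) ==
    /\ state[t] = "ready"
    /\ queue' = Append(queue, t)                  \* join the back of the line
    /\ state' = [state EXCEPT ![t] = "requested"]

Acquire(t) ==
    /\ state[t] = "requested"
    /\ queue # <<>>
    /\ Head(queue) = t                            \* only the head of the line gets the lock
    /\ state' = [state EXCEPT ![t] = "has-lock"]
    /\ UNCHANGED queue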
Now let’s build an actual model of the CLH algorithm. Note that I called these processes instead of threads because that’s what the original paper does, and I wanted to stay close to the paper in terminology to help readers.
The CLH model relies on pointers. I used a TLA+ function to model pointers. I defined a variable named requests that maps a request id to the value of the request variable. This way, I can model pointers by having processes interact with request values using request ids as a layer of indirection.
---- MODULE CLH ----
...
CONSTANTS NProc, GRANTED, PENDING, X
first == 0 \* Initial request on the queue, not owned by any process
Processes == 1..NProc
RequestIDs == Processes \union {first}
VARIABLES
requests, \* map request ids to request state
...
TypeOk ==
/\ requests \in [RequestIDs -> {PENDING, GRANTED, X}]
...
(I used “X” to mean “don’t care”, which is what Craig uses in his paper).
The way the algorithm works, there needs to be an initial request which is not owned by any of the processes. I used 0 as the request id for this initial request and called it first.
The myreq and watch pointers are modeled as functions that map from a process id to a request id. I used -1 to model a null pointer.
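In other words, the pointer variables are typed roughly like this (NullReq and the operator name are illustrative):

NullReq == -1    \* stands in for a null pointer

PointersTypeOk ==
    /\ myreq \in [Processes -> RequestIDs \union {NullReq}]
    /\ watch \in [Processes -> RequestIDs \union {NullReq}]
    /\ tail  \in RequestIDs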
The state-transition diagram for each process in our CLH model is similar to the one in our abstract (FifoLock) model. The only real difference is that requesting the lock is done in two steps (sketched below):
Update the request variable to pending (MarkPending)
Append the request to the tail of the queue (EnqueueRequest)
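Roughly, the per-process actions look like the sketch below, where pstate is a hypothetical per-process program counter and the state names and guards are simplified; the acquire and release actions correspond to the steps described earlier.

MarkPending(p) ==
    /\ pstate[p] = "ready"
    /\ requests' = [requests EXCEPT ![myreq[p]] = PENDING]
    /\ pstate' = [pstate EXCEPT ![p] = "to-enqueue"]
    /\ UNCHANGED <<myreq, watch, tail>>

EnqueueRequest(p) ==
    /\ pstate[p] = "to-enqueue"
    /\ watch' = [watch EXCEPT ![p] = tail]        \* watch the request currently at the tail
    /\ tail' = myreq[p]                           \* my request becomes the new tail
    /\ pstate' = [pstate EXCEPT ![p] = "waiting"]
    /\ UNCHANGED <<requests, myreq>>

AcquireLock(p) ==
    /\ pstate[p] = "waiting"
    /\ requests[watch[p]] = GRANTED               \* the process ahead of me has released the lock
    /\ pstate' = [pstate EXCEPT ![p] = "has-lock"]
    /\ UNCHANGED <<requests, myreq, watch, tail>>

ReleaseLock(p) ==
    /\ pstate[p] = "has-lock"
    /\ requests' = [requests EXCEPT ![myreq[p]] = GRANTED]
    /\ myreq' = [myreq EXCEPT ![p] = watch[p]]    \* take ownership of the watched request
    /\ pstate' = [pstate EXCEPT ![p] = "ready"]
    /\ UNCHANGED <<watch, tail>>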
We want to verify that our CLH model always behaves in a way consistent with our FifoLock model. We can check this using the model checker by specifying what’s called a refinement mapping. We need to show that:
We can create a mapping from the CLH model’s variables to the FifoLock model’s variables.
Every valid behavior of the CLH model satisfies the FifoLock specification under this mapping.
The mapping
The FifoLock model has two variables, state and queue.
The state variable is a function that maps each thread id to the state of that thread in its state machine. The dashed arrows show how I defined the state refinement mapping:
Note that two states in the CLH model (ready, to-enqueue) map to one state in FifoLock (ready).
For the queue mapping, I need to take the CLH representation (recall it looks like this)…
…and define a transformation so that I end up with a sequence that looks like this, which represents the queue in the FifoLock model:
<<P1, P2, P3>>
To do that, I defined a helper operator called QueueFromTail that, given a tail pointer in CLH, constructs a sequence of ids in the right order.
Unowned(request) ==
~ \E p \in Processes : myreq[p] = request
\* Need to reconstruct the queue to do our mapping
RECURSIVE QueueFromTail(_)
QueueFromTail(rid) ==
IF Unowned(rid) THEN <<>> ELSE
LET tl == CHOOSE p \in Processes : myreq[p] = rid
r == watch[tl]
IN Append(QueueFromTail(r), tl)
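For reference, a refinement property like this is typically written with an INSTANCE substitution. Here’s a sketch, assuming FifoLock names its top-level formula Spec and its thread-set constant Threads, and that pstate is the CLH model’s per-process state variable:

\* Collapse the two CLH-only states onto FifoLock's "ready" state
MappedState(p) == IF pstate[p] = "to-enqueue" THEN "ready" ELSE pstate[p]

Fifo == INSTANCE FifoLock WITH
            Threads <- Processes,
            queue <- QueueFromTail(tail),
            state <- [p \in Processes |-> MappedState(p)]

\* Every behavior of the CLH model must satisfy FifoLock's spec under the mapping
Refinement == Fifo!Spec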
In my CLH.cfg file, I added this line to check the refinement mapping:
PROPERTY
Refinement
Note again what we’re doing here for validation: we’re checking whether our more detailed model faithfully implements the behavior of a simpler model (under a given mapping). We’re using TLA+ to serve two different purposes:
To specify the high-level behavior that we consider correct (our behavioral contract)
To model a specific algorithm (our implementation)
Because TLA+ allows us to choose the granularity of our model, we can use it for both high-level specs of desired behavior and low-level specifications of an algorithm.
Recently, some of my former colleagues wrote a blog post on the Netflix Tech Blog about a particularly challenging performance issue they ran into in production when using the new virtual threads feature of Java 21. Their post goes into a lot of detail on how they conducted their investigation and finally figured out what the issue was, and I highly recommend it. They also include a link to a simple Java program that reproduces the issue.
I thought it would be fun to see if I could model the system in enough detail in TLA+ to be able to reproduce the problem. Ultimately, this was a deadlock issue, and one of the things that TLA+ is good at is detecting deadlocks. While reproducing a known problem won’t find any new bugs, I still found the exercise useful because it helped me get a better understanding of the behavior of the technologies that are involved. There’s nothing like trying to model an existing system to teach you that you don’t actually understand the implementation details as well as you thought you did.
This problem only manifested when the following conditions were all present in the execution:
virtual threads
platform threads
all threads contending on a lock
some virtual threads trying to acquire the lock when inside a synchronized block
some virtual threads trying to acquire the lock when not inside a synchronized block
To be able to model this, we need to understand some details about:
virtual threads
synchronized blocks and how they interact with virtual threads
Java lock behavior when there are multiple threads waiting for the lock
Virtual threads
Virtual threads are very well-explained in JEP 444, and I don’t think I can really improve upon that description. I’ll provide just enough detail here so that my TLA+ model makes sense, but I recommend that you read the original source. (Side note: Ron Pressler, one of the authors of JEP 444, just so happens to be a TLA+ expert).
If you want to write a request-response style application that can handle multiple concurrent requests, then programming in a thread-per-request style is very appealing, because the style of programming is easy to reason about.
The problem with thread-per-request is that each individual thread consumes a non-trivial amount of system resources (most notably, dedicated memory for the call stack of each thread). If you have too many threads, then you can start to see performance issues. But in the thread-per-request model, the number of concurrent requests your application can service is capped by the number of threads. The number of threads your system can run can become the bottleneck.
You can avoid the thread bottleneck by writing in an asynchronous style instead of thread-per-request. However, async code is generally regarded as more difficult for programmers to write than thread-per-request.
Virtual threads are an attempt to keep the thread-per-request programming model while reducing the resource cost of individual threads, so that programs can run comfortably with many more threads.
The general approach of virtual threads is referred to as user-mode threads. Traditionally, threads in Java are wrappers around operating-system (OS) level threads, which means there is a 1:1 relationship between a thread from the Java programmer’s perspective and the operating system’s perspective. With user-mode threads, this is no longer true, which means that virtual threads are much “cheaper”: they consume fewer resources, so you can have many more of them. They still have the same API as OS level threads, which makes them easy to use by programmers who are familiar with threads.
However, you still need the OS level threads (which JEP 444 refers to as platform threads), as we’ll discuss in the next section.
How virtual threads actually run: carriers
For a virtual thread to execute, the scheduler needs to assign a platform thread to it. This platform thread is called the carrier thread, and JEP 444 uses the term mounting to refer to assigning a virtual thread to a carrier thread. The way you set up your Java application to use virtual threads is that you configure a pool of platform threads that will back your virtual threads. The scheduler will assign virtual threads to carrier threads from the pool.
A virtual thread may be assigned to different platform threads throughout the virtual thread’s lifetime: the scheduler is free to unmount a virtual thread from one platform thread and then later mount the virtual thread onto a different platform thread.
Importantly, a virtual thread needs to be mounted to a platform thread in order to run. If it isn’t mounted to a carrier thread, the virtual thread can’t run.
Synchronized
While virtual threads are new to Java, the synchronized statement has been in the language for a long time. From the Java docs:
A synchronized statement acquires a mutual-exclusion lock (§17.1) on behalf of the executing thread, executes a block, then releases the lock. While the executing thread owns the lock, no other thread may acquire the lock.
final class RealSpan extends Span {
  ...
  final PendingSpans pendingSpans;
  final MutableSpan state;

  @Override public void finish(long timestamp) {
    synchronized (state) {
      pendingSpans.finish(context, timestamp);
    }
  }
}
In the example method above, the effect of the synchronized statement is to:
Acquire the lock associated with the state object
Execute the code inside of the block
Release the lock
(Note that in Java, every object is associated with a lock). This type of synchronization API is sometimes referred to as a monitor.
Synchronized blocks are relevant here because they interact with virtual threads, as described below.
Virtual threads, synchronized blocks, and pinning
Synchronized blocks guarantee mutual exclusion: only one thread can execute inside of the synchronized block. In order to enforce this guarantee, the Java engineers made the decision that, when a virtual thread enters a synchronized block, it must be pinned to its carrier thread. This means that the scheduler is not permitted to unmount a virtual thread from a carrier until the virtual thread exits the synchronized block.
ReentrantLock as FIFO
The last piece of the puzzle is related to the implementation details of a particular Java synchronization primitive: the ReentrantLock class. What’s important here is how this class behaves when there are multiple threads contending for the lock, and the thread that currently holds the lock releases it.
The relevant implementation detail is that this lock is a FIFO: it maintains a queue with the identity of the threads that are waiting on the lock, ordered by arrival. As you can see in the code, after a thread releases a lock, the next thread in line gets notified that the lock is free.
Now for the TLA+ model. VirtualThreads is the set of virtual threads that execute the application logic. CarrierThreads is the set of platform threads that will act as carriers for the virtual threads. To model the failure mode that was originally observed in the blog post, I also needed to have platform threads that will execute the application logic. I called these FreePlatformThreads, because these platform threads are never bound to virtual threads.
I called the threads that directly execute the application logic, whether virtual or (free) platform, “work threads”:
schedule – a function that describes how work threads are mounted to platform threads
inSyncBlock – the set of virtual threads that are currently inside a synchronized block
pinned – the set of carrier threads that are currently pinned to a virtual thread
lockQueue – a queue of work threads that are currently waiting for the lock
My model only models a single lock, which I believe is this lock in the original scenario. Note that I don’t model the locks associated with the synchronization blocks, because there was no contention associated with those locks in the scenario I’m modeling: I’m only modeling synchronization for the pinning behavior.
Program counters
Each work thread can take one of two execution paths through the code. Here’s a simplified view of the two possible code paths: the left code path is when a thread enters a synchronized block before it tries to acquire the lock, and the right code path is when a thread does not enter a synchronized block before it tries to acquire the lock.
This is conceptually how my model works, but I modeled the left-hand path a little differently than this diagram. The program counters in my model actually transition back from “to-enter-sync” to “ready”, and they use a separate state variable (inSyncBlock) to determine whether to transition to the to-exit-sync state at the end. The other difference is that the program counters in my model will also jump directly from “ready” to “locked” if the lock is currently free. But showing those state transitions directly would make the above diagram harder to read.
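Putting the pieces together, the variable declarations and their types look roughly like this. The TypeOk operator is illustrative, and the pc values listed are just the ones that appear in this post; the actual model may have a few more.

VARIABLES schedule, inSyncBlock, pinned, lockQueue, pc

WorkThreads == VirtualThreads \union FreePlatformThreads

TypeOk ==
    \* a work thread is mounted on a carrier, runs as itself (free platform thread), or is unscheduled
    /\ schedule \in [WorkThreads -> CarrierThreads \union FreePlatformThreads \union {NULL}]
    /\ inSyncBlock \subseteq VirtualThreads
    /\ pinned \subseteq CarrierThreads
    /\ lockQueue \in Seq(WorkThreads)
    /\ pc \in [WorkThreads -> {"ready", "to-enter-sync", "requested", "locked",
                               "in-critical-section", "to-exit-sync"}]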
The lock
The FIFO lock is modeled as a TLA+ sequence of work threads. Note that I don’t need to actually model whether the lock is held or not. My model just checks that the thread that wants to enter the critical section is at the head of the queue, and when it leaves the critical section, it gets removed from the head. The relevant actions look like this:
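Sketched out, with the intermediate pc transitions simplified and the scheduling guard (described in a later section) omitted; the action names here are illustrative:

RequestLock(thread) ==
    /\ pc[thread] = "ready"
    /\ lockQueue' = Append(lockQueue, thread)      \* join the back of the line
    /\ pc' = [pc EXCEPT ![thread] = "requested"]
    /\ UNCHANGED <<schedule, pinned, inSyncBlock>>

AcquireLock(thread) ==
    /\ pc[thread] = "requested"
    /\ lockQueue # <<>>
    /\ Head(lockQueue) = thread                    \* only the head of the queue may take the lock
    /\ pc' = [pc EXCEPT ![thread] = "locked"]
    /\ UNCHANGED <<schedule, pinned, inSyncBlock, lockQueue>>

ReleaseLock(thread) ==
    /\ pc[thread] = "in-critical-section"
    /\ lockQueue' = Tail(lockQueue)                \* leave from the head of the queue
    /\ pc' = [pc EXCEPT ![thread] =
                IF thread \in inSyncBlock THEN "to-exit-sync" ELSE "ready"]
    /\ UNCHANGED <<schedule, pinned, inSyncBlock>>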
To check if a thread has the lock, I just check if it’s in a state that requires the lock:
\* A thread has the lock if its pc is in one of the states that requires the lock
HasTheLock(thread) == pc[thread] \in {"locked", "in-critical-section"}
I also defined an invariant, which I checked with TLC, to ensure that only the thread at the head of the lock queue can be in either of these states:
HeadHasTheLock ==
\A t \in WorkThreads : HasTheLock(t) => t = Head(lockQueue)
Pinning in synchronized blocks
When a virtual thread enters a synchronized block, the corresponding carrier thread gets marked as pinned. When it leaves a synchronized block, it gets unmarked as pinned.
\* We only care about virtual threads entering synchronized blocks
EnterSynchronizedBlock(virtual) ==
/\ pc[virtual] = "to-enter-sync"
/\ pinned' = pinned \union {schedule[virtual]}
...
ExitSynchronizedBlock(virtual) ==
/\ pc[virtual] = "to-exit-sync"
/\ pinned' = pinned \ {schedule[virtual]}
The scheduler
My model has an action named Mount that mounts a virtual thread to a carrier thread. In my model, this action can fire at any time, whereas in the actual implementation of virtual threads, I believe that thread scheduling only happens at blocking calls. Here’s what it looks like:
\* Mount a virtual thread to a carrier thread, bumping the other thread
\* We can only do this when the target carrier is not pinned, and when the virtual thread is not itself pinned to its current carrier
\* We need to unschedule the previous thread
Mount(virtual, carrier) ==
LET carrierInUse == \E t \in VirtualThreads : schedule[t] = carrier
prev == CHOOSE t \in VirtualThreads : schedule[t] = carrier
IN /\ carrier \notin pinned
/\ \/ schedule[virtual] = NULL
\/ schedule[virtual] \notin pinned
\* If a virtual thread has the lock already, then don't pre-empt it.
/\ carrierInUse => ~HasTheLock(prev)
/\ schedule' = IF carrierInUse
THEN [schedule EXCEPT ![virtual]=carrier, ![prev]=NULL]
ELSE [schedule EXCEPT ![virtual]=carrier]
/\ UNCHANGED <<lockQueue, pc, pinned, inSyncBlock>>
My scheduler also won’t pre-empt a thread that is holding the lock. That’s because, in the original blog post, nobody was holding the lock during the deadlock, and I wanted to reproduce that specific scenario. So, if a thread gets the lock, the scheduler won’t stop it from releasing it. And, yet, we can still get a deadlock!
Only executing while scheduled
Every action is only allowed to fire if the thread is scheduled. I implemented it like this:
IsScheduled(thread) == schedule[thread] # NULL
\* Every action has this pattern:
Action(thread) ==
/\ IsScheduled(thread)
....
Finding the deadlock
Here’s the config file I used to run a simulation. I defined three virtual threads, two carrier threads, and one free platform thread.
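Roughly, the config looks like this (a sketch; the model-value name for the free platform thread and the SPECIFICATION name are assumptions):

SPECIFICATION Spec
CONSTANTS
    VirtualThreads = {V1, V2, V3}
    CarrierThreads = {C1, C2}
    FreePlatformThreads = {F1}
    NULL = NULL
INVARIANT HeadHasTheLock

TLC checks for deadlock by default, so no extra configuration is needed to catch the deadlocked state.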
And, indeed, the model checker finds a deadlock! Here’s what the trace looks like in the Visual Studio Code plugin.
Note how virtual thread V2 is in the “requested” state according to its program counter, and V2 is at the head of the lockQueue, which means that it is able to take the lock, but the schedule shows “V2 :> NULL”, which means that the thread is not scheduled to run on any carrier threads.
And we can see in the “pinned” set that all of the carrier threads {C1,C2} are pinned. We also see that the threads V1 and V3 are behind V2 in the lock queue. They’re blocked waiting for the lock, and they’re holding on to the carrier threads, so V2 will never get a chance to execute.
One of the workhorses of the modern software world is the key-value store. There are key-value services such as Redis or Dynamo, and some languages build key-value data structures right into the language (examples include Go, Python and Clojure). Even relational databases, which are not themselves key-value stores, are frequently built on top of a data structure that exposes a key-value interface.
Key-value data structures are often referred to as maps or dictionaries, but here I want to call attention to the less-frequently-used term associative array. This term evokes the associative nature of the store: the value that we store is associated with the key.
When I work on an incident writeup, I try to capture details and links to the various artifacts that the incident responders used in order to diagnose and mitigate the incident. Examples of such details include:
dashboards (with screenshots that illustrate the problem and links to the dashboard)
names of feature flags that were used or considered for remediation
relevant Slack channels where there was coordination (other than the incident channel)
Why bother with these details? I see it as a strategy to tackle the problem that Woods and Cook refer to in Perspectives on Human Error: Hindsight Biases and Local Rationality as the problem of inert knowledge. There may be information that we learned at some point, but that we can’t bring to bear when an incident is happening. However, we humans are good at remembering other incidents! And so, my hope is that when an operational surprise happens, someone will remember “Oh yeah, I remember reading about something like this when incident XYZ happened”, and then they can go look up the incident writeup for incident XYZ and see the details that they need to help them respond.
In other words, the previous incidents act as keys, and the content of the incident write-ups act as the value. If you make the incident write-ups memorable, then people may just remember enough about them to look up the write-ups and page in details about the relevant tools right when they need them.
Below is a screenshot of Vizceral, a tool that was built by a former teammate of mine at Netflix. It provides a visualization of the interactions between the various microservices.
Vizceral uses moving dots to depict how requests are currently flowing through the Netflix microservice architecture. Vizceral is able to do its thing because of the platform tooling, which provides support for generating a visualization like this by exporting a standard set of inter-process communication (IPC) metrics.
What you don’t see depicted here are the interactions between those microservices and the telemetry platform that ingests these metrics. There’s also logging and tracing data, and those get shipped off-box via different channels, but none of those channels show up in this diagram.
In fact, this visualization doesn’t represent interactions with any of the platform services. You won’t see bubbles that represent the compute platform or the CI/CD platform represented in a diagram like this, even though those platform services all interact with these application services in important ways.
I call the first category of interactions, the ones between the application services, first-class, and the second category, the ones that involve platform services, second-class. It’s those second-class interactions that I want to say more about.
These second-class interactions tend to have a large blast radius, because successful platforms by their nature have a large blast radius. There’s a reason why there’s so much havoc out in the world when AWS’s us-east-1 region has a problem: because so many services out there are using us-east-1 as a platform. Similarly, if you have a successful platform within your organization, then by definition it’s going to see a lot of use, which means that if it experiences a problem, it can do a lot of damage.
These platforms are generally more reliable than the applications that run atop them, because they have to be: platforms naturally have higher reliability requirements than the applications that run atop them. They have these requirements because they have a large blast radius. A flaky platform is a platform that contributes to multiple high-severity outages, and systems that contribute to multiple high-severity outages are the systems where reliability work gets prioritized.
And a reliable system is a system whose details you aren’t aware of, because you don’t need to be. If my car is very reliable, then I’m not going to build an accurate mental model of how my car works, because I don’t need to: it just works. In her book Human-Machine Reconfigurations: Plans and Situated Actions, the anthropologist Lucy Suchman used the term representation to describe the activity of explicitly constructing a mental model of how a piece of technology works, and she noted that this type of cognitive work only happens when we run into trouble. As Suchman puts it:
[R]epresentation occurs when otherwise transparent activity becomes in some way problematic
Hence the irony: these second-class interactions tend not to be represented in our system models when we talk about reliability, because they are generally not problematic.
And so we are lulled into a false sense of security. We don’t think about how the plumbing works, because the plumbing just works. Until the plumbing breaks. And then we’re in big trouble.
Yesterday, CrowdStrike released a Preliminary Post-Incident Review of the major outage that happened last week. I’m going to wait until the full post-incident review is released before I do any significant commentary, and I expect we’ll have to wait at least a month for that. But I did want to highlight one passage from the section titled What Happened on July 19, 2024, emphasis mine:
On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.
Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
And now, let’s reach way back into the distant past of three weeks ago, when the Canadian Radio-television and Telecommunications Commission (CRTC) posted an executive summary of a major outage, which I blogged about at the time. Here’s the part I want to call attention to, once again, emphasis mine:
Rogers had initially assessed the risk of this seven-phased process as “High.” However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.
In both cases, the engineers had built up confidence over time that the types of production changes they were making were low risk.
When we’re doing something new with a technology, we tend to be much more careful with it: it’s riskier, and we’re still shaking things out. But, over time, after there haven’t been any issues, we start to gain more trust in the tech, confident that it’s a reliable technology. Our internal perception of the risk adjusts based on our experience, and we come to believe that the risks of these sorts of changes are low. Any organization that concentrates its reliability efforts on action items in the wake of an incident, rather than focusing on the normal work that doesn’t result in incidents, is implicitly making this type of claim. The squeaky incident gets the reliability grease. And, indeed, it’s rational to allocate your reliability effort based on your perception of risk. Any change can break us, but we can’t treat every change the same. How could it be otherwise?
The challenge for us is that large incidents are not always preceded by smaller ones, which means that there may be risks in your system that haven’t manifested as minor outages. I think these types of risks are the most dangerous ones of all, precisely because they’re much harder for the organization to see. How are you going to prioritize doing the availability work for a problem that hasn’t bitten you yet, when your smaller incidents demonstrate that you have been bitten by so many other problems?
This means that someone has to hunt for weak signals of risk and advocate for doing the kind of reliability work where there isn’t a pattern of incidents you can point to as justification. The big ones often don’t look like the small ones, and sometimes the only signal you get in advance is a still, small sound.
One of the benefits of having attended a religious elementary school (despite being a fairly secular one as religious schools go) is exposure to the text of the Bible. There are two verses from the first book of Kings that have always stuck with me. They are from I Kings 19:11–13, where God is speaking to the prophet Elijah. I’ve taken the text below from the English translation that was done by A.J. Rosenberg:
And He said: Go out and stand in the mountain before the Lord, behold!
The Lord passes, and a great and strong wind splitting mountains and shattering boulders before the Lord, but the Lord was not in the wind.
And after the wind an earthquake-not in the earthquake was the Lord.
After the earthquake fire, not in the fire was the Lord, and after the fire a still small sound.
It’s in the aftermath of the big incidents where most of the attention is focused when it comes to investing in reliability: the great and strong winds, the earthquakes, the fires. But I think it’s the subtler signals, what Conklin refers to as the yellow (as opposed to the glaring signals of the red or the absence of signal of the green) that are the ones that communicate to us about the risk of the major outage that has not happened yet.
It’s these still small sounds we need to be able to hear in order to see the risks that could eventually lead to catastrophe. But to hear these sounds, we need to listen for them.
In the 1980s, the anthropologist Lucy Suchman studied how office workers interacted with sophisticated photocopiers. What she found was that people’s actions were not determined by predefined plans. Instead, people decided what act to take based on the details of the particular situation they found themselves in. They used predefined plans as resources for helping them choose which action to take, rather than as a set of instructions to follow.
I couldn’t help thinking of Suchman when reading How Life Works. In it, the British science writer Philip Ball presents a new view of the role of DNA, genes, and the cell in the field of biology. Just as Suchman argued that people use plans as resources rather than explicit instructions, Ball discusses how the cell uses DNA as a resource. Our genetic code is a toolbox, not a blueprint.
Imagine you’re on a software team that owns a service, and an academic researcher who is interested in software but doesn’t really know anything about it comes to you and asks, “What function does redis play in your service? What would happen if it got knocked out?”. This is a reasonable question, and you explain the role that redis plays in improving performance through caching. And then he asks another question: “What function does your IDE’s debugger play in your service?” He notices the confused look on your face and tries to clarify the question by asking, “Imagine another team had to build the same service, but they didn’t have the IDE debugger. How would the behavior of the service be different? Which functions would be impaired?” And you try to explain that you don’t actually know how it would be different. That the debugger, unlike redis, is a tool, which is sometimes used to help diagnose problems. But there are multiple ways to debug (for example, using log statements). It might not make any difference at all if that new team doesn’t happen to use the debugger. There’s no direct mapping of the debugger’s presence to the service’s functionality: the debugger doesn’t play a functional role in the service the way that redis does. In fact, the next team that builds a service might not end up needing to use the debugger at all, so removing it might have no observable effect on the next service.
The old view sees these DNA segments as being like redis, having an explicit functional role, and the new view sees them more like a debugger, as tools to support the cell in performing functions. As Ball puts it, “The old view of genes as distinct segments of DNA strung along the chromosomes like beads, interspersed with junk, and each controlling some aspect of phenotype, was basically a kind of genetic phrenology.” The research has shown that the story is more complex than that, and that there is no simple mapping between DNA segments in our chromosomes and observed traits, or phenotypes. Instead, these DNA segments are yet another input in a complex web of dynamic, interacting components. Instead of focusing on these DNA strands of our genome, Ball directs our attention to the cell as a more useful unit of analysis. A genome, he points out, is not capable of constructing a cell. Rather, a cell is always the context that must exist for the genome to be able to do anything.
The problem space that evolution works in is very different from the one that human engineers deal with, and, consequently, the solution space can appear quite alien to us. The watch has long been a metaphor for biological organisms (for example, Dawkins’s book “The Blind Watchmaker”), but biological systems are not like watches with their well-machined gears. The micro-world of the cell contains machinery at the scale of molecules, which is a very noisy place. Because biological systems must be energy efficient, they function close to the limits of thermal noise. That requires very different types of machines from the ones we interact with at our human scales. Biology can’t use specialized parts with high tolerances, but must instead make do with more generic parts that can be used to solve many different kinds of problems. And because cells can use the same parts of the genome to solve different problems, questions like “what does that protein do?” become much harder to answer: the function of a protein depends on the context in which the cell uses it, and the cell can use it in multiple contexts. Proteins are not like keys designed to fit specifically into unique locks; they bind promiscuously to different sites.
This book takes a very systems-thinking approach, as opposed to a mechanistic one, and consequently I find it very appealing. This is a complex, messy world of signaling networks, where behavior emerges from the interaction of genome and environment. There are many connections here to the field of resilience engineering, which has long viewed biology as a model (for example, see Richard Cook’s talk on the resilience of bone). In this model, the genome acts as a set of resources which the cell can leverage to adapt to different challenges. The genome is an example, possibly the paradigmatic example, of adaptive capacity. Or, as the biologists Michael Levin and Rafael Yuste, whom Ball quotes, put it: “Evolution, it seems, doesn’t come up with answers so much as generate flexible problem-solving agents that can rise to new challenges and figure things out on their own.”
The full report doesn’t seem to be available yet, and I’m not sure if it ever will be publicly released. I recommend you read the executive summary, but here are some quick impressions of mine.
Note that I’m not a network engineer (I’ve only managed a single rack of servers in my time), so I don’t have any domain expertise here.
Migration!
When you hear “large-scale outage”, a good bet is that it involved a migration. The language of the report describes it as an upgrade, but I suspect this qualifies as a migration.
In the weeks leading to the day of the outage on 8 July 2022, Rogers was executing on a seven-phase process to upgrade its IP core network. The outage occurred during the sixth phase of this upgrade process.
I don’t know anything about what’s involved in a telecom upgrading its IP core network, but I do have a lot of general opinions about migrations, and I’m willing to bet they apply here as well.
I think of migrations as high-impact, bespoke changes that the system was not originally designed to accommodate.
They’re high-impact because things can go quite badly if something goes wrong. If you’ve worked at a larger company, you’ve probably experienced migrations that seem to take forever, and this is one of the reasons why: there’s a lot of downside risk in doing migration work (and often not much immediate upside benefit for the people who have to do the work, but that’s a story for another day).
Migrations are bespoke in the sense that each migration is a one-off. This makes migrations even more dangerous because:
The organization doesn’t have any operational muscles around doing any particular migration, because each one is new.
Because each migration is unique, it’s not worth the effort to build tooling to support doing the migration. And even if you build tools, those tools will always be new, which means they haven’t been hardened through production use.
There’s a reason why you hear about continuous integration and continuous delivery but not continuous migration, even though every org past a certain age will have multiple migrations in flight.
Finally, migrations are changes that the system was not originally designed to accommodate. In my entire career, during the design of a new system, I have never heard anyone ask, “How are we going to migrate off of this new system at the end of its life?” We just don’t design for migrating off of things. I don’t even know if it’s possible to do so.
Saturation!
Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration. When the core network routers crashed, user traffic could no longer be routed to the appropriate destination. Consequently, services such as mobile, home phone, Internet, business wireline connectivity, and 9-1-1 calling ceased functioning.
Saturation is a term from resilience engineering which refers to a system being pushed to the limit of the amount of load that it can handle. It’s remarkable how many outages in distributed systems are related to some part of the system being overloaded, or hitting a rate limit, or exceeding some other limit. (For example, see Slack’s Jan 2021 outage). This incident is another textbook example of a brittle system, which falls over when it becomes saturated.
Perception of risk
I mentioned earlier that migrations are risky, and everyone knows migrations are risky. Rogers engineers knew that as well:
Rogers had initially assessed the risk of this seven-phased process as “High.”
Ironically, the fact that the migration had gone smoothly up until that point led them to revise their risk assessment downwards.
However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.
I wrote about this phenomenon in a previous post, Any change can break us, but we can’t treat every change the same. The engineers gained confidence as they progressed through the migration, and things went well. Which is perfectly natural. In fact, this is one of the strengths of the continuous delivery approach: you build enough confidence that you don’t have to babysit every single deploy anymore.
But the problem is that we can never perfectly assess the risk in the system. And no matter how much confidence we build up, that one change that we believe is safe can end up taking down the whole system.
I should note that the report is pretty blame-y when it comes to this part:
Downgrading the risk assessment to “Low” for changing the Access Control List filter in a routing policy contravenes industry norms, which require high scrutiny for such configuration changes, including laboratory testing before deploying in the production network.
I wish I had more context here. How did it make sense to them at the time? What sorts of constraints or pressures were they under? Hopefully the full report reveals more details.
Cleanup
Rogers staff deleted the policy filter that prevented IP route flooding in an effort to clean up the configuration files of the distribution routers.
Cleanup work has many of the same risks as migration work: it’s high-impact and bespoke. Say “cleanup script” to an SRE and watch the reaction on their face.
But not cleaning up is also a risk! The solution can’t be “never do cleanup” in the same way it can’t be “never do migrations”. Rather, we need to recognize that this work always involves risk trade-offs. There’s no safe path here.
Failure mode makes incident response harder
At the time of the July 2022 outage, Rogers had a management network that relied on the Rogers IP core network. When the IP core network failed during the outage, remote Rogers employees were unable to access the management network. …
Rogers staff relied on the company’s own mobile and Internet services for connectivity to communicate among themselves. When both the wireless and wireline networks failed, Rogers staff, especially critical incident management staff, were not able to communicate effectively during the early hours of the outage.
When an outage affects not just your customers but also your engineers doing incident response, life gets a whole lot harder. Rogers isn’t alone here: Meta ran into the same problem during the big Facebook outage of October 2021.
[A]s our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
Component substitution fallacy
The authors point out that the system was not designed to handle this sort of overload:
Absence of router overload protection. The July 2022 outage exposed the absence of overload protection on the core network routers. The network failure could have been prevented had the core network routers been configured with an overload limit that specifies the maximum acceptable number of IP routing data the router can support. However, the Rogers core network routers were not configured with such overload protection mechanisms. Hence, when the policy filter was removed from the distribution router, an excessive amount of routing data flooded the core routers, which led them to crash.
To the authors’ credit, they explicitly acknowledge the tradeoffs involved in the overall design of the system.
The Rogers network is a national Tier 1 network and is architecturally designed for reliability; it is typical of what would be expected of such a Tier 1 service provider network. The July 2022 outage was not the result of a design flaw in the Rogers core network architecture. However, with both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services. Such a network architecture is common to many service providers and is an example of the trend of converged wireline and wireless telecom networks. It is a design choice by service providers, including Rogers, that seeks to balance cost with performance.
I really hope the CRTC eventually releases the full report; I’m looking forward to reading it.
For databases that support transactions, there are different types of anomalies that can potentially occur: the higher the isolation level, the more classes of anomalies are eliminated (at a cost of reduced performance).
The anomaly that I always had the hardest time wrapping my head around was the one called a dirty write. This blog post is just to provide a specific example of a dirty write scenario and why it can be problematic. (For another example, see Adrian Colyer’s post). Here’s the definition:
Transaction T1 modifies a data item. Another transaction T2 then further modifies that data item before T1 performs a COMMIT or ROLLBACK.
Here’s an example. Imagine a bowling alley has a database where they keep track of who has checked out a pair of bowling shoes. For historical reasons, they track the left shoe borrower and the right shoe borrower as separate columns (left and right) in a shoes table.
Once upon a time, they used to let different people check out the left and the right shoe for a given pair. However, they don’t do that anymore: now both shoes must always be checked out by the same person. This is the invariant that must always be preserved in the database.
Consider the following two transactions, where both Alice and Bob are trying to borrow the pair of shoes with id=7.
-- Alice takes the id=7 pair of shoes
BEGIN TRANSACTION;
UPDATE shoes SET left = 'alice' WHERE id = 7;
UPDATE shoes SET right = 'alice' WHERE id = 7;
COMMIT TRANSACTION;

-- Bob takes the id=7 pair of shoes
BEGIN TRANSACTION;
UPDATE shoes SET left = 'bob' WHERE id = 7;
UPDATE shoes SET right = 'bob' WHERE id = 7;
COMMIT TRANSACTION;
Let’s assume that these two transactions run concurrently. We don’t care about whether Alice or Bob is the one who gets the shoes in the end, as long as the invariant is preserved (both left and right shoes are associated with the same person once the transactions complete).
Now imagine that the operations in these transactions happen to be scheduled so that the individual updates and commits end up running in this order:
SET left = “alice”
SET left = “bob”
SET right = “bob”
Commit Bob’s transaction
SET right = “alice”
Commit Alice’s transactions
I’ve drawn that visually here, where the top part of the diagram shows the operations grouped by Alice/Bob, and the bottom part shows them grouped by left/right column writes.
If the writes actually occur in this order, then the resulting row will be:
id | left  | right
 7 | "bob" | "alice"
Row 7 after the two transactions complete
After these transactions complete, this row violates our invariant! The dirty write is the second operation in the schedule above: Bob’s write to the left column clobbered Alice’s uncommitted write.
Note that dirty writes don’t require any reads to occur during the transactions, nor are they only problematic for rollbacks.