I’m sure you’ve heard the slogan “safety first”. It is a statement of values for an organization, but let’s think about how to define what it should mean explicitly. Here’s how I propose to define safety first, in the context of a company. I’ll assume the company is in the tech (software) industry, since that’s the one I know best. So, in this context, you can think of “safety” as being about avoiding system outages, rather than about, say, avoiding injuries on a work site.
Here we go:
A tech company is a safety first company if any engineer has the ability to extend a project deadline, provided that the engineer judges in the moment that they need additional time in order to accomplish the work more safely (e.g., by following an onerous procedure for making a change, or doing additional validation work that is particularly time-intensive).
This ability to extend the deadline must be:
automatic
unquestioned
consequence-free
Automatic. The engineer does not need to explicitly ask someone else for permission before extending the deadline.
Unquestioned. Nobody is permitted to ask the engineer “why did you extend the deadline?” after-the-fact.
Consequence-free. This action cannot be held against the engineer. For example, it cannot be a factor in a performance review.
Now, anyone who has worked in management would say to me, “Lorin, this is ridiculous. If you give people the ability to extend deadlines without consequence, then they’re just going to use this constantly, even if there isn’t any benefit to safety. It’s going to drastically harm the organization’s ability to actually get anything done”.
And, the truth is, they’re absolutely right. We all work under deadlines, and we all know that if there was a magical “extend deadline” button that anyone could press, that button would be pressed a lot, and not always for the purpose of improving safety. Organizations need to execute, and if anybody could introduce delays, this would cripple execution.
But this response is exactly the reason why safety first will always be a lie. Production pressure is an unavoidable reality for all organizations. Because of this, the system will always push back against delays, and that includes delays for the benefit of safety. This means engineers will always face double binds, where they will feel pressure to execute on schedule, but will be punished if they make decisions that facilitate execution but reduce safety.
Safety is never first in an organization: it’s always one of a number of factors that trade off against each other. And those sorts of tradeoff decisions happen day-to-day and moment-to-moment.
Remember that the next time someone is criticized for “not being careful enough” after a change brings down production.
The other day, The Future of TLA+ (pdf) hit Hacker News. TLA+ is a specification language: it is intended for describing the desired behavior of a system. Because it’s a specification language, you don’t need to specify implementation details to describe desired behavior. This can be confusing to experienced programmers who are newcomers to TLA+, because when you implement something in a programming language, you can always use either a compiler or an interpreter to turn it into an executable. But specs written in a specification language aren’t executable, they’re descriptions of permitted behaviors.
Just for fun, here’s an example: a TLA+ specification for a program that solves the twin prime conjecture. According to this spec, when the program runs, it will set the answer variable to either “yes” if the twin prime conjecture is true, or “no” if the conjecture is false.
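Here’s a minimal sketch of what such a spec could look like (one possible formulation; I start answer at “unknown” and transition it exactly once):

---- MODULE TwinPrime ----
EXTENDS Naturals

VARIABLE answer

\* n is prime if it is greater than 1 and has no divisor between 2 and n-1
IsPrime(n) == n > 1 /\ \A d \in 2..(n-1) : n % d # 0

\* The twin prime conjecture: there are infinitely many primes p such that p+2 is also prime
TwinPrimesAreInfinite ==
    \A n \in Nat : \E p \in Nat : p > n /\ IsPrime(p) /\ IsPrime(p + 2)

Init == answer = "unknown"

Next == /\ answer = "unknown"
        /\ answer' = IF TwinPrimesAreInfinite THEN "yes" ELSE "no"

Spec == Init /\ [][Next]_answer
====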
Unfortunately, if I were to try to write a program that implements this spec, I don’t know whether answer should be set to “yes” or “no”: the conjecture has yet to be proven or disproven. Despite this, I have still formally specified its behavior!
There’s a lot we simply don’t know about how reliability work was prioritized inside of CrowdStrike, but I’m going to propose a little thought experiment about the incident where I make some assumptions.
First, let’s assume that the CrowdStrike incident was the first time they had an incident that was triggered by a Rapid Response Content update, which is a config update. We’ll assume that previous sensor issues that led to Linux crashes were related to a sensor release, which is a code update.
Next, let’s assume that CrowdStrike focuses their reliability work on addressing the identified root cause of previous incidents.
Finally, let’s assume that none of the mitigations documented in their RCA were identified as action items that addressed the root cause of any incidents they experienced before this big one.
If these three assumptions are true, then they explain why these mitigations weren’t done previously: the mitigations didn’t address the root causes of previous incidents, and the company focuses its post-incident work on addressing those root causes. Now, I have no idea if any of these assumptions are actually true, but they sound plausible enough for this thought experiment to hold.
This thought experiment demonstrates the danger of focusing post-incident work on addressing the root causes of previous incidents: it acts to obscure other risks in the system that don’t happen to fit into the root cause analysis. After all, issues around validation of the channel files or staging of deploys were not really the root cause of any of the incidents before this one. The risks that are still in your system don’t care about what you have labeled “the real root cause” of the previous incident, and there’s no reason to believe that whatever gets this label is the thing that is most likely to bite you in the future.
I propose (cheekily) to refer to the prioritize-identifying-and-addressing-the-root-cause-of-previous-incidents thinking as the “CrowdStrike” approach to reliability work.
I put “CrowdStrike” in quotes because, in a sense, this really isn’t about them at all: I have no idea if the assumptions in this thought experiment are true. But my motivation for using this phrase is more about using CrowdStrike as a symbol that’s become salient to our industry than about the particular details of that company.
Are you on the lookout for the many different signals of risk in your system, or are you taking the “CrowdStrike” approach to reliability work?
CrowdStrike has released their final (sigh) External Root Cause Analysis doc. The writeup contains some more data on the specific failure mode. I’m not going to summarize it here, mostly because I don’t think I’d add any value in doing so: my knowledge of this system is no better than anyone else reading the report. I must admit, though, that I couldn’t help thinking of number eleven in Alan Perlis’s epigrams on programming.
If you have a procedure with ten parameters, you probably missed some.
What I wanted to do instead with this blog is call out the last two of the “findings and mitigations” in the doc:
Template Instance validation should expand to include testing within the Content Interpreter
Template Instances should have staged deployment
This echoes the chorus of responses I heard online in the aftermath of the outage. “Why didn’t they test these configs before deployment? How could they not stage their deploys?”
And this is my biggest disappointment with this writeup: it doesn’t provide us with insight into how the system got to this point.
Here are the types of questions I like to ask to try to get at this.
Had a rapid response content update ever triggered a crash before in the history of the company? If not, why do you think this type of failure (crash related to rapid response content) has never bitten the company before? If so, what happened last time?
Was there something novel about the IPC template type? (e.g., was this the first time the reading of one field was controlled by the value of another?)
How is generation of the test template instances typically done? Was the test template instance generation here a typical case or an exceptional one? If exceptional, what was different? If typical, how come it has never led to problems before?
Before the incident, had customers ever asked for the ability to do staged rollouts? If so, how was that ask prioritized relative to other work?
Was there any planned work to improve reliability before the incident happened? What type of work was planned? How far along was it? How did you prioritize this work?
I know I’m a broken record here, but I’ll say it again. Systems reach the current state that they’re in because, in the past, people within the system made rational decisions based on the information they had at the time, and the constraints that they were operating under. The only way to understand how incidents happen is to try and reconstruct the path that the system took to get here, and that means trying, as best you can, to recreate the context that people were operating under when they made those decisions.
In particular, availability work tends to go to the areas where there was previously evidence of problems. That tends to be where I try to pick at things. Did we see problems in this area before? If we never had problems in this area before, what was different this time?
If we did see problems in the past, and those problems weren’t addressed, then that leads to a different set of questions. There are always more problems than resources, which means that orgs have to figure out what they’re going to prioritize (say “quarterly planning” to any software engineer and watch the light fade from their eyes). How does prioritization happen at the org?
It’s too much to hope for a public writeup to ever give that sort of insight, but I was hoping for something more about the story of “How we got here” in their final writeup. Unfortunately, it looks like this is all we get.
The last post I wrote mentioned in passing how Java’s ReentrantLock class is implemented as a modified version of something called a CLH lock. I’d never heard of a CLH lock before, and so I thought it would be fun to learn more about it by trying to model it. My model here is based on the implementation described in section 3.1 of the paper Building FIFO and Priority-Queueing Spin Locks from Atomic Swap by Travis Craig.
This type of lock is a first-in-first-out (FIFO) spin lock. FIFO means that processes acquire the lock in the order in which they requested it, also known as first-come-first-served. A spin lock is a type of lock where the thread runs in a loop, continually checking the state of a variable to determine whether it can take the lock. (Note that Java’s ReentrantLock is not a spin lock, despite being based on CLH).
Also note that I’m going to use the terms process and thread interchangeably in this post. The literature generally uses the term process, but I prefer thread because of the shared-memory nature of the lock.
Visualizing the state of a CLH lock
I thought I’d start off by showing you what a CLH data structure looks like under lock contention. Below is a visualization of the state of a CLH lock when:
There are three processes (P1, P2, P3)
P1 has the lock
P2 and P3 are waiting on the lock
The order that the lock was requested in is: P1, P2, P3
The paper uses the term requests to refer to the shared variables that hold the granted/pending state of the lock; those are the boxes with the sharp corners. The head of the line is at the left, and the end of the line is at the right, with a tail pointer pointing to the final request in line.
Each process owns two pointers: watch and myreq. These pointers each point to a request variable.
If I am a process, then my watch pointer points to a request that’s owned by the process that’s ahead of me in line. The myreq pointer points to the request that I own, which is the request that the process behind me in line will wait on. Over time, you can imagine the granted value propagating to the right as the processes release the lock.
This structure behaves like a queue, where P1 is at the head and P3 is at the tail. But it isn’t a conventional linked list: the pointers don’t form a chain.
Requesting the lock
When a new process requests the lock, it:
points its watch pointer to the current tail
creates a new pending request object and points to it with its myreq pointer
updates the tail to point to this new object
Releasing the lock
To release the lock, a process sets the value of its myreq request to granted. (It also takes ownership of the watch request, setting myreq to that request, for future use).
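In sketch form, the protocol looks something like the following TLA+ actions (using the names from the figure, calling the tail pointer tail, and eliding the frame conditions):

\* Requesting: atomically swap my request into the tail, remembering the old tail.
Request(p) ==
    /\ requests' = [requests EXCEPT ![myreq[p]] = PENDING]
    /\ watch' = [watch EXCEPT ![p] = tail]
    /\ tail' = myreq[p]

\* A process may take the lock once the request it is watching has been granted.
CanAcquire(p) ==
    requests[watch[p]] = GRANTED

\* Releasing: grant my own request (unblocking the process behind me in line),
\* then take over the request I was watching, for reuse next time.
Release(p) ==
    /\ requests' = [requests EXCEPT ![myreq[p]] = GRANTED]
    /\ myreq' = [myreq EXCEPT ![p] = watch[p]]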
Modeling desired high-level behavior
Before we build our CLH model, let’s think about how we’ll verify if our model is correct. One way to do this is to start off with a description of the desired behavior of our lock without regard to the particular implementation details of CLH.
What we want out of a CLH lock is a lock that preserves the FIFO ordering of the requests. I defined a model I called FifoLock that has just the high-level behavior, namely:
it implements mutual exclusion
threads are granted the lock in first-come-first-served order
Execution path
Here’s the state transition diagram for each thread in my model. The arrows are annotated with the name of the TLA+ sub-actions in the model.
FIFO behavior
To model the FIFO behavior, I used a TLA+ sequence to implement a queue. The Request action appends the thread id to a sequence, and the Acquire action only grants the lock to the thread at the head of the sequence:
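Roughly, those two actions look something like this (a sketch, with the intermediate state names standing in for the ones in the diagram):

Request(t) ==
    /\ state[t] = "ready"
    /\ state' = [state EXCEPT ![t] = "requested"]
    /\ queue' = Append(queue, t)

Acquire(t) ==
    /\ state[t] = "requested"
    /\ queue # <<>>
    /\ t = Head(queue)   \* only the thread at the head of the queue can take the lock
    /\ state' = [state EXCEPT ![t] = "locked"]
    /\ UNCHANGED queue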
Now let’s build an actual model of the CLH algorithm. Note that I called these processes instead of threads because that’s what the original paper does, and I wanted to stay close to the paper in terminology to help readers.
The CLH model relies on pointers. I used a TLA+ function to model pointers. I defined a variable named requests that maps a request id to the value of the request variable. This way, I can model pointers by having processes interact with request values using request ids as a layer of indirection.
---- MODULE CLH ----
...
CONSTANTS NProc, GRANTED, PENDING, X
first == 0 \* Initial request on the queue, not owned by any process
Processes == 1..NProc
RequestIDs == Processes \union {first}
VARIABLES
    requests, \* map request ids to request state
    ...

TypeOk ==
    /\ requests \in [RequestIDs -> {PENDING, GRANTED, X}]
    ...
(I used “X” to mean “don’t care”, which is what Craig uses in his paper).
The way the algorithm works, there needs to be an initial request which is not owned by any of the processes. I used 0 as the request id for this initial request and called it first.
The myreq and watch pointers are modeled as functions that map from a process id to a request id. I used -1 to model a null pointer.
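In sketch form (calling the null value null and the type invariant PointersTypeOk):

VARIABLES myreq, watch

null == -1  \* a request id that no real request uses (requires the module to extend Integers)

PointersTypeOk ==
    /\ myreq \in [Processes -> RequestIDs]
    /\ watch \in [Processes -> RequestIDs \union {null}]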
The state-transition diagram for each process in our CLH model is similar to the one in our abstract (FifoLock) model. The only real difference is that requesting the lock is done in two steps:
Update the request variable to pending (MarkPending)
Append the request to the tail of the queue (EnqueueRequest)
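In sketch form, eliding the per-process state bookkeeping and again calling the tail pointer tail:

\* Step 1: mark my own request as pending.
MarkPending(p) ==
    /\ requests' = [requests EXCEPT ![myreq[p]] = PENDING]
    /\ UNCHANGED <<myreq, watch, tail>>

\* Step 2: atomically swap my request into the tail, watching the old tail.
EnqueueRequest(p) ==
    /\ watch' = [watch EXCEPT ![p] = tail]
    /\ tail' = myreq[p]
    /\ UNCHANGED <<myreq, requests>>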
We want to verify that our CLH model always behaves in a way consistent with our FifoLock model. We can check this using the model checker by specifying what’s called a refinement mapping. We need to show that:
We can create a mapping from the CLH model’s variables to the FifoLock model’s variables.
Every valid behavior of the CLH model satisfies the FifoLock specification under this mapping.
The mapping
The FifoLock model has two variables, state and queue.
The state variable is a function that maps each thread id to the state of that thread in its state machine. The dashed arrows show how I defined the state refinement mapping:
Note that two states in the CLH model (ready, to-enqueue) map to one state in FifoLock (ready).
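In sketch form, calling the CLH model’s per-process state variable pc and passing every other state through unchanged:

\* pc: the per-process state in the CLH model (ready, to-enqueue, ...)
AbstractState ==
    [p \in Processes |->
        IF pc[p] \in {"ready", "to-enqueue"} THEN "ready" ELSE pc[p]]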
For the queue mapping, I need to take the CLH representation (recall it looks like this)…
…and define a transformation so that I end up with a sequence that looks like this, which represents the queue in the FifoLock model:
<<P1, P2, P3>>
To do that, I defined a helper operator called QueueFromTail that, given a tail pointer in CLH, constructs a sequence of ids in the right order.
Unowned(request) ==
    ~ \E p \in Processes : myreq[p] = request

\* Need to reconstruct the queue to do our mapping
RECURSIVE QueueFromTail(_)
QueueFromTail(rid) ==
    IF Unowned(rid) THEN <<>> ELSE
    LET tl == CHOOSE p \in Processes : myreq[p] = rid
        r == watch[tl]
    IN  Append(QueueFromTail(r), tl)
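The Refinement property referenced below is then wired up with an INSTANCE expression, roughly like this (a sketch; if FifoLock’s constants have different names than CLH’s, they would also need WITH substitutions):

\* Instantiate the high-level spec under the mapping
Fifo == INSTANCE FifoLock WITH
            state <- AbstractState,
            queue <- QueueFromTail(tail)

Refinement == Fifo!Spec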
In my CLH.cfg file, I added this line to check the refinement mapping:
PROPERTY
Refinement
Note again what we’re doing here for validation: we’re checking whether our more detailed model faithfully implements the behavior of a simpler model (under a given mapping). We’re using TLA+ to serve two different purposes:
To specify the high-level behavior that we consider correct (our behavioral contract)
To model a specific algorithm (our implementation)
Because TLA+ allows us to choose the granularity of our model, we can use it for both high-level specs of desired behavior and low-level specifications of an algorithm.
Recently, some of my former colleagues wrote a blog post on the Netflix Tech Blog about a particularly challenging performance issue they ran into in production when using the new virtual threads feature of Java 21. Their post goes into a lot of detail on how they conducted their investigation and finally figured out what the issue was, and I highly recommend it. They also include a link to a simple Java program that reproduces the issue.
I thought it would be fun to see if I could model the system in enough detail in TLA+ to be able to reproduce the problem. Ultimately, this was a deadlock issue, and one of the things that TLA+ is good at is detecting deadlocks. While reproducing a known problem won’t find any new bugs, I still found the exercise useful because it helped me get a better understanding of the behavior of the technologies that are involved. There’s nothing like trying to model an existing system to teach you that you don’t actually understand the implementation details as well as you thought you did.
This problem only manifested when the following conditions were all present in the execution:
virtual threads
platform threads
all threads contending on a lock
some virtual threads trying to acquire the lock when inside a synchronized block
some virtual threads trying to acquire the lock when not inside a synchronized block
To be able to model this, we need to understand some details about:
virtual threads
synchronized blocks and how they interact with virtual threads
Java lock behavior when there are multiple threads waiting for the lock
Virtual threads
Virtual threads are very well-explained in JEP 444, and I don’t think I can really improve upon that description. I’ll provide just enough detail here so that my TLA+ model makes sense, but I recommend that you read the original source. (Side note: Ron Pressler, one of the authors of JEP 444, just so happens to be a TLA+ expert).
If you want to write a request-response style application that can handle multiple concurrent requests, then programming in a thread-per-request style is very appealing, because the style of programming is easy to reason about.
The problem with thread-per-request is that each individual thread consumes a non-trivial amount of system resources (most notably, dedicated memory for the call stack of each thread). If you have too many threads, then you can start to see performance issues. But in the thread-per-request model, the number of concurrent requests your application can service is capped by the number of threads. The number of threads your system can run can become the bottleneck.
You can avoid the thread bottleneck by writing in an asynchronous style instead of thread-per-request. However, async code is generally regarded as more difficult for programmers to write than thread-per-request.
Virtual threads are an attempt to keep the thread-per-request programming model while reducing the resource cost of individual threads, so that programs can run comfortably with many more threads.
The general approach of virtual threads is referred to as user-mode threads. Traditionally, threads in Java are wrappers around operating-system (OS) level threads, which means there is a 1:1 relationship between a thread from the Java programmer’s perspective and the operating system’s perspective. With user-mode threads, this is no longer true, which means that virtual threads are much “cheaper”: they consume fewer resources, so you can have many more of them. They still have the same API as OS level threads, which makes them easy to use by programmers who are familiar with threads.
However, you still need the OS level threads (which JEP 444 refers to as platform threads), as we’ll discuss in the next section.
How virtual threads actually run: carriers
For a virtual thread to execute, the scheduler needs to assign a platform thread to it. This platform thread is called the carrier thread, and JEP 444 uses the term mounting to refer to assigning a virtual thread to a carrier thread. The way you set up your Java application to use virtual threads is that you configure a pool of platform threads that will back your virtual threads. The scheduler will assign virtual threads to carrier threads from the pool.
A virtual thread may be assigned to different platform threads throughout the virtual thread’s lifetime: the scheduler is free to unmount a virtual thread from one platform thread and then later mount the virtual thread onto a different platform thread.
Importantly, a virtual thread needs to be mounted to a platform thread in order to run. If it isn’t mounted to a carrier thread, the virtual thread can’t run.
Synchronized
While virtual threads are new to Java, the synchronized statement has been in the language for a long time. From the Java docs:
A synchronized statement acquires a mutual-exclusion lock (§17.1) on behalf of the executing thread, executes a block, then releases the lock. While the executing thread owns the lock, no other thread may acquire the lock.
final class RealSpan extends Span {
  ...
  final PendingSpans pendingSpans;
  final MutableSpan state;

  @Override public void finish(long timestamp) {
    synchronized (state) {
      pendingSpans.finish(context, timestamp);
    }
  }
}
In the example method above, the effect of the synchronized statement is to:
Acquire the lock associated with the state object
Execute the code inside of the block
Release the lock
(Note that in Java, every object is associated with a lock). This type of synchronization API is sometimes referred to as a monitor.
Synchronized blocks are relevant here because they interact with virtual threads, as described below.
Virtual threads, synchronized blocks, and pinning
Synchronized blocks guarantee mutual exclusion: only one thread can execute inside of the synchronized block. In order to enforce this guarantee, the Java engineers made the decision that, when a virtual thread enters a synchronized block, it must be pinned to its carrier thread. This means that the scheduler is not permitted to unmount a virtual thread from a carrier until the virtual thread exits the synchronized block.
ReentrantLock as FIFO
The last piece of the puzzle is related to the implementation details of a particular Java synchronization primitive: the ReentrantLock class. What’s important here is how this class behaves when there are multiple threads contending for the lock, and the thread that currently holds the lock releases it.
The relevant implementation detail is that this lock is a FIFO: it maintains a queue with the identity of the threads that are waiting on the lock, ordered by arrival. As you can see in the code, after a thread releases a lock, the next thread in line gets notified that the lock is free.
VirtualThreads is the set of virtual threads that execute the application logic. CarrierThreads is the set of platform threads that will act as carriers for the virtual threads. To model the failure mode that was originally observed in the blog post, I also needed to have platform threads that will execute the application logic. I called these FreePlatformThreads, because these platform threads are never bound to virtual threads.
I called the threads that directly execute the application logic, whether virtual or (free) platform, “work threads”. The variables in my model are:
schedule – function that describes how work threads are mounted to platform threads
inSyncBlock – the set of virtual threads that are currently inside a synchronized block
pinned – the set of carrier threads that are currently pinned to a virtual thread
lockQueue – a queue of work threads that are currently waiting for the lock
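In sketch form, the declarations look something like this (pc is the program counter discussed below, and NULL is the model value used for a thread that isn’t scheduled):

CONSTANTS VirtualThreads, CarrierThreads, FreePlatformThreads, NULL

\* The threads that directly execute the application logic
WorkThreads == VirtualThreads \union FreePlatformThreads

VARIABLES schedule, inSyncBlock, pinned, lockQueue, pc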
My model only models a single lock, which I believe is this lock in the original scenario. Note that I don’t model the locks associated with the synchronization blocks, because there was no contention associated with those locks in the scenario I’m modeling: I’m only modeling synchronization for the pinning behavior.
Program counters
Each worker thread can take one of two execution paths through the code. Here’s a simplified view of the two possible code paths: the left code path is when a thread enters a synchronized block before it tries to acquire the lock, and the right code path is when a thread does not enter a synchronized block before it tries to acquire the lock.
This is conceptually how my model works, but I modeled the left-hand path a little differently than this diagram. The program counters in my model actually transition back from “to-enter-sync” to “ready”, and they use a separate state variable (inSyncBlock) to determine whether to transition to the to-exit-sync state at the end. The other difference is that the program counters in my model also will jump directly from “ready” to “locked” if the lock is currently free. But showing those state transitions directly would make the above diagram harder to read.
The lock
The FIFO lock is modeled as a TLA+ sequence of work threads. Note that I don’t need to actually model whether the lock is held or not. My model just checks that the thread that wants to enter the critical section is at the head of the queue, and when it leaves the critical section, it gets removed from the head. The relevant actions look like this:
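Here’s a sketch (the action names are illustrative, IsScheduled is defined later in the post, and I’m omitting the fast path where a thread jumps straight from “ready” to “locked” when the lock is free):

\* Join the queue of threads waiting for the lock.
RequestLock(thread) ==
    /\ IsScheduled(thread)
    /\ pc[thread] = "ready"
    /\ lockQueue' = Append(lockQueue, thread)
    /\ pc' = [pc EXCEPT ![thread] = "requested"]
    /\ UNCHANGED <<schedule, pinned, inSyncBlock>>

\* Only the thread at the head of the queue may take the lock.
AcquireLock(thread) ==
    /\ IsScheduled(thread)
    /\ pc[thread] = "requested"
    /\ thread = Head(lockQueue)
    /\ pc' = [pc EXCEPT ![thread] = "locked"]
    /\ UNCHANGED <<schedule, lockQueue, pinned, inSyncBlock>>

\* Leaving the critical section releases the lock: the thread comes off the head of the queue.
ReleaseLock(thread) ==
    /\ IsScheduled(thread)
    /\ pc[thread] = "in-critical-section"
    /\ lockQueue' = Tail(lockQueue)
    /\ pc' = [pc EXCEPT ![thread] =
                 IF thread \in inSyncBlock THEN "to-exit-sync" ELSE "ready"]
    /\ UNCHANGED <<schedule, pinned, inSyncBlock>>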
To check if a thread has the lock, I just check if it’s in a state that requires the lock:
\* A thread has the lock if its pc is in one of the states that require the lock
HasTheLock(thread) == pc[thread] \in {"locked", "in-critical-section"}
I also defined this invariant that I checked with TLC to ensure that only the thread at the head of the lock queue can be in any of these states.
HeadHasTheLock ==
    \A t \in WorkThreads : HasTheLock(t) => t = Head(lockQueue)
Pinning in synchronized blocks
When a virtual thread enters a synchronized block, the corresponding carrier thread gets marked as pinned. When it leaves a synchronized block, it gets unmarked as pinned.
\* We only care about virtual threads entering synchronized blocks
EnterSynchronizedBlock(virtual) ==
    /\ pc[virtual] = "to-enter-sync"
    /\ pinned' = pinned \union {schedule[virtual]}
    ...

ExitSynchronizedBlock(virtual) ==
    /\ pc[virtual] = "to-exit-sync"
    /\ pinned' = pinned \ {schedule[virtual]}
The scheduler
My model has an action named Mount that mounts a virtual thread to a carrier thread. In my model, this action can fire at any time, whereas in the actual implementation of virtual threads, I believe that thread scheduling only happens at blocking calls. Here’s what it looks like:
\* Mount a virtual thread to a carrier thread, bumping the other thread.
\* We can only do this when the carrier is not pinned, and when the virtual
\* thread is not already pinned to its current carrier.
\* We need to unschedule the previous thread.
Mount(virtual, carrier) ==
    LET carrierInUse == \E t \in VirtualThreads : schedule[t] = carrier
        prev == CHOOSE t \in VirtualThreads : schedule[t] = carrier
    IN  /\ carrier \notin pinned
        /\ \/ schedule[virtual] = NULL
           \/ schedule[virtual] \notin pinned
        \* If a virtual thread has the lock already, then don't pre-empt it.
        /\ carrierInUse => ~HasTheLock(prev)
        /\ schedule' = IF carrierInUse
                       THEN [schedule EXCEPT ![virtual]=carrier, ![prev]=NULL]
                       ELSE [schedule EXCEPT ![virtual]=carrier]
        /\ UNCHANGED <<lockQueue, pc, pinned, inSyncBlock>>
My scheduler also won’t pre-empt a thread that is holding the lock. That’s because, in the original blog post, nobody was holding the lock during deadlock, and I wanted to reproduce that specific scenario. So, if a thread gets a lock, the scheduler won’t stop it from releasing the lock. And, yet, we can still get a deadlock!
Only executing while scheduled
Every action is only allowed to fire if the thread is scheduled. I implemented it like this:
IsScheduled(thread) == schedule[thread] # NULL

\* Every action has this pattern:
Action(thread) ==
    /\ IsScheduled(thread)
    ...
Finding the deadlock
Here’s the config file I used to run a simulation. I defined three virtual threads, two carrier threads, and one free platform thread.
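It looks roughly like this (a sketch: the model value names V1–V3, C1, and C2 match the trace below; the free platform thread’s name and the SPECIFICATION and INVARIANT entries are illustrative):

SPECIFICATION Spec
CONSTANTS
    VirtualThreads = {V1, V2, V3}
    CarrierThreads = {C1, C2}
    \* F1 is an illustrative name for the free platform thread
    FreePlatformThreads = {F1}
    NULL = NULL
INVARIANT HeadHasTheLock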
And, indeed, the model checker finds a deadlock! Here’s what the trace looks like in the Visual Studio Code plugin.
Note how virtual thread V2 is in the “requested” state according to its program counter, and V2 is at the head of the lockQueue, which means that it is able to take the lock, but the schedule shows “V2 :> NULL”, which means that the thread is not scheduled to run on any carrier threads.
And we can see in the “pinned” set that all of the carrier threads {C1,C2} are pinned. We also see that the threads V1 and V3 are behind V2 in the lock queue. They’re blocked waiting for the lock, and they’re holding on to the carrier threads, so V2 will never get a chance to execute.