Futexes in TLA+

Justine Tunney recently wrote a blog post titled The Fastest Mutexes, where she describes how she implemented mutexes in Cosmopolitan Libc. The post discusses how her implementation uses futexes by way of Mike Burrows’s nsync library. From her post:

nsync enlists the help of the operating system by using futexes. This is a great abstraction invented by Linux some years ago, that quickly found its way into other OSes. On MacOS, futexes are called ulock. On Windows, futexes are called WaitOnAddress(). The only OS Cosmo supports that doesn’t have futexes is NetBSD, which implements POSIX semaphores in kernelspace, and each semaphore sadly requires creating a new file descriptor. But the important thing about futexes and semaphores is they allow the OS to put a thread to sleep. That’s what lets nsync avoid consuming CPU time when there’s nothing to do.

Before I read this post, I had no idea what futexes were or how they worked. I figured a good way to learn would be to model them in TLA+.

Note: I’m going to give a simplified account of how futexes work. In addition, I’m not an expert on this topic. In particular, I’m not a kernel programmer, so there may be important details I get wrong here.

Mutexes

Readers who have done multithreaded programming before are undoubtedly familiar with mutexes: they are a type of lock that lets the programmer enforce mutual exclusion, so that we can guarantee that at most one thread at a time accesses a resource, such as a shared variable.

The locking mechanism is implemented by the operating system: locking and unlocking the mutex ultimately involves a system call. If you were programming in C on a Unix-y system like Linux, you’d use the pthreads API to access mutex objects. Whichever pthreads implementation you use (e.g., glibc, musl) makes the relevant system calls for you.

#include <pthread.h>
...
// create a mutex
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&mutex);
// mutex is now locked

pthread_mutex_unlock(&mutex);
// mutex is now unlocked

Modeling a mutex in TLA+

Before we get into futexes, let’s start off by modeling the desired behavior of a mutex in TLA+. I’ll use the PlusCal algorithm language for my model, which can be translated directly to a TLA+ model (see my mutex.tla file).

---- MODULE mutex ----
CONSTANT Processes

(*--algorithm MutualExclusion

variables lock = {};

process proc \in Processes 
begin

ncs:  skip;
acq:  await lock = {};
      lock := {self};
cs:   skip;
rel:  lock := lock \ {self};
      goto ncs;
end process;

end algorithm;
*)

Modeling threads (Processes)

This model defines a fixed set of Processes. You should really think of these as threads, but there’s a convention in academia of referring to them as processes, so I’ve followed that convention. You can think of Processes in this model as a set of thread ids.

Modeling the lock

This model has only one explicit variable, named lock, which is my model of the mutex lock. I’ve modeled it as a set: when the lock is free, the set is empty, and when the lock is held by a process, the set contains a single element, that process’s id.

Process states

PlusCal models also have an implicit variable, pc, for “program counter”. The program counter indicates which state each process is in.

Each process can be in one of four states. We need to give a label to each of these states in our model.

Label   Description
ncs     in the non-critical section (lock is not held)
acq     waiting to acquire the lock
cs      in the critical section (lock is held)
rel     releasing the lock

We want to make sure that two processes are never in the critical section at the same time. We can express this desired property in TLA+ like this:

InCriticalSection(p) == pc[p] = "cs"
MutualExclusion == \A p1,p2 \in Processes : InCriticalSection(p1) /\ InCriticalSection(p2) => p1 = p2

We can specify MutualExclusion as an invariant and use the TLC model checker to verify that our specification satisfies this property. Check out mutex.cfg for how to configure TLC to check the invariant.
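
For reference, here’s a minimal sketch of what such a TLC configuration might contain (the actual mutex.cfg may differ; the process ids are model values I chose for illustration):

SPECIFICATION Spec
CONSTANT Processes = {p1, p2, p3}
INVARIANT MutualExclusion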

Overhead of traditional mutexes

With a traditional mutex, the application makes a system call every time the mutex is locked or unlocked. If the common usage pattern for an application is that only one thread ever tries to take the mutex, then you’re paying a performance penalty for executing those system calls. If the mutex is being acquired and released in a tight loop, the time spent servicing the system calls could presumably be a substantial fraction of the loop’s execution time.

for(...) {
  pthread_mutex_lock(&mutex);
  // do work here
  pthread_mutex_unlock(&mutex);
}

I don’t know how high these overheads are in practice, but Justine Tunney provides some numbers in her blog post as well as a link to Mark Waterman’s mutex shootout with more performance numbers.

Ideally, we would only pay the performance penalty for system calls when the mutex is under contention, i.e., when multiple threads are trying to acquire the lock.

Futexes: avoid the syscalls where possible

Futexes provide a mechanism for avoiding system calls for taking a mutex lock in cases where the lock is not under contention.

More specifically, futexes provide a set of low-level synchronization primitives. These primitives allow library authors to implement mutexes in such a way that they avoid making system calls when possible. Application programmers don’t interact with futexes directly; they’re hidden behind APIs like pthread_mutex_lock.

Primitives: wait and wake

The two primary futex primitives are a wait system call and a wake system call. Each of them takes an integer pointer as an argument, which I’ll call futex. Here’s a simplified C function prototype for each of them.

void futex_wait(int *futex); 
void futex_wake(int *futex);

(Actual versions of these return status codes, but I won’t be using those return codes in this blog post).

Note that the futex_wait prototype shown above is incomplete: it needs to take an additional argument to guarantee correctness, but we’ll get to that later. I want to start off by providing some intuition on how to use futexes.

When a thread calls the futex_wait system call, the kernel puts the thread to sleep until another thread calls futex_wake with the same futex argument.

Using primitives to build mutexes that avoid system calls

OK, so how do we actually use these things to create mutexes?

Here’s a naive (i.e., incorrect) implementation of lock and unlock functions that implement mutual exclusion using the futex calls. The lock function checks whether the lock is free. If it is, it takes the lock; otherwise, it waits to be woken up and then tries again.

#include <stdbool.h>

#define FREE 0
#define ACQUIRED 1
#define CONTENDED 2

void futex_wait(int *futex);
void futex_wake(int *futex);

/**
 * Try to acquire the lock. On failure, wait and then try again.
 */
void lock(int *futex) {
    bool acquired = false;
    while (!acquired) {
        if (*futex == FREE) {
            *futex = ACQUIRED;
            acquired = true;
        }
        else {
            *futex = CONTENDED;
            futex_wait(futex);
        }
    }
}

/**
 * Release lock, wake threads that are waiting on lock
 */
void unlock(int *futex) {
    int prev = *futex;
    *futex = FREE;

    if(prev == CONTENDED) {
        futex_wake(futex);
    }
}

Note that futex is just an ordinary pointer. In the fast path, the lock function just sets the futex to ACQUIRED; no system call is necessary. It’s only when the futex is not free that it has to make the futex_wait system call.

Similarly, on the unlock side, it’s only when the futex is in the CONTENDED state that the futex_wake system call happens.

Now’s a good time to point out that futex is short for fast userspace mutex. The reason it’s fast is that we can (sometimes) avoid system calls. And the reason we can avoid system calls is that, in the fast path, the threads coordinate by modifying a memory location that is accessible in userspace. By userspace, we mean that our futex variable is just an ordinary pointer that the threads all have direct access to: no system call is required to modify it.

By contrast, when we call futex_wait and futex_wake, the kernel needs to read and modify kernel data structures, hence a system call is required.

The code above should provide you with an intuition of how futexes are supposed to work. The tricky part is writing the algorithm in such a way as to guarantee correctness for all possible thread schedules. There’s a reason that Ulrich Drepper wrote a paper titled Futexes are Tricky: it’s easy to get the lock/unlock methods wrong.

Why futex_wait needs another argument

There are many potential race conditions in our initial lock/unlock implementation, but let’s focus on one in particular: the futex gets freed after the lock function observes that it isn’t free, but before it calls futex_wait.

Here’s what the scenario looks like (think of the red arrows as being like breakpoints):

We need to prevent the situation where the unlock thread completes after the *futex == FREE check but before the futex_wait system call.
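
Concretely, the lost wake-up interleaving looks like this (my own sketch):

Thread A: lock()                        Thread B: unlock()
----------------                        ------------------
sees *futex != FREE
sets *futex = CONTENDED
                                        sets *futex = FREE
                                        calls futex_wake (nobody is waiting yet)
calls futex_wait
sleeps forever: the wake-up already happened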

We can prevent this by modifying the futex_wait function prototype so that the caller passes in the value it expects the futex to have. It looks like this:

void futex_wait(int *futex, int val);

The lock code then looks like this instead:

void lock(int *futex) {
    if(*futex == FREE) {
        *futex = ACQUIRED;
    } else {
        *futex = CONTENDED;
        futex_wait(futex, CONTENDED);
        // try again after waking up
        lock(futex); 
    }
}

The futex_wait system call will check to ensure that *futex == val. So, if *futex has been changed, the call will return immediately rather than sleeping.

Here’s the FUTEX_WAIT section of the Linux futex man page, which hopefully should be clear now.

FUTEX_WAIT (since Linux 2.6.0)
This operation tests that the value at the futex word pointed to by the address uaddr still contains the expected value val, and if so, then sleeps waiting for a FUTEX_WAKE operation on the futex word. The load of the value of the futex word is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with the expected value, and starting to sleep are performed atomically and totally ordered with respect to other futex operations on the same futex word. If the thread starts to sleep, it is considered a waiter on this futex word. If the futex value does not match val, then the call fails immediately with the error EAGAIN.
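
As an aside, Linux provides no libc wrapper for futex(2), so if you want to experiment with the real system call, you invoke it through syscall(2). Here’s a minimal sketch of wrappers matching the simplified prototypes above (error handling omitted; this is my own illustration, not code from the post):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep until woken, but only if *futex still equals val:
 * the kernel performs that check atomically. */
void futex_wait(int *futex, int val) {
    syscall(SYS_futex, futex, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* Wake at most one thread that is waiting on this futex. */
void futex_wake(int *futex) {
    syscall(SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
}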

Atomic operations

In order for the lock/unlock implementations to guarantee correctness, we need to rely on what are called atomic operations when reading and modifying the futex across multiple threads. These are operations that the hardware guarantees can be performed atomically, so that there are no possible race conditions.

In my futex model, I assumed the existence of three atomic operations:

  1. atomic store
  2. atomic exchange
  3. atomic compare and exchange

Atomic store isn’t very interesting: it just says that we can atomically set the value of a variable, i.e., that when we do something like this, it happens atomically.

*futex = FREE

In PlusCal, atomic stores are just modeled as assigning a value to a variable, so there’s not much else to say.

Atomic exchange

Atomic exchange looks like this:

oldval = atomic_exchange(x, newval)

You give atomic exchange two arguments: a variable (x) you want to modify, and the new value (newval) you want it to have. The atomic_exchange function will atomically set x=newval and return the value x had before the assignment.

In PlusCal, I modeled it as a macro. Macros can’t return values, so we need to pass in oldval as an argument.

macro atomic_exchange(x, oldval, newval) begin
    oldval := x;
    x := newval;
end macro;

Then I can invoke it like this in my PlusCal model:

atomic_exchange(mem[a], uprev, Free);

And the resulting behavior is:

uprev := mem[a];
mem[a] := Free;

Atomic compare and exchange

Atomic compare and exchange is sometimes called compare-and-swap (CAS). It looks like this:

oldval = atomic_compare_exchange(x, expecting, newval)

It’s similar to atomic exchange, except that it only performs the exchange if the variable x contains the value expecting. Basically, it’s an atomic version of this:

oldval = x;
if (x == expecting)
  x = newval;

In my PlusCal model, I also implemented it as a macro:

macro atomic_compare_exchange(x, oldval, expecting, newval) begin
    oldval := x;
    if x = expecting then 
        x := newval;
    end if;
end macro;
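
As an aside, these three primitives map directly onto C11’s <stdatomic.h> operations. This mapping is my own addition for grounding, not part of the model:

#include <stdatomic.h>

#define FREE 0
#define ACQUIRED 1
#define CONTENDED 2

atomic_int futex = FREE;

void demo(void) {
    // atomic store
    atomic_store(&futex, FREE);

    // atomic exchange: atomically set a new value, return the old one
    int oldval = atomic_exchange(&futex, ACQUIRED);

    // atomic compare-and-exchange: store CONTENDED only if futex currently
    // holds the value in `expected`; `expected` ends up with the prior value
    int expected = oldval;
    atomic_compare_exchange_strong(&futex, &expected, CONTENDED);
}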

Futex modeling basics

Here are the basic constants and variables in my model:

CONSTANTS Processes, Addresses, Free, Acquired, Contended

(*--algorithm futex

variables
    mem = [x \in Addresses |-> Free],
    a \in Addresses,
    ...

In addition to modeling Processes, my futex model also models a set of memory addresses as Addresses. I also defined three constants, Free, Acquired, and Contended, which play the same roles as FREE, ACQUIRED, and CONTENDED in the example C code above.

I model memory as a function (mem) that maps addresses to values, as well as a specific memory address (a). I use a as my futex.

At the top level, the algorithm should look familiar: it’s basically the same as the mutex.tla one, except that it calls acquire_lock and release_lock procedures.

process proc \in Processes
begin
    ncs: skip;
    acq: call acquire_lock();
    cs:  skip;
    rel: call release_lock();
         goto ncs;
end process;

acquire_lock: implementing lock

I based my implementation of the lock on a simplified version of the one in Ulrich Drepper’s Futexes are Tricky paper.

Here’s the model for acquiring the lock. It doesn’t render particularly well in WordPress, so you may want to view the futex.tla file on GitHub directly.

procedure acquire_lock()
variable lprev;
begin
           \* Attempt to acquire the lock
Lcmpx1:    atomic_compare_exchange(mem[a], lprev, Free, Acquired);
Ltest:     while lprev # Free do
              \* Mark the lock as contended, assuming it's in the acquired state
Lcmpx2:       atomic_compare_exchange(mem[a], lprev, Acquired, Contended);
              if lprev # Free then
call_wait:        call futex_wait(a, Contended);
              end if;
              \* Attempt to acquire the lock again
Lcmpx3:       atomic_compare_exchange(mem[a], lprev, Free, Contended);
           end while;
            \* if we reach here, we have the lock
Lret:      return;
end procedure;

Note that we need three separate atomic_compare_exchange calls to implement this (each with different arguments!). Yipe!

release_lock: implementing unlock

The unlock implementation is much simpler. We just set the futex to Free and then wake a waiter.

procedure release_lock()
variable uprev;
begin
u_xch:  atomic_exchange(mem[a], uprev, Free);
u_wake: if uprev = Contended then
           call futex_wake(a);
        end if;
u_ret:  return;
end procedure;

Modeling the futex_wait/futex_wake calls

Finally, it’s not enough in our model to just invoke futex_wait and futex_wake; we need to model the behavior of these calls as well! I’ll say a little bit about the relevant data structures in the Linux kernel and how I modeled them.

Kernel data structures

The Linux kernel hashes futexes into buckets, and each of these buckets is associated with a futex_hash_bucket structure. For the purposes of our TLA+ model, the fields we care about are:

  • a queue of threads (tasks) that are waiting on the futex
  • a lock for protecting the structure

The kernel also uses a structure called a wake-queue (see here for more details) to keep a list of the tasks that have been selected to be woken up. I modeled this list of threads-to-be-woken as a set.

Here are the variables:

variables
...
    waitq = [x \in Addresses |-> <<>>], \* a map of addresses to wait queues
    qlock = {},  \* traditional mutex lock used by the kernel on the wait queue.
    wake = {}; \* processes that have been sent a signal to wake up

I made the following assumptions to simplify my model:

  • every futex (address) is associated with its own wait queue, rather than being hashed into buckets
  • a single global lock is used instead of a per-queue lock

futex_wait

Here’s the basic futex_wait algorithm (links go to the Linux source code):

  1. Take the lock
  2. Retrieve the current value of the futex.
  3. Check if the passed value matches the futex value. If not, release the lock and return.
  4. Add the thread to the wait queue
  5. Release the lock
  6. Wait to be woken up

Note: I assume that calling schedule() inside of the kernel at this point has the effect of putting the thread to sleep, but as I said earlier, I’m not a kernel programmer so I’m not familiar with this code at all.

Here’s my PlusCal model:

procedure futex_wait(addr, val)
begin
wt_acq:       await qlock = {};
              qlock := {self};
wt_valcheck:  if val /= mem[addr] then
                qlock := {};
                return;
              end if;
              \* Add the process to the wait queue for this address
wt_enq:       waitq[addr] := Append(waitq[addr], self);
              qlock := {};
wt_wait:      await self \in wake;
              wake := wake \ {self};
              return;
end procedure;

futex_wake

Here’s what the futex_wake algorithm looks like:

  1. Take the lock
  2. Add the tasks to be woken up to the wake queue
  3. Release the lock
  4. Wake the tasks on the wake queue

Here’s my PlusCal model of it. I chose to wake only one process in my model, but we could have woken all of the waiting processes.

procedure futex_wake(addr)
variables nxt = {};
begin
wk_acq:   await qlock = {};
          qlock := {self};
wk_deq:   if waitq[addr] /= <<>> then
              nxt := {Head(waitq[addr])};
              waitq[addr] := Tail(waitq[addr]);
          end if;
wk_rel:   qlock := {};
wk_wake:  wake := wake \union nxt;
          return;
end procedure;

You can see the full model on GitHub at futex.tla.

Checking for properties

One of the reasons to model in TLA+ is to check properties of the specification. I care about three things with this specification:

  1. It implements mutual exclusion
  2. It doesn’t make system calls when there’s no contention
  3. Processes can’t get stuck waiting on the queue forever

Mutual exclusion

We check mutual exclusion the same way we did in our mutex.tla specification, by asserting that there are never two different processes in the critical section at the same time. This is our invariant.

MutualExclusion == \A p1,p2 \in Processes : pc[p1]="cs" /\ pc[p2]="cs" => p1=p2

No contention means no system calls

The whole point of using futexes to implement locks is to avoid system calls in the cases where there’s no contention. But even if our algorithm satisfies mutual exclusion, that doesn’t mean that it avoids these system calls.

I wrote an invariant for the futex_wait system call that asserts we only make the system call when there’s contention. I called the invariant OnlyWaitUnderContention, and here’s how I defined it, along with several helper definitions:

LockIsHeld == mem[a] /= Free
ProcessAttemptingToAcquireLock(p) == pc[p] \in {"Lcmpx1", "Ltest", "Lcmpx2", "call_wait", "Lcmpx3", "wt_acq", "wt_valcheck", "wt_enq", "wt_wait"}
Contention == LockIsHeld /\ \E p \in Processes : ProcessAttemptingToAcquireLock(p)
OnlyWaitUnderContention == \A p \in Processes : pc[p]="call_wait" => Contention

Nobody gets stuck waiting

Recall from earlier in the blog post how we had to modify the prototype of the futex_wait system call to take an additional argument, in order to prevent a race condition that could leave a process waiting forever on the queue.

We want to make sure that we have actually addressed that risk. Note that the comments in the Linux source code specifically call out this risk.

I checked this by defining an invariant stating that it can never happen that a process is waiting while all of the other processes are past the point where they could wake the waiter up.

Stuck(x) == /\ pc[x] = "wt_wait"
            /\ x \notin wake
            /\ \A p \in Processes \ {x} : \/ pc[p] \in {"ncs", "u_ret"}
                                          \/ /\ pc[p] \in {"wk_rel", "wk_wake"}
                                             /\ x \notin nxt[p]

NoneStuck == ~ \E x \in Processes : Stuck(x)

Final check: refinement

In addition to checking mutual exclusion, we can check that our futex-based lock model (futex.tla) implements our original high-level mutex model (mutex.tla) by means of a refinement mapping.

To do that, we need to define mappings between the futex model variables and the mutex model variables. The mutex model has two variables:

  • lock – the model of the lock
  • pc – the program counters for the processes

I called my mappings lockBar and pcBar. Here’s what the mapping looks like:

InAcquireLock(p) == pc[p] \in {"Lcmpx1", "Ltest", "Lcmpx2", "call_wait", "Lcmpx3", "Lret"}
InFutexWait(p) == pc[p] \in {"wt_acq", "wt_valcheck", "wt_enq", "wt_wait"}
InReleaseLockBeforeRelease(p) == pc[p] \in {"u_xch"}
InReleaseLockAfterRelease(p) == pc[p] \in {"u_wake", "u_ret"}
InFutexWake(p) == pc[p] \in {"wk_acq", "wk_deq", "wk_rel", "wk_wake"}

lockBar == {p \in Processes: \/ pc[p] \in {"cs", "rel"}
                             \/ InReleaseLockBeforeRelease(p)}


pcBar == [p \in Processes |->
            CASE pc[p] = "ncs"                 -> "ncs"
              [] pc[p] = "cs"                  -> "cs"
              [] pc[p] = "acq"                 -> "acq"
              [] InAcquireLock(p)              -> "acq"
              [] InFutexWait(p)                -> "acq"
              [] pc[p] = "rel"                 -> "rel"
              [] InReleaseLockBeforeRelease(p) -> "rel"
              [] InReleaseLockAfterRelease(p)  -> "ncs"
              [] InFutexWake(p)                -> "ncs"
]

mutex == INSTANCE mutex WITH lock <- lockBar, pc <- pcBar

We can then define a property that says that our futex specification implements the mutex specification:

ImplementsMutex == mutex!Spec

Finally, in our futex.cfg file, we can specify that we want to check the invariants, as well as this behavioral property. The relevant config lines look like this:

INVARIANT
    MutualExclusion
    OnlyWaitUnderContention
    NoneStuck

PROPERTY 
    ImplementsMutex
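
If you want to run the check yourself, the command-line TLC invocation looks something like this (assuming tla2tools.jar is in the current directory):

java -cp tla2tools.jar tlc2.TLC -config futex.cfg futex.tla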

You can find my repo with these models at https://github.com/lorin/futex-tla.

You can specify even when you can’t implement

The other day, The Future of TLA+ (pdf) hit Hacker News. TLA+ is a specification language: it is intended for describing the desired behavior of a system. Because it’s a specification language, you don’t need to specify implementation details to describe desired behavior. This can be confusing to experienced programmers who are newcomers to TLA+, because when you implement something in a programming language, you can always use either a compiler or an interpreter to turn it into an executable. But specs written in a specification language aren’t executable; they’re descriptions of permitted behaviors.

Just for fun, here’s an example: a TLA+ specification for a program that solves the twin prime conjecture. According to this spec, when the program runs, it will set the answer variable to either “yes” if the twin prime conjecture is true, or “no” if the conjecture is false.
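
Here’s a sketch of the idea (my own reconstruction; the spec linked above may differ in its details):

---- MODULE TwinPrimes ----
EXTENDS Naturals

IsPrime(n) == n > 1 /\ \A d \in 2..(n-1) : n % d # 0

\* The conjecture: there are infinitely many primes p such that p + 2 is also prime
TwinPrimeConjecture ==
    \A n \in Nat : \E p \in Nat : p > n /\ IsPrime(p) /\ IsPrime(p + 2)

VARIABLE answer

Init == answer = "unset"
Next == answer' = IF TwinPrimeConjecture THEN "yes" ELSE "no"
====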

Unfortunately, if I were to try to write a program that implements this spec, I wouldn’t know whether answer should be set to “yes” or “no”: the conjecture has yet to be proven or disproven. Despite this, I have still formally specified its behavior!

Modeling a CLH lock in TLA+

The last post I wrote mentioned in passing how Java’s ReentrantLock class is implemented as a modified version of something called a CLH lock. I’d never heard of a CLH lock before, and so I thought it would be fun to learn more about it by trying to model it. My model here is based on the implementation described in section 3.1 of the paper Building FIFO and Priority-Queueing Spin Locks from Atomic Swap by Travis Craig.

This type of lock is a first-in-first-out (FIFO) spin lock. FIFO means that processes acquire the lock in the order in which they requested it, also known as first-come-first-served. A spin lock is a type of lock where the thread runs in a loop, continually checking the state of a variable to determine whether it can take the lock. (Note that Java’s ReentrantLock is not a spin lock, despite being based on CLH).

Also note that I’m going to use the terms process and thread interchangeably in this post. The literature generally uses the term process, but I prefer thread because of the shared-memory nature of the lock.

Visualizing the state of a CLH lock

I thought I’d start off by showing you what a CLH data structure looks like under lock contention. Below is a visualization of the state of a CLH lock when:

  • There are three processes (P1, P2, P3)
  • P1 has the lock
  • P2 and P3 are waiting on the lock
  • The order that the lock was requested in is: P1, P2, P3

The paper uses the term requests to refer to the shared variables that hold the granted/pending state of the lock; those are the boxes with the sharp corners. The head of the line is at the left, and the end of the line is at the right, with a tail pointer pointing to the final request in line.

Each process owns two pointers: watch and myreq. These pointers each point to a request variable.

If I am a process, then my watch pointer points to a request that’s owned by the process that’s ahead of me in line. The myreq pointer points to the request that I own, which is the request that the process behind me in line will wait on. Over time, you can imagine the granted value propagating to the right as the processes release the lock.

This structure behaves like a queue, where P1 is at the head and P3 is at the tail. But it isn’t a conventional linked list: the pointers don’t form a chain.

Requesting the lock

When a new process requests the lock, it:

  1. points its watch pointer to the current tail
  2. creates a new pending request object and points to it with its myreq pointer
  3. updates the tail to point to this new object

Releasing the lock

To release the lock, a process sets the value of its myreq request to granted. (It also takes ownership of the watch request, setting myreq to that request for future use.)
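
Putting the request and release steps together, here’s a C-flavored sketch of the protocol. This is my own illustration rather than code from the paper, and it assumes sequentially consistent atomics; a production implementation would need to think carefully about memory ordering.

#include <stdatomic.h>

#define PENDING 0
#define GRANTED 1

typedef struct { _Atomic int state; } request_t;

typedef struct {
    request_t *myreq;  // the request I own; the process behind me waits on it
    request_t *watch;  // the request owned by the process ahead of me
} proc_t;

void clh_acquire(proc_t *p, request_t *_Atomic *tail) {
    p->myreq->state = PENDING;                   // my request starts out pending
    p->watch = atomic_exchange(tail, p->myreq);  // atomically: watch the old tail,
                                                 // make my request the new tail
    while (p->watch->state != GRANTED) {}        // spin until my predecessor grants
}

void clh_release(proc_t *p) {
    request_t *reusable = p->watch;
    p->myreq->state = GRANTED;  // grant the lock to whoever is behind me
    p->myreq = reusable;        // take ownership of the watched request for reuse
}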

Modeling desired high-level behavior

Before we build our CLH model, let’s think about how we’ll verify that our model is correct. One way to do this is to start off with a description of the desired behavior of our lock, without regard to the particular implementation details of CLH.

What we want out of a CLH lock is a lock that preserves the FIFO ordering of requests. I defined a model I called FifoLock that captures just the high-level behavior, namely:

  • it implements mutual exclusion
  • threads are granted the lock in first-come-first-served order

Execution path

Here’s the state transition diagram for each thread in my model. The arrows are annotated with the names of the TLA+ sub-actions in the model.

FIFO behavior

To model the FIFO behavior, I used a TLA+ sequence to implement a queue. The Request action appends the thread id to a sequence, and the Acquire action only grants the lock to the thread at the head of the sequence:

Request(thread) == 
    /\ queue' = Append(queue, thread)
    ...

Acquire(thread) ==
    /\ Head(queue) = thread
   ...

The FifoLock behavior model can be found at https://github.com/lorin/clh-tla/blob/main/FifoLock.tla.

The CLH model

Now let’s build an actual model of the CLH algorithm. Note that I called these processes instead of threads because that’s what the original paper does, and I wanted to stay close to the paper in terminology to help readers.

The CLH model relies on pointers. I used a TLA+ function to model pointers. I defined a variable named requests that maps a request id to the value of the request variable. This way, I can model pointers by having processes interact with request values using request ids as a layer of indirection.

---- MODULE CLH ----
...
CONSTANTS NProc, GRANTED, PENDING, X

first == 0 \* Initial request on the queue, not owned by any process
Processes == 1..NProc
RequestIDs == Processes \union {first}

VARIABLES
    requests, \* map request ids to request state
    ...

TypeOk ==
    /\ requests \in [RequestIDs -> {PENDING, GRANTED, X}]
   ...

(I used “X” to mean “don’t care”, which is what Craig uses in his paper).

The way the algorithm works, there needs to be an initial request which is not owned by any of the processes. I used 0 as the request id for this initial request and called it first.

The myreq and watch pointers are modeled as functions that map from a process id to a request id. I used -1 to model a null pointer.

NIL == -1

TypeOk ==
    /\ myreq \in [Processes -> RequestIDs]
    /\ watch \in [Processes -> RequestIDs \union {NIL}]
    ...

Execution path

The state-transition diagram for each process in our CLH model is similar to the one in our abstract (FifoLock) model. The only real difference is that requesting the lock is done in two steps, sketched below:

  1. Update the request variable to pending (MarkPending)
  2. Append the request to the tail of the queue (EnqueueRequest)
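
To make the split concrete, here’s my guess at the shape of those two actions. The names match the diagram, but the bodies are my own sketch; see CLH.tla for the real definitions:

MarkPending(p) ==
    /\ requests' = [requests EXCEPT ![myreq[p]] = PENDING]
    ...

EnqueueRequest(p) ==
    /\ watch' = [watch EXCEPT ![p] = tail]
    /\ tail' = myreq[p]
    ...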

The CLH model can be found at https://github.com/lorin/clh-tla/blob/main/CLH.tla.

Checking our CLH model’s behavior

We want to verify that our CLH model always behaves in a way consistent with our FifoLock model. We can check this using the model checker by specifying what’s called a refinement mapping. We need to show that:

  1. We can create a mapping from the CLH model’s variables to the FifoLock model’s variables.
  2. Every valid behavior of the CLH model satisfies the FifoLock specification under this mapping.

The mapping

The FifoLock model has two variables, state and queue.

---- MODULE FifoLock ----
...
VARIABLES state, queue

TypeOK == /\ state \in [Threads -> {"ready", "requested", "acquired", "in-cs"}]
          /\ queue \in Seq(Threads)

The state variable is a function that maps each thread id to the state of that thread in its state machine. The dashed arrows show how I defined the state refinement mapping:

Note that two states in the CLH model (ready, to-enqueue) map to one state in FifoLock (ready).

For the queue mapping, I need to take the CLH representation (recall it looks like this)…

…and define a transformation so that I end up with a sequence that looks like this, which represents the queue in the FifoLock model:

<<P1, P2, P3>>

To do that, I defined a helper operator called QueueFromTail that, given a tail pointer in CLH, constructs a sequence of ids in the right order.

Unowned(request) ==
    ~ \E p \in Processes : myreq[p] = request

\* Need to reconstruct the queue to do our mapping
RECURSIVE QueueFromTail(_)
QueueFromTail(rid) ==
    IF Unowned(rid) THEN <<>> ELSE 
        LET tl == CHOOSE p \in Processes : myreq[p] = rid
            r == watch[tl]
        IN Append(QueueFromTail(r), tl)
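
For the three-process example pictured earlier, the recursion walks backward from the tail via each process’s watch pointer and stops at the first request, which is unowned: QueueFromTail(myreq[P3]) evaluates to Append(Append(Append(<<>>, P1), P2), P3), i.e., <<P1, P2, P3>>.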

Then I can define the mapping like this:

Mapping == INSTANCE FifoLock
    WITH queue <- QueueFromTail(tail),
         Threads <- Processes,
         state <- [p \in Processes |-> 
            CASE state[p]="ready" -> "ready"
              [] state[p]="to-enqueue" -> "ready"
              [] state[p]="waiting" -> "requested"
              [] state[p]="acquired" -> "acquired"
              [] state[p]="in-cs" -> "in-cs"]

Refinement == Mapping!Spec

In my CLH.cfg file, I added this line to check the refinement mapping:

PROPERTY 
    Refinement

Note again what we’re doing here for validation: we’re checking whether our more detailed model faithfully implements the behavior of a simpler model (under a given mapping). We’re using TLA+ to serve two different purposes:

  1. To specify the high-level behavior that we consider correct (our behavioral contract)
  2. To model a specific algorithm (our implementation)

Because TLA+ allows us to choose the granularity of our model, we can use it for both high-level specs of desired behavior and low-level specifications of an algorithm.

You can find my model at https://github.com/lorin/clh-tla

Reproducing a Java 21 virtual threads deadlock scenario with TLA+

Recently, some of my former colleagues wrote a blog post on the Netflix Tech Blog about a particularly challenging performance issue they ran into in production when using the new virtual threads feature of Java 21. Their post goes into a lot of detail on how they conducted their investigation and finally figured out what the issue was, and I highly recommend it. They also include a link to a simple Java program that reproduces the issue.

I thought it would be fun to see if I could model the system in enough detail in TLA+ to be able to reproduce the problem. Ultimately, this was a deadlock issue, and one of the things that TLA+ is good at is detecting deadlocks. While reproducing a known problem won’t find any new bugs, I still found the exercise useful because it helped me get a better understanding of the behavior of the technologies that are involved. There’s nothing like trying to model an existing system to teach you that you don’t actually understand the implementation details as well as you thought you did.

This problem only manifested when the following conditions were all present in the execution:

  1. virtual threads
  2. platform threads
  3. all threads contending on a lock
  4. some virtual threads trying to acquire the lock when inside a synchronized block
  5. some virtual threads trying to acquire the lock when not inside a synchronized block

To be able to model this, we need to understand some details about:

  • virtual threads
  • synchronized blocks and how they interact with virtual threads
  • Java lock behavior when there are multiple threads waiting for the lock

Virtual threads

Virtual threads are very well-explained in JEP 444, and I don’t think I can really improve upon that description. I’ll provide just enough detail here so that my TLA+ model makes sense, but I recommend that you read the original source. (Side note: Ron Pressler, one of the authors of JEP 444, just so happens to be a TLA+ expert).

If you want to write a request-response style application that can handle multiple concurrent requests, then programming in a thread-per-request style is very appealing, because the style of programming is easy to reason about.

The problem with thread-per-request is that each individual thread consumes a non-trivial amount of system resources (most notably, dedicated memory for its call stack). If you have too many threads, then you can start to see performance issues. But in the thread-per-request model, the number of concurrent requests your application can service is capped by the number of threads: the number of threads your system can run becomes the bottleneck.

You can avoid the thread bottleneck by writing in an asynchronous style instead of thread-per-request. However, async code is generally regarded as more difficult for programmers to write than thread-per-request.

Virtual threads are an attempt to keep the thread-per-request programming model while reducing the resource cost of individual threads, so that programs can run comfortably with many more threads.

The general approach of virtual threads is referred to as user-mode threads. Traditionally, threads in Java are wrappers around operating-system (OS) level threads, which means there is a 1:1 relationship between a thread from the Java programmer’s perspective and a thread from the operating system’s perspective. With user-mode threads, this is no longer true, which means that virtual threads are much “cheaper”: they consume fewer resources, so you can have many more of them. They still have the same API as OS-level threads, which makes them easy to use for programmers who are familiar with threads.

However, you still need the OS-level threads (which JEP 444 refers to as platform threads), as we’ll discuss in the next section.

How virtual threads actually run: carriers

For a virtual thread to execute, the scheduler needs to assign a platform thread to it. This platform thread is called the carrier thread, and JEP 444 uses the term mounting to refer to assigning a virtual thread to a carrier thread. The way you set up your Java application to use virtual threads is by configuring a pool of platform threads that will back your virtual threads. The scheduler then mounts virtual threads onto the carrier threads in the pool.

A virtual thread may be assigned to different platform threads throughout the virtual thread’s lifetime: the scheduler is free to unmount a virtual thread from one platform thread and then later mount the virtual thread onto a different platform thread.

Importantly, a virtual thread needs to be mounted to a platform thread in order to run. If it isn’t mounted to a carrier thread, the virtual thread can’t run.

Synchronized

While virtual threads are new to Java, the synchronized statement has been in the language for a long time. From the Java docs:

A synchronized statement acquires a mutual-exclusion lock (§17.1) on behalf of the executing thread, executes a block, then releases the lock. While the executing thread owns the lock, no other thread may acquire the lock.

Here’s an example of a synchronized statement used by the Brave distributed tracing instrumentation library, which is referenced in the original post:

final class RealSpan extends Span {
  ...
  final PendingSpans pendingSpans;
  final MutableSpan state;

  @Override public void finish(long timestamp) {
    synchronized (state) {
      pendingSpans.finish(context, timestamp);
    }
  }
}

In the example method above, the effect of the synchronized statement is to:

  1. Acquire the lock associated with the state object
  2. Execute the code inside of the block
  3. Release the lock

(Note that in Java, every object is associated with a lock). This type of synchronization API is sometimes referred to as a monitor.

Synchronized blocks are relevant here because they interact with virtual threads, as described below.

Virtual threads, synchronized blocks, and pinning

Synchronized blocks guarantee mutual exclusion: only one thread at a time can execute inside of a synchronized block. In order to enforce this guarantee, the Java engineers made the decision that, when a virtual thread enters a synchronized block, it must be pinned to its carrier thread. This means that the scheduler is not permitted to unmount a virtual thread from a carrier until the virtual thread exits the synchronized block.
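
To make the dangerous combination concrete, here’s a hypothetical sketch (the class and names are mine, not from the Netflix post) of a virtual thread blocking on a ReentrantLock while pinned:

import java.util.concurrent.locks.ReentrantLock;

class PinnedWait {
    static final ReentrantLock LOCK = new ReentrantLock();
    static final Object MONITOR = new Object();

    static void contendWhilePinned() {
        synchronized (MONITOR) { // entering the block pins this virtual thread
                                 // to its carrier
            LOCK.lock();         // may block waiting for the lock; meanwhile the
                                 // carrier can't run any other virtual thread
            try {
                // critical section
            } finally {
                LOCK.unlock();
            }
        }
    }
}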

ReentrantLock as FIFO

The last piece of the puzzle is related to the implementation details of a particular Java synchronization primitive: the ReentrantLock class. What’s important here is how this class behaves when there are multiple threads contending for the lock, and the thread that currently holds the lock releases it.

The ReentrantLock class delegates the acquire and release logic behavior to the AbstractQueuedSynchronizer class. As noted in the source code comments, this class is a variation of a type of lock called a CLH lock, which was developed independently by Travis Craig in his paper Building FIFO and Priority-Queueing Spin Locks from Atomic Swap, and by Erik Hagersten and Anders Landin in their paper Queue locks on cache coherent multiprocessors, where they refer to it as an LH lock. (Note that the Java implementation is not a spin lock).

The relevant implementation detail is that this lock is a FIFO: it maintains a queue with the identity of the threads that are waiting on the lock, ordered by arrival. As you can see in the code, after a thread releases a lock, the next thread in line gets notified that the lock is free.

Modeling in TLA+

The sets of threads

I modeled the threads using three constant sets:

CONSTANTS VirtualThreads,
          CarrierThreads,
          FreePlatformThreads

VirtualThreads is the set of virtual threads that execute the application logic. CarrierThreads is the set of platform threads that will act as carriers for the virtual threads. To model the failure mode that was originally observed in the blog post, I also needed to have platform threads that will execute the application logic. I called these FreePlatformThreads, because these platform threads are never bound to virtual threads.

I called the threads that directly execute the application logic, whether virtual or (free) platform, “work threads”:

PlatformThreads == CarrierThreads \union FreePlatformThreads
WorkThreads == VirtualThreads \union FreePlatformThreads

Here’s a Venn diagram:

The state variables

My model has five variables:

VARIABLES pc,  
          schedule, 
          inSyncBlock,
          pinned,
          lockQueue 

TypeOk == /\ pc \in [WorkThreads -> {"ready", "to-enter-sync", "requested", "locked", "in-critical-section", "to-exit-sync", "done"}]
          /\ schedule \in [WorkThreads -> PlatformThreads \union {NULL}]
          /\ inSyncBlock \subseteq VirtualThreads
          /\ pinned \subseteq CarrierThreads
          /\ lockQueue \in Seq(WorkThreads)

  • pc – program counter for each work thread
  • schedule – function that describe how work threads are mounted to platform threads
  • inSyncBlock – the set of virtual threads that are currently inside a synchronized block
  • pinned – the set of carrier threads that are currently pinned to a virtual thread
  • lockQueue – a queue of work threads that are currently waiting for the lock

My model only models a single lock, which I believe is this lock in the original scenario. Note that I don’t model the locks associated with the synchronization blocks, because there was no contention associated with those locks in the scenario I’m modeling: I’m only modeling synchronization for the pinning behavior.

Program counters

Each work thread can take one of two execution paths through the code. Here’s a simplified view of the two possible code paths: the left code path is when a thread enters a synchronized block before it tries to acquire the lock, and the right code path is when a thread does not enter a synchronized block before it tries to acquire the lock.

This is conceptually how my model works, but I modeled the left-hand path a little differently than this diagram shows. The program counters in my model actually transition back from “to-enter-sync” to “ready”, and they use a separate state variable (inSyncBlock) to determine whether to transition to the “to-exit-sync” state at the end. The other difference is that the program counters in my model will also jump directly from “ready” to “locked” if the lock is currently free. But showing those state transitions directly would make the above diagram harder to read.

The lock

The FIFO lock is modeled as a TLA+ sequence of work threads. Note that I don’t need to explicitly model whether the lock is held. My model just checks that the thread that wants to enter the critical section is at the head of the queue, and when that thread leaves the critical section, it gets removed from the head. The relevant actions look like this:

AcquireLock(thread) ==
    /\ pc[thread] = "requested"
    /\ Head(lockQueue) = thread
    ...

ReleaseLock(thread) ==
    /\ pc[thread] = "in-critical-section"
    /\ lockQueue' = Tail(lockQueue)
    ...

To check if a thread has the lock, I just check if it’s in a state that requires the lock:

\* A thread has the lock if it's in one of the pc that requires the lock
HasTheLock(thread) == pc[thread] \in {"locked", "in-critical-section"}

I also defined this invariant, which I checked with TLC, to ensure that only the thread at the head of the lock queue can be in any of these states:

HeadHasTheLock ==
    \A t \in WorkThreads : HasTheLock(t) => t = Head(lockQueue)

Pinning in synchronized blocks

When a virtual thread enters a synchronized block, the corresponding carrier thread gets marked as pinned. When it leaves a synchronized block, it gets unmarked as pinned.

\* We only care about virtual threads entering synchronized blocks
EnterSynchronizedBlock(virtual) ==
    /\ pc[virtual] = "to-enter-sync"
    /\ pinned' = pinned \union {schedule[virtual]}
    ...

ExitSynchronizedBlock(virtual) ==
    /\ pc[virtual] = "to-exit-sync"
    /\ pinned' = pinned \ {schedule[virtual]}

The scheduler

My model has an action named Mount that mounts a virtual thread to a carrier thread. In my model, this action can fire at any time, whereas in the actual implementation of virtual threads, I believe that thread scheduling only happens at blocking calls. Here’s what it looks like:

\* Mount a virtual thread to a carrier thread, bumping the other thread
\* We can only do this when the carrier is not pinned, and when the virtual thread is not already on a pinned carrier
\* We need to unschedule the previous thread
Mount(virtual, carrier) ==
    LET carrierInUse == \E t \in VirtualThreads : schedule[t] = carrier 
         prev == CHOOSE t \in VirtualThreads : schedule[t] = carrier
    IN /\ carrier \notin pinned
       /\ \/ schedule[virtual] = NULL 
          \/ schedule[virtual] \notin pinned
       \* If a virtual thread has the lock already, then don't pre-empt it.
       /\ carrierInUse => ~HasTheLock(prev)
       /\ schedule' = IF carrierInUse
                      THEN [schedule EXCEPT ![virtual]=carrier, ![prev]=NULL]
                      ELSE [schedule EXCEPT ![virtual]=carrier]
       /\ UNCHANGED <<lockQueue, pc, pinned, inSyncBlock>>

My scheduler also won’t pre-empt a thread that is holding the lock. That’s because, in the original blog post, nobody was holding the lock during the deadlock, and I wanted to reproduce that specific scenario. So, once a thread gets the lock, the scheduler won’t stop it from releasing the lock. And yet, we can still get a deadlock!

Only executing while scheduled

Every action is only allowed to fire if the thread is scheduled. I implemented it like this:

IsScheduled(thread) == schedule[thread] # NULL

\* Every action has this pattern:
Action(thread) ==
  /\ IsScheduled(thread)
  ....

Finding the deadlock

Here’s the config file I used to run a simulation. I defined three virtual threads, two carrier threads, and one free platform thread.

INIT Init
NEXT Next

CONSTANTS 
    NULL = NULL

    VirtualThreads = {V1, V2, V3}
    CarrierThreads = {C1, C2}
    FreePlatformThreads = {T1}

CHECK_DEADLOCK TRUE

INVARIANTS 
    TypeOk
    MutualExclusion
    PlatformThreadsAreSelfScheduled
    VirtualThreadsCantShareCarriers
    HeadHasTheLock

And, indeed, the model checker finds a deadlock! Here’s what the trace looks like in the Visual Studio Code plugin.

Note how virtual thread V2 is in the “requested” state according to its program counter, and V2 is at the head of the lockQueue, which means that it is able to take the lock. But the schedule shows “V2 :> NULL”, which means that the thread is not scheduled to run on any carrier thread.

And we can see from the “pinned” set that all of the carrier threads {C1, C2} are pinned. We also see that threads V1 and V3 are behind V2 in the lock queue. They’re blocked waiting for the lock, and they’re holding on to the carrier threads, so V2 will never get a chance to execute.

There’s our deadlock!

You can find my complete model on GitHub at: https://github.com/lorin/virtual-threads-tla/