Locks, leases, fencing tokens, FizzBee!

FizzBee is a new formal specification language, originally announced back in May of last year. FizzBee’s author, Jayaprabhakar (JP) Kadarkarai, reached out to me recently and asked me what I thought of it, so I decided to give it a go.

To play with FizzBee, I decided to model some algorithms that solve the mutual exclusion problem, more commonly known as locking. Mutual exclusion algorithms are a classic use case for formal modeling, but here’s some additional background motivation: a few years back, there was an online dust-up between Martin Kleppmann (author of the excellent book Designing Data-Intensive Applications, commonly referred to as DDIA) and Salvatore Sanfilippo (creator of Redis, and better known by his online handle antirez). They were arguing about the correctness of an algorithm called Redlock that claims to achieve fault-tolerant distributed locking. Here are some relevant links:

As a FizzBee exercise, I wanted to see how difficult it was to model the problem that Kleppmann had identified in Redlock.

Keep in mind here that I’m just a newcomer to the language writing some very simple models as a learning exercise.

Critical sections

Here’s my first FizzBee model. It models the execution of two processes, with an invariant stating that at most one process can be in the critical section at a time. Note that this model doesn’t actually enforce mutual exclusion, so I was just looking to see that the assertion would be violated.

# Invariant to check
always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])
NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # critical section
        self.in_cs = True
        pass

        # after critical section
        self.in_cs = False
        pass

action Init:
    processes = []
    for i in range(NUM_PROCESSES):
        processes.append(Process())

The “pass” statements are no-ops; I just use them as stand-ins for “code that would execute before/during/after the critical section”.

FizzBee is built on Starlark, which is a subset of Python, which is why the model looks so Pythonic. Writing a FizzBee model felt like writing a PlusCal model, but without the need to specify labels explicitly, and with a much more familiar syntax.

The lack of labels was both a blessing and a curse. In PlusCal, the control state is something you can explicitly reference in your model, which is useful when you want to refer to the critical section in an invariant. Because FizzBee doesn’t have labels, I had to create a separate variable called “in_cs” to model when a process was in its critical section. In general, though, I find PlusCal’s label syntax annoying, and I’m happy that FizzBee doesn’t require it.

FizzBee has an online playground: you can copy the model above, paste it directly into the playground, and click “Run”, and it will tell you that the invariant failed.

FAILED: Model checker failed. Invariant:  MutualExclusion

The “Error Formatted” view shows how the two processes both landed on line 17, hence violating mutual exclusion:

Locks

Next up, I modeled locking in FizzBee. In general, I like to model a lock as a set, where taking the lock means adding the id of the process to the set, because if I need to, I can see:

  • who holds the lock by the elements of the set
  • if two processes somehow manage to take the same lock (multiple elements in the set)
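Before showing the FizzBee version, here’s the lock-as-a-set idea as a plain Python sketch (the names are mine, not FizzBee’s):

```python
# Model a lock as a set: acquiring adds the process id, so the set
# tells us who holds the lock, and its size exposes double acquisition.
lock = set()

def acquire(pid):
    # in a real model, this check-and-add would need to be atomic
    if not lock:
        lock.add(pid)
        return True
    return False

def release(pid):
    lock.discard(pid)

assert acquire("p1")        # p1 takes the lock
assert not acquire("p2")    # p2 can't: the set is non-empty
assert lock == {"p1"}       # we can see exactly who holds it
assert len(lock) <= 1       # and check that nobody double-acquired
release("p1")
assert acquire("p2")
```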

Here’s my FizzBee model:

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # acquire lock
        atomic:
            require not lock
            lock.add(self.__id__)

        #
        # critical section
        #
        self.in_cs = True
        pass
        self.in_cs = False

        # release lock
        lock.clear()

        # after critical section
        pass

action Init:
    processes = []
    lock = set()
    for i in range(NUM_PROCESSES):
        processes.append(Process())

By default, each statement in FizzBee is executed atomically, and you can use an atomic block to treat multiple statements as a single atomic step.
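Without the atomic block, the require and the add could interleave across processes. Here’s a plain Python sketch (mine, not FizzBee) of the bad interleaving:

```python
lock = set()

# both processes evaluate the guard while the lock is still free...
p1_sees_free = not lock
p2_sees_free = not lock

# ...so both proceed to "acquire" it
if p1_sees_free:
    lock.add("p1")
if p2_sees_free:
    lock.add("p2")

# the set now reveals the double acquisition
assert lock == {"p1", "p2"}
```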

If you run this in the playground, you’ll see that the invariant holds, but there’s a different problem: deadlock.

DEADLOCK detected
FAILED: Model checker failed

FizzBee’s model checker does two things by default:

  1. Checks for deadlock
  2. Assumes that a thread can crash after any arbitrary statement

In the “Error Formatted” view, you can see what happened. The first process took the lock and then crashed. This leads to deadlock, because the lock never gets released.

Leases

If we want to build a fault-tolerant locking solution, we need to handle the scenario where a process fails while it owns the lock. The Redlock algorithm uses the concept of a lease, which is a lock that expires after a period of time.

To model leases, we now need to model time. To keep things simple, my model assumes a global clock that all processes have access to.

NUM_PROCESSES = 2
LEASE_LENGTH = 10


always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)

        # check that we still have the lock
        if lock.owner == self.__id__:
            # critical section
            self.in_cs = True
            pass
            self.in_cs = False

            # release the lock
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

Now the lock has an expiration time, so we no longer have the deadlock problem. But the invariant no longer always holds.

FizzBee also has a neat view called the “Explorer” where you can step through and see how the state variables change over time. Here’s a screenshot, which shows the problem:

The problem is that one process can think it holds the lock when the lock has actually expired, which means another process can take the lock, and both can end up in the critical section.
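To make the failure concrete, here’s a small Python sketch (my own simplification, not FizzBee) replaying that interleaving with an explicit clock:

```python
LEASE_LENGTH = 10
clock = 0
lock = {"owner": None, "expiration_time": -1}
in_cs = set()

def try_acquire(pid):
    global lock
    # free, or expired: either way the lease is up for grabs
    if lock["owner"] is None or clock >= lock["expiration_time"]:
        lock = {"owner": pid, "expiration_time": clock + LEASE_LENGTH}
        return True
    return False

assert try_acquire("p1")
# p1 checks its ownership, passes, and heads for the critical section...
assert lock["owner"] == "p1"
in_cs.add("p1")
# ...then pauses. The clock advances past p1's lease:
clock = LEASE_LENGTH
# p2 now sees an expired lease and takes the lock
assert try_acquire("p2")
in_cs.add("p2")
# both processes are in the critical section at once
assert in_cs == {"p1", "p2"}
```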

Fencing tokens

Kleppmann noted this problem with Redlock, that it was vulnerable to issues where a process’s execution could pause for some period of time (e.g., due to garbage collection). Kleppmann proposed using fencing tokens to prevent a process from accessing a shared resource with an expired lock.
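The idea is that the lock service hands out a monotonically increasing token with each grant, and the shared resource itself rejects any request carrying a token lower than one it has already seen. A Python sketch of the resource-side check (the names are mine, not from Kleppmann’s post):

```python
class FencedStore:
    """Shared resource that rejects requests with stale fencing tokens."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # fence off any token lower than one we've already seen
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(1, "k", "from-p1")    # p1 writes with token 1
assert store.write(2, "k", "from-p2")    # p2 writes with token 2
assert not store.write(1, "k", "stale")  # p1's delayed write is fenced off
assert store.data["k"] == "from-p2"
```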

Here’s how I modeled fencing tokens:

NUM_PROCESSES = 2
LEASE_LENGTH = 10

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

atomic action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)
            self.token = next_token
            next_token += 1

        # can only enter the critical section
        # if we have the highest token seen so far
        atomic:
            if self.token > last_token_seen:
                last_token_seen = self.token

                # critical section
                self.in_cs = True
                pass

        # after critical section
        self.in_cs = False

        # release the lock
        atomic:
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0

    next_token = 1
    last_token_seen = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

However, if you run this through the model checker, you’ll discover that the invariant is also violated!

It turns out that fencing tokens don’t protect against the scenario where two processes both believe they hold the lock, and the lower token reaches the shared resource before the higher token:

A scenario where fencing tokens don’t ensure mutual exclusion
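In code form, the failing interleaving looks like this: a Python sketch (my own, mirroring the model’s scenario) where the token check passes for both processes because the lower token happens to arrive first:

```python
class FencedResource:
    def __init__(self):
        self.last_token_seen = 0
        self.active = set()  # processes currently operating on the resource

    def enter(self, pid, token):
        # accept only tokens higher than any seen so far
        if token > self.last_token_seen:
            self.last_token_seen = token
            self.active.add(pid)
            return True
        return False

    def leave(self, pid):
        self.active.discard(pid)

r = FencedResource()
assert r.enter("p1", 1)          # the lower token arrives first...
assert r.enter("p2", 2)          # ...then the higher one, before p1 leaves
assert r.active == {"p1", "p2"}  # both overlap on the shared resource
```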

I reached out to Martin Kleppmann to ask about this, and he agreed that fencing tokens would not protect against this scenario.

Impressions

I found FizzBee surprisingly easy to get started with, although I only really scratched the surface here. In my case, having experience with PlusCal helped a lot, as I already knew how to write my specifications in a similar style. You can write your specs in TLA+ style, as a collection of atomic actions rather than as one big non-atomic action, but the PlusCal-style felt more natural for these particular problems I was modeling.

The Pythonic syntax will be much more familiar to programmers than PlusCal and TLA+, which should help with adoption. In some cases, though, I found myself missing the conciseness of the set notation that languages like TLA+ and Alloy support. I ended up leveraging Python’s list comprehensions, which have a set-builder-notation feel to them.

Newcomers to formal specification will still have to learn how to think in terms of TLA+ style models: while FizzBee looks like Python, conceptually it is like TLA+, a notation for specifying a set of state-machine behaviors, which is very different from a Python program. I don’t know what it will be like for learners.

I was a little bit confused by FizzBee’s default behavior of a thread being able to crash at any arbitrary point, but that’s configurable, and I was able to use it to good effect to show deadlock in the lock model above.

Finally, while I read Kleppmann’s article years ago, I never noticed the issue with fencing tokens until I actually tried to model it explicitly. This is a good reminder of the value of formally specifying an algorithm. I fooled myself into thinking I understood it, but I actually hadn’t. It wasn’t until I went through the exercise of modeling it that I discovered something about its behavior that I hadn’t realized before.

4 thoughts on “Locks, leases, fencing tokens, FizzBee!”

  1. Hey Lorin. Great article, thanks a lot. I have a question about the fencing token: As far as I understand, the fencing token has to be enforced by the shared resource being accessed. Two components compete for the lock. Either or both components may believe they have acquired the lock. Now they send a command to the shared resource together with their fencing token. The shared resource only accepts commands with the highest fencing token it has seen. This setup does not prevent two components from believing they are the leader, but it will prevent two components from acting as the leader at the same time. (With a delay, but with ordering guarantees. Even if another component acquired the lock, the shared resource will accept commands from the previous holder until it has received a command from the latest holder.) So the protocol should indeed be able to enforce mutual exclusion. No?

    1. Fencing tokens only prevent the scenario where the lower token tries to access the shared resource after the higher token. If the timing is such that the lower token happens to get there first (imagine the lower token reaching the shared resource *just before* the higher token), then the monotonicity of the tokens doesn’t help: the two requests will overlap on the shared resource.

      1. Which has the intended effect, no? No two components are in the critical section together, i.e., no two components can command the shared resource at the same time, and the ordering of leaders is respected (first c1 was the leader and the shared resource only accepted commands from c1; then c2 became the leader, and after first witnessing a command from c2, the shared resource only accepts commands from c2).

      2. No, the two components can still overlap in time on the critical resource. For this application (mutual exclusion) the ordering doesn’t matter; all that matters is that the two clients don’t overlap in time on the shared resource.

        But I would encourage you to build your own model and define the safety property!

Leave a comment