Models, models every where, so let’s have a think

If you’re a regular reader of this blog, you’ll have noticed that I tend to write about two topics in particular:

  1. Resilience engineering
  2. Formal methods

I haven’t found many people who share both of these interests.

At one level, this isn’t surprising. Formal methods people tend to have an analytic outlook, and resilience engineering people tend to have a synthetic outlook. You can see the clear distinction between these two perspectives in the transcript of Leslie Lamport’s talk entitled The Future of Computing: Logic or Biology. Lamport is clearly on the side of logic, so much so that he ridicules the very idea of taking a biological perspective on software systems. By contrast, resilience engineering types actively look to biology for inspiration on understanding resilience in complex adaptive systems. A great example of this is the late Richard Cook’s talk on The Resilience of Bone.

And yet, the two fields both have something in common: they both recognize the value of creating explicit models of aspects of systems that are not typically modeled.

You use formal methods to build a model of some aspect of your software system, in order to help you reason about its behavior. A formal model of a software system is a partial one, typically only a very small part of the system. That’s because it takes effort to build and validate these models: the larger the model, the more effort it takes. We typically focus our models on a part of the system that humans aren’t particularly good at reasoning about unaided, such as concurrent or distributed algorithms.

The act of creating an explicit model and observing its behavior with a model checker gives you a new perspective on the system being modeled, because the explicit modeling forces you to think about aspects that you likely wouldn’t have considered. You won’t say “I never imagined X could happen” when building this type of formal model, because it forces you to explicitly think about what would happen in situations that you can gloss over when writing a program in a traditional programming language. While the scope of a formal model is small, you have to exhaustively specify the thing within the scope you’ve defined: there’s no place to hide.

Resilience engineering is also concerned with explicit models, in two different ways. In one way, resilience engineering stresses the inherent limits of models for reasoning about complex systems (cf. itsonlyamodel.com). Every model is incomplete in potentially dangerous ways, and every incident can be seen through the lens of model error: some model that we had about the behavior of the system turned out to be incorrect in a dangerous way.

But beyond the limits of models, what I find fascinating about resilience engineering is the emphasis on explicitly modeling aspects of the system that are frequently ignored by traditional analytic perspectives. Two kinds of models that come up frequently in resilience engineering are mental models and models of work.

A resilience engineering perspective on an incident will look to make explicit aspects of the practitioners’ mental models, both in the events that led up to that incident, and in the response to the incident. When we ask “How did the decision make sense at the time?”, we’re trying to build a deeper understanding of someone else’s state of mind. We’re explicitly trying to build a descriptive model of how people made decisions, based on what information they had access to, their beliefs about the world, and the constraints that they were under. This is a meta sort of model, a model of a mental model, because we’re trying to reason about how somebody else reasoned about events that occurred in the past.

A resilience engineering perspective on incidents will also try to build an explicit model of how work happens in an organization. You’ll often hear the shorthand phrase work-as-imagined vs work-as-done to get at this modeling, where it’s the work-as-done that is the model that we’re after. The resilience engineering perspective asserts that the documented process of how work is supposed to happen is not an accurate model of how work actually happens, and that the deviation between the two is generally successful, which is why it persists. From resilience engineering types, you’ll hear questions in incident reviews that try to elicit some more details about how the work really happens.

Like in formal methods, resilience engineering models only get at a small part of the overall system. There’s no way we can build complete models of people’s mental models, or generate complete descriptions of how they do their work. But that’s ok. Because, like the models in formal methods, the goal is not completeness, but insight. Whether we’re building a formal model of a software system, or participating in a post-incident review meeting, we’re trying to get the maximum amount of insight for the modeling effort that we put in.

Paxos made visual in FizzBee

Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. — Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm.

In fact, [Paxos] is among the simplest and most obvious of distributed algorithms. — Leslie Lamport, Paxos Made Simple.

I was interested in exploring FizzBee more, specifically to play around with its functionality for modeling distributed systems. In my previous post about FizzBee, I modeled a multithreaded system where coordination happened via shared variables. But FizzBee has explicit support for modeling message-passing in distributed systems, and I wanted to give that a go.

I also wanted to use this as an opportunity to learn more about a distributed algorithm that I had never modeled before, so I decided to use it to model Leslie Lamport’s Paxos algorithm for solving the distributed consensus problem. Examples of Paxos implementations in the wild include Amazon’s DynamoDB, Google’s Spanner, Microsoft Azure’s Cosmos DB, and Cassandra. But it has a reputation of being difficult to understand.

You can see my FizzBee model of Paxos at https://github.com/lorin/paxos-fizzbee/blob/main/paxos-register.fizz.

What problem does Paxos solve?

Paxos solves what is known as the consensus problem. Here’s how Lamport describes the requirements for consensus.

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.

I’ve always found the term chosen here to be confusing. In my mind, it invokes some agent in the system doing the choosing, which implies that there must be a process that is aware of which value is the chosen consensus value once the choice has been made. But that isn’t actually the case. In fact, it’s possible that a value has been chosen without any one process in the system knowing what the consensus value is.

One way to verify that you really understand a concept is to try to explain it in different words. So I’m going to recast the problem to implementing a particular abstract data type: a single-assignment register.

Single assignment register

A register is an abstract data type that can hold a single value. It supports two operations: read and write. You can think of a register like a variable in a programming language.

A single assignment register is a register that can only be written to once. Once a client writes to the register, all future writes will fail: only reads will succeed. The register starts out with a special uninitialized value, the sort of thing we’d represent as NULL in C or None in Python.

If the register has been written to, then a read will return the written value.

Only one write can succeed against a single assignment register. In this example, it is the “B” write that succeeds.

Some things to note about the specification for our single assignment register:

  • The specification doesn’t say anything about which write should succeed; we only care that at most one write succeeds.
  • The write operations don’t return a value, so the writers don’t receive information about whether the write succeeded. The only way to know if a write succeeded is to perform a read.
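
To make these rules concrete, here’s a minimal sketch of a single-assignment register in plain Python (not FizzBee). It’s only an illustration of the semantics described above: the class name is mine, and I’m using None as the uninitialized value, as suggested earlier.

class SingleAssignmentRegister:
    """A register that accepts at most one successful write."""

    def __init__(self):
        self.value = None  # the special "uninitialized" value

    def write(self, v):
        # Writes return nothing, so a writer can't tell whether it succeeded;
        # the only way to find out is to do a read.
        if self.value is None:
            self.value = v
        # Any later write silently fails: the register keeps its first value.

    def read(self):
        # Returns None if nothing has been written yet, else the written value.
        return self.value

r = SingleAssignmentRegister()
r.write("x")
r.write("y")              # this write fails: the register is already assigned
assert r.read() == "x"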

Instead of thinking of Paxos as a consensus algorithm, you can think of it as implementing a single assignment register. The chosen value is the value where the write succeeds.

I used Lamport’s Paxos Made Simple paper as my guide for modeling the Paxos algorithm. Here’s the mapping between terminology used in that paper and the alternate terminology that I’m using here.

Paxos Made Simple paper | Single assignment register (this blog post)
choosing a value        | quorum write
proposers               | writers
acceptors               | storage nodes
learners                | readers
accepted proposal       | local write
proposal number         | logical clock

As a side note: if you ever wanted to practice doing a refinement mapping with TLA+, you could take one of the existing TLA+ Paxos models and see if you can define a refinement mapping to a single assignment register.

Making our register fault-tolerant with quorum write

One of Paxos’s requirements is that it is fault tolerant. That means a solution that implements a single assignment register using a single node isn’t good enough, because that node might fail. We need multiple nodes to implement our register:

Our single assignment register must be implemented using multiple nodes. The red square depicts a failed node.

If you’ve ever used a distributed database like DynamoDB or Cassandra, then you’re likely familiar with how they use a quorum strategy, where a single write or read may result in queries against multiple database nodes.

You can think of Paxos as implementing a distributed database that consists of one single assignment register, where it implements quorum writes.

Here’s how these quorum writes work (there’s a sketch after this list):

  1. The writer selects a quorum of nodes to attempt to write to: this is a set of nodes that must contain at least a majority of the cluster. For example, if the entire cluster contains five nodes, then a quorum must contain at least three.
  2. The writer attempts to write to every node in the quorum it has selected.
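
Here’s the basic quorum arithmetic as a small Python sketch (not FizzBee, and deliberately naive: it does nothing yet to enforce single assignment, which is what the rest of the algorithm is for). The function names and node names are made up for illustration.

import random

def is_majority(count, cluster_size):
    # A quorum must contain more than half of the cluster:
    # for a five-node cluster, that's at least three nodes.
    return count > cluster_size // 2

def naive_quorum_write(nodes, value):
    # Step 1: pick any subset of nodes that contains at least a majority.
    quorum_size = random.randint(len(nodes) // 2 + 1, len(nodes))
    quorum = random.sample(sorted(nodes), quorum_size)
    # Step 2: attempt to write the value to every node in the chosen quorum.
    for name in quorum:
        nodes[name] = value
    return quorum

# Five storage nodes, none of which has been written to yet.
nodes = {"alpha": None, "beta": None, "gamma": None, "delta": None, "epsilon": None}
quorum = naive_quorum_write(nodes, "x")
assert is_majority(len(quorum), len(nodes))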

In Lamport’s original paper that introduced Paxos, The Part-Time Parliament, he showed a worked out example of a Paxos execution. Here’s that figure, with some annotations that I’ve added to describe it in terms of a single assignment quorum write register.

In this example, there are five nodes in the cluster, designated by Greek letters {Α,Β,Γ,Δ,Ε}.

The number (#) column acts as a logical clock; we’ll get to that later.

The decree column shows the value that a client attempts to write. In this example, there are two different values that clients attempt to write: {α,β}.

The quorum and voters columns indicate which nodes are in the quorum that the writer selected. A square around a node indicates that the write succeeded against that node. In this example, a quorum must contain at least three nodes, though it can have more than three: the quorum in row 5 contains four nodes.

Under this interpretation, in the first row, the write operation with the argument α succeeded on node Δ: there was a local write to node Δ, but there was not yet a quorum write, as it only succeeded on one node.

While the overall algorithm implements a single assignment register, the individual nodes themselves do not behave as single assignment registers: the value written to a node can potentially change during the execution of the Paxos algorithm. In the example above, in row 27, the value β is successfully written to node Δ, which is different from the value α written to that node in row 2.

Safety condition: can’t change a majority

The write to our single assignment register occurs when there’s a quorum write: when a majority of the nodes have the same value written to them. To enforce single assignment, we cannot allow a majority of nodes to see a different written value over time.

Here’s how I expressed that safety condition in FizzBee, where written_values is a history variable that keeps track of which values were successfully written to a majority of nodes.

# Only a single value is written
always assertion SingleValueWritten:
    return len(written_values)<=1

Here’s an example scenario that would violate that invariant:

In this scenario, there are three nodes {a,b,c} and two writers. The first writer writes the value x to nodes a and b. As a consequence, x is the value written to the majority of nodes. The second writer writes the value y to nodes b and c, and so y becomes the value written to the majority of nodes. This means that the set of values written is: {x, y}. Because our single assignment register only permits one value to be registered, the algorithm must ensure that a scenario like this does not occur.
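
Here’s that scenario as a short Python walkthrough (again, plain Python rather than FizzBee; majority_value is my own helper, and written_values plays the same role as the history variable in the invariant above):

def majority_value(nodes):
    # Return the value held by a majority of the nodes, if there is one.
    values = list(nodes.values())
    for v in set(values) - {None}:
        if values.count(v) > len(nodes) // 2:
            return v
    return None

nodes = {"a": None, "b": None, "c": None}
written_values = set()

nodes["a"] = nodes["b"] = "x"              # writer 1 writes x to nodes a and b
written_values.add(majority_value(nodes))  # x is now the majority value

nodes["b"] = nodes["c"] = "y"              # writer 2 writes y to nodes b and c
written_values.add(majority_value(nodes))  # now y is the majority value

# The SingleValueWritten invariant requires len(written_values) <= 1,
# but here two different values have each held a majority.
assert written_values == {"x", "y"}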

Paxos uses two strategies to prevent writes that could change the majority:

  1. Read-before-write to prevent clobbering a known write
  2. Unique, logical timestamps to prevent concurrent writes

Read before write

In Paxos, a writer will first do a read against all of the nodes in its quorum. If any node already contains a write, the writer will use the existing written value.

In the first phase, writer 2 reads a value x from node b. In phase 2, it writes x instead of y to avoid changing the majority.
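
Here’s a rough Python sketch of the read-before-write rule (this ignores timestamps entirely; the real algorithm adopts the value with the latest timestamp, which is what the logical clocks in the next section are for):

def read_before_write(nodes, quorum, my_value):
    # Phase 1: read every node in the quorum.
    existing = [nodes[name] for name in quorum if nodes[name] is not None]
    # If any node already holds a write, adopt that value instead of our own,
    # so we can't clobber a value that may already have reached a majority.
    value = existing[0] if existing else my_value
    # Phase 2: write the (possibly adopted) value to every node in the quorum.
    for name in quorum:
        nodes[name] = value
    return value

nodes = {"a": "x", "b": "x", "c": None}  # writer 1 already wrote x to a and b
assert read_before_write(nodes, ["b", "c"], "y") == "x"  # writer 2 adopts x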

Preventing concurrent writes

The read-before-write approach works if writer 2 tries to do a write after writer 1 has completed its write. But if the writes overlap, then this will not prevent one writer from clobbering the other writer’s quorum write:

Writer 2 clobbers writer 1’s write on node b because writer 1’s write had not yet happened when writer 2 did its read.

Paxos solves this by using a logical clock scheme to ensure that only one concurrent writer can succeed. Note that Lamport doesn’t refer to it as a logical clock, but I found it useful to think of it this way.

Each writer has a local clock, and each writer’s clock is set to a different value. When a writer makes read or write calls, it passes its clock’s time as an additional argument.

Each storage node also keeps a logical clock. The storage node’s clock is updated by read calls: if the timestamp of a read call is later than the storage node’s local clock, then the node will advance its clock to match the read timestamp. The node will reject writes with timestamps that are earlier than its clock.

Node b rejects writer 1’s write

In the example above, node b rejects writer 1’s write because the write has a timestamp of 1, and node b has a logical clock value of 2. As a consequence, a quorum write only occurs when writer 2 completes its write.

Readers

The writes are the interesting part of Paxos, which is where I focused. In my FizzBee model, I chose the simplest way to implement readers: a pub-sub approach where each node publishes out each successful write to all of the readers.

A simple reader implementation is to broadcast each local write to all of the readers.

The readers then keep a tally of the writes that have occurred on each node, and when they identify a majority, they record it.

Modeling with FizzBee

For my FizzBee model, I defined three roles:

  1. Writer
  2. StorageNode
  3. Reader

Writer

There are two phases to the writes. I modeled each phase as an action. Each writer uses its own identifier, __id__, as the value to be written. This is the sort of thing you’d do when using Paxos to do leader election.

role Writer:
    action Init:
        self.v = self.__id__
        self.latest_write_seen = -1
        self.quorum = genericset()

    action Phase1:
        unsent = genericset(storage_nodes)
        while is_majority(len(unsent)):
            node = any unsent
            response = node.read_and_advance_clock(self.clock)
            (clock_advanced, previous_write) = response
            unsent.discard(node)

            require clock_advanced
            atomic:
                self.quorum.add(node)
                if previous_write and previous_write.ts > self.latest_write_seen:
                    self.latest_write_seen = previous_write.ts
                    self.v = previous_write.v

    action Phase2:
        require is_majority(len(self.quorum))
        for node in self.quorum:
            node.write(self.clock, self.v)

One thing that isn’t obvious is that there’s a variable named clock that gets automatically injected into the role when the instance is created in the top-level Init action:

action Init:
    writers = []
    ...
    for i in range(NUM_WRITERS):
        writers.append(Writer(clock=i))

This is how I ensured that each writer had a unique timestamp associated with it.

StorageNode

The storage node needs to support two RPC calls, one for each of the write phases:

  1. read_and_advance_clock
  2. write

It also has a helper function named notify_readers, which does the reader broadcast.

role StorageNode:
    action Init:
        self.local_writes = genericset()
        self.clock = -1

    func read_and_advance_clock(clock):
        if clock > self.clock:
            self.clock = clock

        latest_write = None

        if self.local_writes:
            latest_write = max(self.local_writes, key=lambda w: w.ts)
        return (self.clock == clock, latest_write)


    atomic func write(ts, v):
        # request's timestamp must not be earlier than our clock
        require ts >= self.clock

        w = record(ts=ts, v=v)
        self.local_writes.add(w)
        self.record_history_variables(w)

        self.notify_readers(w)

    func notify_readers(write):
        for r in readers:
            r.publish(self.__id__, write)

There’s a helper function I didn’t show here called record_history_variables, which I defined to record some data I needed for checking invariants, but isn’t important for the algorithm itself.

Reader

Here’s my FizzBee model for a reader. Note how it supports one RPC call, named publish.

role Reader:
    action Init:
        self.value = None
        self.tallies = genericmap()
        self.seen = genericset()

    # receive a publish event from a storage node
    atomic func publish(node_id, write):
        # Process a publish event only once per (node_id, write) tuple
        require (node_id, write) not in self.seen
        self.seen.add((node_id, write))

        self.tallies.setdefault(write, 0)
        self.tallies[write] += 1
        if is_majority(self.tallies[write]):
            self.value = write.v

Generating interesting visualizations

I wanted to generate a trace where a quorum write succeeded but not all nodes wrote the same value.

I defined an invariant like this:

always assertion NoTwoNodesHaveDifferentWrittenValues:
    # we only care about cases where consensus was reached
    if len(written_values)==0:
        return True
    s = set([max(node.local_writes, key=lambda w: w.ts).v for node in storage_nodes if node.local_writes])
    return len(s)<=1

Once FizzBee found a counterexample, I used it to generate the following visualizations:

Sequence diagram generated by FizzBee
State of the model generated by FizzBee

General observations

I found that FizzBee was a good match for modeling Paxos. FizzBee’s roles mapped nicely onto the roles described in Paxos Made Simple, and the phases mapped nicely onto FizzBee’s actions. FizzBee’s first-class support for RPC made the communication easy to implement.

I also appreciated the visualizations that FizzBee generated. I found both the sequence diagrams and the model state visualizations useful as I was debugging my model.

Finally, I learned a lot more about how Paxos works by going through the exercise of modeling it, as well as writing this blog post to explain it. When it comes to developing a better understanding of an algorithm, there’s no substitute for the act of building a formal model of it and then explaining your model to someone else.

Locks, leases, fencing tokens, FizzBee!

FizzBee is a new formal specification language, originally announced back in May of last year. FizzBee’s author, Jayaprabhakar (JP) Kadarkarai, reached out to me recently and asked me what I thought of it, so I decided to give it a go.

To play with FizzBee, I decided to model some algorithms that solve the mutual exclusion problem, more commonly known as locking. Mutual exclusion algorithms are a classic use case for formal modeling, but here’s some additional background motivation: a few years back, there was an online dust-up between Martin Kleppmann (author of the excellent book Designing Data-Intensive Applications, commonly referred to as DDIA) and Salvatore Sanfilippo (creator of Redis, and better known by his online handle antirez). They were arguing about the correctness of an algorithm called Redlock that claims to achieve fault-tolerant distributed locking. Here are some relevant links:

As a FizzBee exercise, I wanted to see how difficult it was to model the problem that Kleppmann had identified in Redlock.

Keep in mind here that I’m just a newcomer to the language writing some very simple models as a learning exercise.

Critical sections

Here’s my first FizzBee model. It models the execution of two processes, with an invariant that states that at most one process can be in the critical section at a time. Note that this model doesn’t actually enforce mutual exclusion, so I was just looking to see that the assertion was violated.

# Invariant to check
always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])
NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # critical section
        self.in_cs = True
        pass

        # after critical section
        self.in_cs = False
        pass

action Init:
    processes = []
    for i in range(NUM_PROCESSES):
        processes.append(Process())

The “pass” statements are no-ops; I just use them as stand-ins for “code that would execute before/during/after the critical section”.

FizzBee is built on Starlark, which is a subset of Python, which is why the model looks so Pythonic. Writing a FizzBee model felt like writing a PlusCal model, without the need for specifying labels explicitly, and also with a much more familiar syntax.

The lack of labels was both a blessing and a curse. In PlusCal, the control state is something you can explicitly reference in your model. This is useful when you want to specify a property about the critical section as an invariant. Because FizzBee doesn’t have labels, I had to create a separate variable called “in_cs” to be able to model when a process was in its critical section. In general, though, I find PlusCal’s label syntax annoying, and I’m happy that FizzBee doesn’t require it.

FizzBee has an online playground: you can copy the model above and paste it directly into the playground and click “Run”, and it will tell you that the invariant failed.

FAILED: Model checker failed. Invariant:  MutualExclusion

The “Error Formatted” view shows how the two processes both landed on line 17, hence violating mutual exclusion:

Locks

Next up, I modeled locking in FizzBee. In general, I like to model a lock as a set, where taking the lock means adding the id of the process to the set, because if I need to, I can see:

  • who holds the lock by the elements of the set
  • if two processes somehow manage to take the same lock (multiple elements in the set)

Here’s my FizzBee model:

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

NUM_PROCESSES = 2

role Process:
    action Init:
        self.in_cs = False

    action Next:
        # before critical section
        pass

        # acquire lock
        atomic:
            require not lock
            lock.add(self.__id__)

        #
        # critical section
        #
        self.in_cs = True
        pass
        self.in_cs = False

        # release lock
        lock.clear()

        # after critical section
        pass

action Init:
    processes = []
    lock = set()
    in_cs = set()
    for i in range(NUM_PROCESSES):
        processes.append(Process())

By default, each statement in FizzBee is treated atomically, and you can specify an atomic block to treat multiple statements atomically.

If you run this in the playground, you’ll see that the invariant holds, but there’s a different problem: deadlock.

DEADLOCK detected
FAILED: Model checker failed

FizzBee’s model checker does two things by default:

  1. Checks for deadlock
  2. Assumes that a thread can crash after any arbitrary statement

In the “Error Formatted” view, you can see what happened. The first process took the lock and then crashed. This leads to deadlock, because the lock never gets released.

Leases

If we want to build a fault-tolerant locking solution, we need to handle the scenario where a process fails while it owns the lock. The Redlock algorithm uses the concept of a lease, which is a lock that expires after a period of time.

To model leases, we now need to model time. To keep things simple, my model assumes a global clock that all processes have access to.

NUM_PROCESSES = 2
LEASE_LENGTH = 10


always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)

        # check that we still have the lock
        if lock.owner == self.__id__:
            # critical section
            self.in_cs = True
            pass
            self.in_cs = False

            # release the lock
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

Now the lock has an expiration time, so we don’t have the deadlock problem anymore. But the invariant is no longer always true.

FizzBee also has a neat view called the “Explorer” where you can step through and see how the state variables change over time. Here’s a screenshot, which shows the problem:

The problem is that one process can think it holds the lock, but the lock has actually expired, which means another process can take the lock, and they can both end up in the critical section.

Fencing tokens

Kleppmann noted this problem with Redlock: it is vulnerable to scenarios where a process’s execution pauses for some period of time (e.g., due to garbage collection). Kleppmann proposed using fencing tokens to prevent a process from accessing a shared resource with an expired lock.

Here’s how I modeled fencing tokens:

NUM_PROCESSES = 2
LEASE_LENGTH = 10

always assertion MutualExclusion:
    return not any([p1.in_cs and p2.in_cs for p1 in processes
                                          for p2 in processes
                                          if p1 != p2])

atomic action AdvanceClock:
    clock += 1

role Process:
    action Init:
        self.in_cs = False

    action Next:
        atomic:
            require lock.owner == None or \
                    clock >= lock.expiration_time
            lock = record(owner=self.__id__,
                          expiration_time=clock+LEASE_LENGTH)
            self.token = next_token
            next_token += 1

        # can only enter the critical section
        # if we have the highest token seen so far
        atomic:
            if self.token > last_token_seen:
                last_token_seen = self.token

                # critical section
                self.in_cs = True
                pass

        # after critical section
        self.in_cs = False

        # release the lock
        atomic:
            if lock.owner == self.__id__:
                lock.owner = None

action Init:
    processes = []
    # global clock
    clock = 0

    next_token = 1
    last_token_seen = 0
    lock = record(owner=None, expiration_time=-1)
    for i in range(NUM_PROCESSES):
        processes.append(Process())

However, if you run this through the model checker, you’ll discover that the invariant is also violated!

It turns out that fencing tokens don’t protect against the scenario where two processes both believe they hold the lock, and the lower token reaches the shared resource before the higher token:

A scenario where fencing tokens don’t ensure mutual exclusion
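
Here’s that interleaving rendered in plain Python (the names are mine, not from the FizzBee model above). The shared resource accepts any request whose fencing token is higher than the highest token it has seen so far, which is the check the fencing-token scheme relies on:

last_token_seen = 0
in_resource = set()

def access_resource(process, token):
    # The resource rejects a request only if it has already seen a higher token.
    global last_token_seen
    if token > last_token_seen:
        last_token_seen = token
        in_resource.add(process)
        return True
    return False

# Process 1 acquires the lease with token 1, then pauses until its lease expires.
# Process 2 then acquires the lease with token 2.
# Process 1's delayed request reaches the shared resource *before* process 2's:
assert access_resource("p1", 1)     # accepted: 1 > 0
assert access_resource("p2", 2)     # accepted: 2 > 1, and both are now inside
assert in_resource == {"p1", "p2"}  # mutual exclusion is violated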

I reached out to Martin Kleppmann to ask about this, and he agreed that fencing tokens would not protect against this scenario.

Impressions

I found FizzBee surprisingly easy to get started with, although I only really scratched the surface here. In my case, having experience with PlusCal helped a lot, as I already knew how to write my specifications in a similar style. You can write your specs in TLA+ style, as a collection of atomic actions rather than as one big non-atomic action, but the PlusCal-style felt more natural for these particular problems I was modeling.

The Pythonic syntax will be much more familiar to programmers than PlusCal and TLA+, which should help with adoption. In some cases, though, I found myself missing the conciseness of the set notation that languages like TLA+ and Alloy support. I ended up leveraging Python’s list comprehensions, which have a set-builder-notation feel to them.

Newcomers to formal specification will still have to learn how to think in terms of TLA+ style models: while FizzBee looks like Python, conceptually it is like TLA+, a notation for specifying a set of state-machine behaviors, which is very different from a Python program. I don’t know what it will be like for learners.

I was a little bit confused by FizzBee’s default behavior of a thread being able to crash at any arbitrary point, but that’s configurable, and I was able to use it to good effect to show deadlock in the lock model above.

Finally, while I read Kleppmann’s article years ago, I never noticed the issue with fencing tokens until I actually tried to model it explicitly. This is a good reminder of the value of formally specifying an algorithm. I fooled myself into thinking I understood it, but I actually hadn’t. It wasn’t until I went through the exercise of modeling it that I discovered something about its behavior that I hadn’t realized before.

Resilience: some key ingredients

Brian Marick posted on Mastodon the other day about resilience in the context of governmental efficiency. Reading that inspired me to write about some more general observations about resilience.

Now, people use the term resilience in different ways. I’m using resilience here in the following sense: how well a system is able to cope when it is pushed beyond its limits. Or, to borrow a term from safety researcher David Woods, when the system is pushed outside of its competence envelope. The technical term for this sense of the word resilience is graceful extensibility, which also comes from Woods. This term is a marriage of two other terms: graceful degradation, and software extensibility.

The term graceful degradation refers to the behavior of a system which, when it experiences partial failures, can still provide some functionality, even though it’s at a reduced fidelity. For example, for a web app, this might mean that some particular features are unavailable, or that some percentage of users are not able to access the site. Contrast this with a system that just returns 500 errors for everyone whenever something goes wrong.

We talk about extensible software systems as ones that have been designed to make it easy to add new features in the future that were not originally anticipated. A simple example of software extensibility is the ability for old code to call new code, with dynamic binding being one way to accomplish this.
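
As a toy illustration of old code calling new code via dynamic binding, here’s a minimal Python sketch (the handler registry is my own example, not from any particular system): the dispatcher below is “old” code written with no knowledge of the handler that gets registered later.

# "Old" code: a dispatcher shipped before any handlers existed.
handlers = {}

def handle(event_type, payload):
    # Dynamic dispatch: look up whatever handler is registered at call time.
    handler = handlers.get(event_type)
    if handler is None:
        raise ValueError("no handler for " + event_type)
    return handler(payload)

# "New" code: added later, yet the old dispatcher can call it unchanged.
def handle_refund(payload):
    return "refunding order %s" % payload["order_id"]

handlers["refund"] = handle_refund

print(handle("refund", {"order_id": 42}))  # old code calling new code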

Now, putting those two concepts together, if a system encounters some sort of shock that it can’t handle, and the system has the ability to extend itself so that it can now handle the shock, and it can make these changes to itself quickly enough that it minimizes the harms resulting from the shock, then we say the system exhibits graceful extensibility. And if it can keep extending itself each time it encounters a novel shock, then we say that the system exhibits sustained adaptability.

The rest of this post is about the preconditions for resilience. I’m going to talk about resilience in the context of dealing with incidents. Note that all of the topics described below come from the resilience engineering literature, although I may not always use the same terminology.

Resources

As Brian Marick observed in his toot:

As we discovered with Covid, efficiency is inversely correlated with resilience.

Here’s a question you can ask anyone who works in the compute infrastructure space: “How hot do you run your servers?” Or, even more meaningfully, “How much headroom do your servers have?”

Running your servers “hotter” means running at a higher CPU utilization. This means that you pack more load on fewer servers, which is more efficient. The problem is that the load is variable, which means that the hotter you run the servers, the more likely your server will get overloaded if there is a spike in utilization. An overloaded server can lead to an incident, and incidents are expensive! Running your servers at maximum utilization is running with zero headroom. We deliberately run our servers with some headroom to be able to handle variation in load.

We also see the idea of spare resources in what we call failover scenarios, where there’s a failure in one resource so we switch to using a different resource, such as failing over a database from primary to secondary, or even failing out of a geographical region.

The idea of spare resources is more general than hardware. It applies to people as well. The equivalent of headroom for humans is what Tom DeMarco refers to as slack. The more loaded humans are, the less well positioned they are to handle spikes in their workload. Stuff falls through the cracks when you’ve got too much load, and some of that stuff contributes to incidents. We can even keep people in reserve for dealing with shocks, such as when an organization staffs a dedicated incident management team.

A common term that the safety people use for spare resources is capacity. I really like the way Todd Conklin put it on his Pre-Accident Investigation Podcast: “You don’t manage risk. You manage the capacity to absorb risk.” Another way he put it is “Accidents manage you, so what you really manage is the capacity for the organization to fail safely.”

Flexibility

Here’s a rough and ready definition of an incident: the system has gotten itself into a bad state, and it’s not going to return to a good state unless somebody does something about it.

Now, by this definition, for the system to become healthy again something about how the system works has to change. This means we need to change the way we do things. The easier it is to make changes to the system, the easier it will be to resolve the incident.

We can think of two different senses of changing the work of the system: the human side and the software side.

Humans in a system are constrained by a set of rules that exist to reduce risk. We don’t let people YOLO code from their laptops into production, because of the risks that doing so would expose us to. But incidents create scenarios where the risks associated with breaking these rules are lower than the risks associated with prolonging the incident. As a consequence, people in the system need the flexibility to be able to break the standard rules of work during an incident. One way to do this is to grant incident responders autonomy: let them make judgments about when they are able to break the rules that govern normal work, in scenarios where breaking the rule is less risky than following it.

Things look different on the software side, where all of the rules are mechanically enforced. For flexibility in software, we need to build into the software functionality in advance that will let us change the way the system behaves. My friend Aaron Blohowiak uses the term Jefferies tubes from Star Trek to describe features that support making operational changes to a system. These were service crawlways that made it easier for engineers to do work on the ship.

A simple example of this type of operational flexibility is putting in feature flags that can be toggled dynamically in order to change system behavior. At the other extreme is the ability to bring up a REPL on a production system in order to make changes. I’ve seen this multiple times in my career, including watching someone use the rails console command of a Ruby on Rails app to resolve an issue.

The technical term in resilience engineering for systems that possess this type of flexibility is adaptive capacity: the system has built up the ability to be able to dynamically reconfigure itself, to adapt, in order to meet novel challenges. This is where the name Adaptive Capacity Labs comes from.

Expertise

In general, organizations push against flexibility because it brings risk. In the case where I saw someone bring up a Ruby on Rails console, I was simultaneously impressed and terrified: that’s so dangerous!

Because flexibility carries risk, we need to rely on judgment as to whether the risk of leveraging the flexibility outweighs the risk of not using the flexibility to mitigate the incident. Granting people the autonomy to make those judgment calls isn’t enough: the people making the calls need to be able to make good judgment calls. And for that, you need expertise.

The people making these calls are having to make decisions balancing competing risks while under uncertainty and time pressure. In addition, how fluent they are with the tools is a key factor. I would never trust a novice with access to a REPL in production. But an expert? By definition, they know what they’re doing.

Diversity

Incidents in complex systems involve interactions between multiple parts of the system, and there’s no one person in your organization who understands the whole thing. To know what to do during an incident, you need to bring in different people who understand different parts of the system to help figure out what is happening. You need diversity in your responders: people with different perspectives on the problem at hand.

You also want diversity in diagnostic and mitigation strategies. Some people might think about recent changes, others might think about traffic pattern changes, others might dive into the codebase looking for clues, and yet others might look to see if there’s another, seemingly related problem going on right now. In addition, it’s often not obvious what the best course of action is to mitigate an incident. Responders often pursue multiple courses of action in parallel, hoping that at least one of them will bring the system back to health. A diversity of perspectives can help generate more potential interventions, reducing the time to resolve.

Coordination

Having a group of experts with a diverse set of perspectives by itself isn’t enough to deal with an incident. For a system to be resilient, the people within the system need to be able to coordinate, to work together effectively.

If you’ve ever dealt with a complex incident, you know how challenging coordination can be. Things get even hairier in our distributed world. Whether you’re physically located with all of the responders, you’re on a Zoom call (a bridge, as we still say), you’re messaging over Slack, or some hybrid combination of all three, each type of communication channel has its benefits and drawbacks.

There are prescriptive approaches to improving coordination during incidents, such as the Incident Command System (ICS). However, Laura Maguire’s research has shown that, in practice, incident responders intentionally deviate from ICS to better manage coordination costs. This is yet another example of flexibility and expertise being employed to deal with an incident.


The next time you observe an incident, or you reflect on an incident where you were one of the responders, think about the extent to which these ingredients were present or absent. Were you able to leverage spare resources, or did you suffer from not being able to? Were there operational changes that people wanted to make during the incident, and were they actually able to make them? Were the responders experienced with the sub-systems they were dealing with, and how did that shape their responses? Did different people come up with different hypotheses and strategies? Was it clear to you what the different responders were doing during the incident? These issues are easy to miss if you’re not looking for them. But, once you internalize them, you’ll never be able to unsee them.

You’re missing your near misses

FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025

The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling it is to people: they worry very specifically about that incident recurring, and want to prevent that from happening again.

Here’s the problem: most of your incidents aren’t going to be repeat incidents. Nobody wants an incident to recur, and so there’s a natural built-in mechanism for engineering teams to put in the effort to do preventative work. The real challenge is preventing and quickly mitigating novel future incidents, which make up the overwhelming majority of your incidents.

And that brings us to near misses, those operational surprises that have no actual impact, but could have been a major incident if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, omens.

Because most of our incidents are novel, and because near misses are a source of insight about novel future incidents, if we are serious about wanting to improve reliability, we should be treating our near misses as first-class entities, the way we do with incidents. Yet, I’d wager that there are no tech companies out there today that would put the same level of effort into a near miss as they would into a real incident. I’d love to hear about a tech company that holds near miss reviews, but I haven’t heard of any yet.

There are real challenges to treating near misses as first-class. We can generally afford to spend a lot of post-incident effort on each high-severity incident, because there generally aren’t that many of them. I’m quite confident that your org encounters many more near misses than it does high-severity incidents, and nobody has the cycles to put in the same level of effort for every near-miss as they do for every high severity incident. This means that we need to use judgment. We can’t use severity of impact to guide us here, because these near misses are, by definition, zero severity. We need to identify which near misses are worth examining further, and which ones to let go. It’s going to be a judgment call about how much we think we could potentially learn from looking further.

The other challenge is just surfacing these near misses. Because they are zero impact, it’s likely that only a handful of people in the organization are aware when a near miss happens. Treating near misses as first-class events requires a cultural shift in an organization, where the people who are aware of them highlight the near miss as a potential source of insight for improving reliability. People have to see the value in sharing when these happen; it has to be rewarded or it won’t happen.

These near misses are happening in your organization right now. Some of them will eventually blossom into full-blown high-severity incidents. If you’re not looking for them, you won’t see them.

The danger of overreaction

The California-based blogger Kevin Drum has a good post up today with the title Why don’t we do more prescribed burning? An explainer. There’s a lot of great detail in the post, but the bit that really jumped out at me was the history of the enormous forest fires that burned in Yellowstone National Park in 1988.

Norris Geyser Basin in Yellowstone National Park, August 20, 1988
By Jeff Henry – National Park Service archives, Public Domain

In 1988 the US Park Service allowed several lightning fires to burn in Yellowstone, eventually causing a conflagration that consumed over a million acres. Public fury was intense. In a post-mortem after the fire:

The team reaffirmed the fundamental importance of fire’s natural role but recommended that fire management plans be strengthened…. Until new fire management plans were prepared, the Secretaries suspended all prescribed natural fire programs in parks and wilderness areas.

This, in turn, made me think about the U.S. government’s effort to vaccinate the population against a potential swine flu epidemic in 1976, under the Gerald Ford administration.

Gerald Ford receiving swine flu vaccine
By David Hume Kennerly – Gerald R. Ford Presidential Library: B1874-07A, Public Domain

The vaccination effort did not go well, as recounted by the historian George Dehner in the journal article WHO Knows Best? National and International Responses to Pandemic Threats and the “Lessons” of 1976:

The Swine Flu Program was marred by a series of logistical problems ranging from the production of the wrong vaccine strain to a confrontation over liability protection to a temporal connection of the vaccine and a cluster of deaths among an elderly population in Pittsburgh. The most damning charge against the vaccination program was that the shots were correlated with an increase in the number of patients diagnosed with an obscure neurological disease known as Guillain–Barré syndrome. The program was halted when the statistical increase was detected, but ultimately the New York Times labeled the program a “fiasco” because the feared pandemic never appeared.

Fortunately, swine flu didn’t become an epidemic, but it’s easy to imagine an alternative history where the epidemic materialized. In that scenario, the U.S. population would have suffered because the vaccination program was stopped. I don’t know how this experience shaped the minds of policymakers at the U.S. Centers for Disease Control (CDC), but I can certainly imagine the memories of the swine flu “fiasco” influencing the calculus of how early to start pushing for a vaccine. After all, look what happened the last time we tried to head off a potential pandemic.

When a high-severity incident happens, its associated risk becomes salient: the incident looms large in our minds, and the fact that it just happened leads us to believe that the risk of a similar incident is very high. Suddenly, folks who normally extol the virtues of being data-driven are all too comfortable extrapolating from a single data point. But this tendency to fixate on a particular risk is dangerous, for the following two reasons:

  1. We continually face a multitude of risks, not just a single one.
  2. Risks trade off of each other.

We don’t deal with an individual risk but with a vast and ever-growing menu of risks. At best, when we focus on only one risk, we pay the opportunity cost of neglecting the others. Attention is a precious resource, and focusing our attention on one particular risk means, necessarily, that we will neglect other risks.

But it’s even worse than that. In our effort to drive down a risk that just manifested as an incident, we can end up increasing the risk of a future incident. Fire suppression is a clear example of how an action taken to reduce risk can increase risk.

As Richard Cook noted, all practitioner actions are gambles. We don’t get to choose between “more safe” and “less safe”. The decisions we make always carry risk because of the uncertainties: we just can’t predict the future well enough to understand how our actions will reshape the risks. Remember that the next time people rush to address the risks exposed by the last major incident. Because the fact that an incident just happened does not improve your ability to predict the future, no matter how severe that incident was. All of those other risks are still out there, waiting to manifest as different incidents altogether. Your actions might even end up making those future incidents worse.

Whither dashboard design?

The sorry state of dashboards

It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But my naive understanding is that the primary way an operator interacts with Honeycomb is by querying it. And it sounds like a very nifty tool for doing that: I’ve certainly felt the absence of being able to do high-cardinality queries when trying to narrow down where a problem is, and I would love to have access to a tool like that.

But we humans didn’t evolve to query our environment, we evolved to navigate it, and we have a very sophisticated visual system to help us navigate a complex world. Honeycomb does leverage the visual system by generating visualizations, but you submit the query first, and then you get the visualization.

In principle, a well-designed dashboard would engage our visual system immediately: look first, get a clue about where to look next, and then take the next diagnostic step, whether that’s explicitly querying, or navigating to some other visualization. The problem, which Charity illustrates in her tweet, is that we consistently design our dashboards poorly. Given how much information is potentially available to us, we aren’t good at designing dashboards that work well with our human brains to help us navigate all of that information.

Dashboard research of yore

Now, back in the 80s and 90s, for many physical systems that were supervised by operators (think: industrial control systems, power plants, etc.), dashboards were all they had. And there was some interesting cognitive systems engineering research back then about how to design dashboards that took into account what we knew about the human perceptual and cognitive systems.

For example, there was a proposed approach for designing user interfaces for operators called ecological interface design, by Kim Vicente and Jens Rasmussen. Vicente and Rasmussen were both engineering researchers who worked in human factors (Vicente’s background was in industrial and mechanical engineering, Rasmussen’s in electronic engineering). They co-wrote an excellent paper titled Ecological Interface Design: Theoretical Foundations. Ecological Interface Design builds on Rasmussen’s previous work on the abstraction hierarchy, which he developed based on studying how technicians debugged electronic circuits. It also builds on his skills, rules, and knowledge (SRK) framework.

More tactically, David Woods published a set of concepts for better leveraging the visual system, under the name visual momentum. These concepts include supporting check-reads (at-a-glance information), longshots, perceptual landmarks, and display overlaps. For more details, see the papers Visual Momentum: A Concept to Improve the Cognitive Coupling of Person and Computer and How Not to Have to Navigate Through Too Many Displays.

What’s the state of dashboard design today?

I’m not aware of anyone in our industry working on the “how do we design better dashboards?” question today. As far as I can tell, discussions around observability these days center more around platform-y questions, like:

  • What kinds of observability data should we collect?
  • How should we store it?
  • What types of queries should we support?

For example, here’s Charity Majors, on “Observability 2.0: How do you debug?“, on the third bullet (emphasis mine):

You check your instrumentation, or you watch your SLOs. If something looks off, you see what all the mysterious events have in common, or you start forming hypotheses, asking a question, considering the result, and forming another one based on the answer. You interrogate your systems, following the trail of breadcrumbs to the answer, every time.

You don’t have to guess or rely on elaborate, inevitably out-of-date mental models. The data is right there in front of your eyes. The best debuggers are the people who are the most curious.

Your debugging questions are analysis-first: you start with your user’s experience.

I’d like to see our industry improve the check your instrumentation part of that to make it easier to identify if something looks off, providing cues about where to look next. To be explicit:

  1. I always want the ability to query my system in the way that Honeycomb supports, with high-cardinality drill-down and correlations.
  2. I always want to start off with a dashboard, not a query interface.

In other words, I always want to start off with a dashboard, and use that as a jumping-off point to do queries.

And, maybe there are folks out there in observability-land working on how to improve dashboard design. But, if so, I’m not aware of that work. Just looking at the schedule from Monitorama 2024, the word “dashboard” does not appear even once.

And that makes me sad. Because, while not everyone has access to tooling like Honeycomb, everyone has access to dashboards. And the state-of-the-dashboard doesn’t seem like it’s going to get any better anytime soon.

The Canva outage: another tale of saturation and resilience

Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post.

What happened at Canva

Trigger: deploying a new version of a page

The trigger for this incident was Canva deploying a new version of their editor page. It’s notable that there was nothing wrong with this new version. The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.

A stale traffic rule

Canva uses Cloudflare as their CDN. Being a CDN, Cloudflare has datacenters all over the world, which are interconnected by a private backbone. Now, I’m not a networking person, but my basic understanding of private backbones is that CDNs lease fibre-optic lines from telecom companies and use these leased lines to ensure that they have dedicated network connectivity and bandwidth between their sites.

Unfortunately for Canva, there was a previously unknown issue on Cloudflare’s side: Cloudflare wasn’t using their dedicated fibre-optic lines to route traffic between their Northern Virginia and Singapore datacenters. That traffic was instead, unintentionally, going over the public internet.

[A] stale rule in Cloudflare’s traffic management system [that] was sending user IPv6 traffic over public transit between Ashburn and Singapore instead of its default route over the private backbone.

Traffic between Northern Virginia (IAD) and Singapore (SIN) was incorrectly routed over the public network

The routes that this traffic took suffered from considerable packet loss. For Canva users in Asia, this meant that they experienced massive increases in latency when their web browsers attempted to fetch the javascript static assets from the CDN.

A stale rule like this is the kind of issue that the safety researcher James Reason calls a latent pathogen. It’s a problem that remains unnoticed until it emerges as a contributor to an incident.

High latency synchronizes the callers

Normally, an increase in errors would cause our canary system to abort a deployment. However, in this case, no errors were recorded because requests didn’t complete. As a result, over 270,000+ user requests for the JavaScript file waited on the same cache stream. This created a backlog of requests from users in Southeast Asia.

The first client attempts to fetch the new JavaScript files from the CDN, but the files aren’t there yet, so the CDN must fetch them from the origin. Because of the added latency, this takes a long time.

During this time, other clients connect and attempt to fetch the JavaScript from the CDN. But the CDN has not yet been populated with the files from the origin; that transfer is still in progress.

As Cloudflare notes in this blog post, when subsequent clients request a file that is in the process of being populated in the cache, they would normally have to wait until the file has been fully cached before they can download it. However, Cloudflare has implemented functionality called Concurrent Streaming Acceleration, which permits multiple clients to simultaneously download a file that is still being fetched from the origin server.

The resulting behavior is that the CDN now behaves effectively as a barrier, with all of the clients slowly but simultaneously downloading the assets. With a traditional barrier, the waiting processes can proceed once all processes have entered the barrier. This isn’t quite the same: here, the waiting clients can all proceed once the CDN completes downloading the asset from the origin.
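To make that barrier-like behavior concrete, here’s a minimal asyncio sketch of request coalescing. This is my own illustration, not Cloudflare’s implementation, and the numbers are made up: the first request for a missing asset kicks off a single origin fetch, every later request for the same asset awaits that fetch, and all waiting clients are released at the same moment.

```python
import asyncio

# A toy sketch of CDN-style request coalescing (not Cloudflare's implementation).
# The first request for an uncached asset triggers one origin fetch; every later
# request awaits the same fetch, so all waiting clients complete simultaneously.

_inflight: dict[str, asyncio.Task] = {}

async def fetch_from_origin(path: str) -> bytes:
    await asyncio.sleep(5)          # in the Canva incident this took ~20 minutes
    return b"editor javascript bundle"

async def cdn_get(path: str) -> bytes:
    if path not in _inflight:       # only the first request hits the origin
        _inflight[path] = asyncio.create_task(fetch_from_origin(path))
    return await _inflight[path]    # everyone else waits on the same in-flight fetch

async def main() -> None:
    # 1,000 clients standing in for the 270,000+ requests in the incident.
    results = await asyncio.gather(*(cdn_get("/editor.js") for _ in range(1000)))
    print(len(results), "clients released at the same instant")

asyncio.run(main())
```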

The transfer completes, the herd thunders

At 9:07 AM UTC, the asset fetch completed, and all 270,000+ pending requests were completed simultaneously.

20 minutes after Canva deployed the new JavaScript assets to the origin server, the clients completed fetching them. The next action the clients took was to call Canva’s API service.

With the JavaScript file now accessible, client devices resumed loading the editor, including the previously blocked object panel. The object panel loaded simultaneously across all waiting devices, resulting in a thundering herd of 1.5 million requests per second to the API Gateway — 3x the typical peak load.

There’s one more issue that made this situation even worse: a known performance issue in the API gateway that was slated to be fixed.

A problematic call pattern to a library reduces service throughput

The API Gateways use an event loop model, where code running on event loop threads must not perform any blocking operations. 

Two common threading models for request-response services are thread-per-request and async. For services that are I/O-bound (i.e., most of the time servicing each request is spent waiting for I/O operations to complete, typically networking operations), the async model has the potential to achieve better throughput. That’s because the concurrency of the thread-per-request model is limited by the number of operating-system threads. The async model services multiple requests per thread, and so it doesn’t suffer from the thread bottleneck. Canva’s API gateway implements the async model using the popular Netty library.

One of the drawbacks of the async model is the risk associated with the active thread getting blocked, because this can result in a significant performance penalty. The async model multiplexes multiple requests across an individual thread, and none of those requests can make progress when that thread is blocked. Programmers writing code in a service that uses the async model need to take care to minimize the number of blocking calls.
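Here’s a small sketch of that hazard in Python’s asyncio rather than Canva’s actual Netty/Java code; the handler names and timings are mine, but the effect is the same. A blocking call on the event loop thread stalls every request multiplexed onto that loop, while offloading the blocking work keeps the other requests flowing.

```python
import asyncio
import time

# A toy asyncio analogue of the hazard described above (Canva's gateway uses
# Netty, not Python; the numbers here are made up). A blocking call on the event
# loop thread delays every request on that loop, not just the one that made it.

def blocking_telemetry_call() -> None:
    time.sleep(0.5)   # stands in for acquiring a contended lock in a telemetry library

async def slow_handler_bad() -> None:
    blocking_telemetry_call()                          # BAD: blocks the event loop

async def slow_handler_good() -> None:
    await asyncio.to_thread(blocking_telemetry_call)   # offload the blocking work

async def cheap_handler() -> float:
    start = time.monotonic()
    await asyncio.sleep(0.01)          # a request that should take ~10ms
    return time.monotonic() - start

async def measure(slow_handler) -> None:
    cheap = [cheap_handler() for _ in range(4)]
    slow = [slow_handler() for _ in range(4)]
    results = await asyncio.gather(*cheap, *slow)
    print("cheap request latencies:", [f"{lat:.2f}s" for lat in results[:4]])

async def main() -> None:
    await measure(slow_handler_bad)    # cheap requests stall behind the blocking calls
    await measure(slow_handler_good)   # cheap requests stay close to 10ms

asyncio.run(main())
```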

Prior to this incident, we’d made changes to our telemetry library code, inadvertently introducing a performance regression. The change caused certain metrics to be re-registered each time a new value was recorded. This re-registration occurred under a lock within a third-party library.

In Canva’s case, the API gateway logic was making calls to a third-party telemetry library. They were calling the library in such a way that it acquired a lock on every call, which is a blocking operation. This reduced the effective throughput that the API gateway could handle.
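We don’t get to see Canva’s actual telemetry code, but the general shape of the regression is easy to sketch. Everything below, names included, is hypothetical: re-registering a metric on every recorded value drags each request through a registry-wide lock, instead of registering once up front and doing a cheap per-value update.

```python
import threading

# Hypothetical sketch of the shape of the regression (not Canva's actual code):
# re-registering a metric on every recorded value takes a registry-wide lock each
# time, instead of registering once and doing a cheap per-value update afterwards.

class MetricRegistry:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._metrics: dict[str, float] = {}

    def register(self, name: str) -> None:
        with self._lock:                    # registry-wide lock: a blocking call
            self._metrics.setdefault(name, 0.0)

    def set(self, name: str, value: float) -> None:
        self._metrics[name] = value         # cheap update once registered

registry = MetricRegistry()

# The regression: every recorded value pays for the lock.
def record_value_regressed(name: str, value: float) -> None:
    registry.register(name)                 # re-registers (and locks) on every call
    registry.set(name, value)

# The intended pattern: register once, then record freely.
registry.register("request_latency_ms")
def record_value_intended(value: float) -> None:
    registry.set("request_latency_ms", value)
```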

Calls to the library led to excessive thread locking

Although the issue had already been identified and a fix had entered our release process the day of the incident, we’d underestimated the impact of the bug and didn’t expedite deploying the fix. This meant it wasn’t deployed before the incident occurred.

Ironically, they were aware of this problematic call pattern, and they were planning on deploying a fix the day of the incident(!).

As an aside, it’s worth noting the role that telemetry behavior played in the recent OpenAI incident, and the locking behavior of a tracing library in a complex performance issue that Netflix experienced. Observability giveth reliability, and observability taketh reliability away.

Canva is now in a situation where the API gateway is receiving much more traffic than it was provisioned to handle, and is also suffering from a performance regression that reduces its ability to handle traffic even further.

Now let’s look at how the system behaved under these conditions.

The load balancer turns into an overload balancer

Because the API Gateway tasks were failing to handle the requests in a timely manner, the load balancers started opening new connections to the already overloaded tasks, further increasing memory pressure.

A load balancer sits in front of a service and distributes the incoming requests across the units of compute. Canva runs atop ECS, so the individual units are called tasks, and the group is called a cluster (you can think of these as being equivalent to pods and replicasets in Kubernetes-land).

The load balancer will only send requests to a task that is healthy. If a task is unhealthy, then it stops being considered as a candidate target destination for the load balancer. This yields good results if the overall cluster is provisioned to handle the load: the traffic gets redirected away from the unhealthy tasks and onto the healthy ones.

Load balancer only sends traffic to the healthy tasks

But now consider the scenario where all of the tasks are operating close to capacity. As tasks go unhealthy, the load balancer will redistribute the load to the remaining “healthy” tasks, which increases the likelihood that those tasks get pushed into an unhealthy state.

Redirecting traffic to the almost-overloaded healthy nodes will push them over

This is a classic example of a positive feedback loop: the more tasks go unhealthy, the more traffic the remaining healthy tasks receive, and the more likely those tasks are to go unhealthy as well.
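Here’s a toy simulation of that feedback loop. The per-task capacity is made up, and this isn’t Canva’s actual configuration; the point is just that a fixed offered load gets spread across whichever tasks are still healthy, so each task that tips over increases the per-task load on the survivors.

```python
# A toy simulation of the feedback loop (made-up per-task capacity, not Canva's
# setup): a fixed offered load is spread across whichever tasks are still
# healthy, so each task that goes unhealthy raises the load on the survivors.

OFFERED_LOAD = 1_500_000      # requests/sec arriving at the load balancer
CAPACITY_PER_TASK = 62_500    # requests/sec a task can absorb before going unhealthy
tasks_healthy = 20            # 20 * 62,500 = 1.25M rps of capacity: not enough for 1.5M

while tasks_healthy > 0:
    per_task = OFFERED_LOAD / tasks_healthy
    print(f"{tasks_healthy:2d} healthy tasks -> {per_task:,.0f} rps each")
    if per_task <= CAPACITY_PER_TASK:
        print("cluster stabilizes")
        break
    tasks_healthy -= 1        # one more task tips into an unhealthy state
else:
    print("no healthy tasks left: total collapse")
```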

Autoscaling can’t keep pace

So, now the system is saturated, and the load balancer is effectively making things worse. Instead of shedding load, it’s concentrating load on the tasks that aren’t overloaded yet.

Now, this is the cloud, and the cloud is elastic, and we have a wonderful automation system called the autoscaler that can help us in situations of overload by automatically provisioning new capacity.

Only, there’s a problem here, and that’s that the autoscaler simply can’t scale up fast enough. And the reason it can’t scale up fast enough is because of another automation system that’s intended to help in times of overload: Linux’s OOM killer.

The growth of off-heap memory caused the Linux Out Of Memory Killer to terminate all of the running containers in the first 2 minutes, causing a cascading failure across all API Gateway tasks. This outpaced our autoscaling capability, ultimately leading to all requests to canva.com failing.

Operating systems need access to free memory in order to function properly. When all of the memory is consumed by running processes, the operating system runs into trouble. To guard against this, Linux has a feature called the OOM killer which will automatically terminate a process when the operating system is running too low on memory. This frees up memory, enabling the OS to keep functioning.

So, you have the autoscaler which is adding new tasks, and the OOM killer which is quickly destroying existing tasks that have become overloaded.

It’s notable that Humphries uses the term outpaced. This sort of scenario is common in complex systems failures: the system gets into a state where it simply can’t keep up. This phenomenon is called decompensation. Here’s resilience engineering pioneer David Woods describing decompensation on John Willis’s Profound Podcast:

And lag is really saturation in time. That’s what we call decompensation, right? I can’t keep pace, right? Events are moving forward faster. Trouble is building and compounding faster than I, than the team, than the response system can decide on and deploy actions to affect. So I can’t keep pace. – David Woods
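Here’s a toy sketch of that dynamic, with made-up rates rather than Canva’s real numbers: the autoscaler adds tasks at a fixed rate, but overloaded tasks get OOM-killed faster than their replacements arrive, so the healthy count never recovers.

```python
# A toy sketch of being "outpaced" (made-up rates, not Canva's real numbers):
# the autoscaler adds tasks at a fixed rate, but overloaded tasks are OOM-killed
# faster than replacements arrive, so the healthy count never recovers.

healthy = 50
ADD_PER_MIN = 5        # tasks the autoscaler can bring up per minute
KILLED_PER_MIN = 12    # overloaded tasks the OOM killer takes down per minute

for minute in range(1, 11):
    healthy = max(0, healthy + ADD_PER_MIN - KILLED_PER_MIN)
    print(f"minute {minute:2d}: {healthy} healthy tasks")
    if healthy == 0:
        print("the response can't keep pace: decompensation")
        break
```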

Adapting the system to bring it back up

At this point, the API gateway cluster is completely overwhelmed. From the timeline:

9:07 AM UTC – Network issue resolved, but the backlog of queued requests result in a spike of 1.5 million requests per second to the API gateway.

9:08 AM UTC – API Gateway tasks begin failing due to memory exhaustion, leading to a full collapse.

When your system is suffering from overload, there are basically two strategies:

  1. increase the capacity
  2. reduce the load

Wisely, the Canva engineers pursued both strategies in parallel.

Max capacity, but it still isn’t enough

Montgomery Scott, my nominee for patron saint of resilience engineering

 We attempted to work around this issue by significantly increasing the desired task count manually. Unfortunately, it didn’t mitigate the issue of tasks being quickly terminated.

The engineers tried to increase capacity manually, but even with the manual scaling, the load was too much: the OOM killer was taking the tasks down too quickly for the system to get back to a healthy state.

Load shedding, human operator edition

The engineers had to improvise a load shedding solution in the moment. The approach they took was to block traffic at the CDN layer, using Cloudflare.

 At 9:29 AM UTC, we added a temporary Cloudflare firewall rule to block all traffic at the CDN. This prevented any traffic reaching the API Gateway, allowing new tasks to start up without being overwhelmed with incoming requests. We later redirected canva.com to our status page to make it clear to users that we were experiencing an incident.

It’s worth noting here that while Cloudflare contributed to this incident with the stale rule, the fact that Canva’s engineers could dynamically configure Cloudflare firewall rules meant that Cloudflare also contributed to the mitigation of this incident.

Ramping the traffic back up

Here they turned off all of their traffic to give their system a chance to go back to healthy. But a healthy system under zero load behaves differently from a healthy system under typical load. If you jump straight back from zero to typical, there’s a risk that you push the system back into an unhealthy state. (One common problem is that autoscaling will have scaled multiple services down while there’s no load.)

Once the number of healthy API Gateway tasks stabilized to a level we were comfortable with, we incrementally restored traffic to canva.com. Starting with Australian users under strict rate limits, we gradually increased the traffic flow to ensure stability before scaling further.

The Canva engineers had the good judgment to ramp up the traffic incrementally rather than turn it back on all at once. They started restoring at 9:45 AM UTC, and were back to taking full traffic at 10:04 AM.
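The general pattern here, sketched below with made-up numbers and placeholder functions, is to admit a growing fraction of traffic, pause to check health at each step, and back off if the system looks like it’s tipping over again. This is my own illustration, not Canva’s runbook or tooling.

```python
import time

# A sketch of the general "ramp traffic back up in stages" pattern, with made-up
# numbers and placeholder functions -- not Canva's actual runbook or tooling.

RAMP_STEPS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]  # fraction of traffic admitted

def set_admitted_fraction(fraction: float) -> None:
    # Placeholder: in practice this would update a CDN firewall or rate-limit rule.
    print(f"admitting {fraction:.0%} of traffic")

def cluster_is_healthy() -> bool:
    # Placeholder: in practice this would check task counts, error rates, latency.
    return True

for fraction in RAMP_STEPS:
    set_admitted_fraction(fraction)
    time.sleep(60)                            # let the system settle at this level
    if not cluster_is_healthy():
        set_admitted_fraction(fraction / 2)   # back off and hold while investigating
        break
```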

Some general observations

All functional requirements met

I always like to call out situations where, from a functional point of view, everything was actually working fine. In this case, even though there was a stale rule in the Cloudflare traffic management system, and there was a performance regression in the API gateway, everything was working correctly from a functional perspective: packets were still being routed between Singapore and Northern Virginia, and the API gateway was still returning the proper responses for individual requests before it got overloaded.

Rather, these two issues were both performance problems. Performance problems are much harder to spot, and the worst are the ones that you don’t notice until you’re under heavy load.

The irony is that, as an organization gets better at catching functional bugs before they hit production, more and more of the production incidents they face will be related to these more difficult-to-detect-early performance issues.

Automated systems made the problem worse

There were a number of automated systems in play whose behavior made this incident more difficult to deal with.

The Concurrent Streaming Acceleration functionality synchronized the requests from the clients. The OOM killer reduced the time it took for a task to be seen as unhealthy by the load balancer, and the load balancer in turn increased the rate at which tasks went unhealthy.

None of these systems were designed to handle this sort of situation, so they could not automatically change their behavior.

The human operators changed the way the system behaved

It was up to the incident responders to adapt the behavior of the system, to change the way it functioned in order to get it back to a healthy state. They were able to leverage an existing resource, Cloudflare’s firewall functionality, to accomplish this. Based on the description of the action items, I suspect they had never used Cloudflare’s firewall to do this type of load shedding before. But it worked! They successfully adapted the system behavior.

We’re building a detailed internal runbook to make sure we can granularly reroute, block, and then progressively scale up traffic. We’ll use this runbook to quickly mitigate any similar incidents in the future.

This is a classic example of resilience, of acting to reconfigure the behavior of your system when it enters a state that it wasn’t originally designed to handle.

As I’ve written about previously, Woods talks about the idea of a competence envelope. The competence envelope is sort of a conceptual space of the types of inputs that your system can handle. Incidents occur when your system is pushed to operate outside of its competence envelope, such as when it gets more load than it is provisioned to handle.

The competence envelope is a good way to think about the difference between robustness and resilience. You can think of robustness as describing the competence envelope itself: a more robust system may have a larger competence envelope, it is designed to handle a broader range of problems.

However, every system has a finite competence envelope. The difference between a resilient and a brittle system is how that system behaves when it is pushed just outside of its competence envelope.

Incidents happen when the system is pushed outside of its competence envelope

A resilient system can change the way it behaves when an incident pushes it outside of its competence envelope, extending the envelope so that it can handle the incident. That’s why we say it has adaptive capacity. On the other hand, a brittle system is one that cannot adapt effectively when it exceeds its competence envelope. A system can be very robust, but also brittle: it may be able to handle a very wide range of problems, but when it faces a scenario it wasn’t designed to handle, it can fall over hard.

The sort of adaptation that resilience demands requires human operators: our automation simply doesn’t have a sophisticated enough model of the world to be able to handle situations like the one that Canva found itself in.

In general, action items after an incident focus on expanding the competence envelope: making changes to the system so that it can handle the scenario that just happened. Improving adaptive capacity involves a different kind of work than improving system robustness.

We need to build in the ability to reconfigure our systems in advance, without knowing exactly what sorts of changes we’ll need to make. The Canva engineers had some powerful operational knobs at their disposal through the Cloudflare firewall configuration. This allowed them to make changes. The more powerful and generic these sorts of dynamic configuration features are, the more room for maneuver we have. Of course, dynamic configuration is also dangerous, and is itself a contributor to incidents. Too often we focus solely on the dangers of such functionality in creating incidents, without seeing its ability to help us reconfigure the system to mitigate incidents.

Finally, these sorts of operator interfaces are of no use if the responders aren’t familiar with them. Ultimately, the more your responders know about the system, the better position they’ll be in to implement these adaptations. Changing an unhealthy system is dangerous: no matter how bad things are, you can always accidentally make things worse. The more knowledge about the system you can bring to bear during an incident, the better position you’ll be in to adapt your system to extend that competence envelope.

Quick takes on the recent OpenAI public incident write-up

OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:

Saturation

With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters.

The term saturation describes the condition where a system has reached the limit of what it can handle. This is sometimes referred to as overload or resource exhaustion. In the OpenAI incident, it was the Kubernetes API servers that were saturated, because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed.

Saturation is an extremely common failure mode in incidents, and here OpenAI provides us with yet another example. You can also read some previous posts about public incident writeups involving saturation: Cloudflare, Rogers, and Slack.

All tests pass

The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue.

One reason why it’s difficult to prevent saturation-related incidents is that all of the software can be functionally correct, in the sense that it passes all of the functional tests; the failure mode only rears its ugly head once the system is exposed to conditions that occur only in the production environment. Even canarying with production traffic can’t prevent problems that only occur under full load.

Our main reliability concern prior to deployment was resource consumption of the new telemetry service. Before deployment, we evaluated resource utilization metrics in all clusters (CPU/memory) to ensure that the deployment wouldn’t disrupt running services. While resource requests were tuned on a per cluster basis, no precautions were taken to assess Kubernetes API server load. This rollout process monitored service health but lacked sufficient cluster health monitoring protocols.

It’s worth noting that the engineers did validate the change in resource utilization on the clusters where the new telemetry configuration was deployed. The problem was an interaction: it increased load on the API servers, which brings us to the next point.

Complex, unexpected interactions

This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways.

When we look at system failures, we often look for problems in individual components. But in complex systems, identifying the complex, unexpected interactions can yield better insights into how failures happen. You don’t just want to look at the boxes, you also want to look at the arrows.

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

So, we rolled out the new telemetry service, and, yada yada yada, our services couldn’t call each other anymore.

In this case, the surprising interaction was between a failure of the Kubernetes API and the resulting failure of services running on top of Kubernetes. Normally, if you have services running on top of Kubernetes and the Kubernetes API goes unhealthy, your services should keep running normally; you just can’t make changes to your current deployment (e.g., deploy new code, change the number of pods). However, in this case, a failure in the Kubernetes API (control plane) ultimately led to failures in the behavior of running services (data plane).

The coupling between the two? It was DNS.

DNS

In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.

Impact of a change is spread out over time

DNS caching added a delay between making the change and when services started failing.

One of the things that makes DNS-related incidents difficult to deal with is the nature of DNS caching.

When the effect of a change is spread out over time, this can make it more difficult to diagnose what the breaking change was. This is especially true when the critical service that stopped working (in this case, service discovery) was not the thing that was changed (telemetry service deployment).

DNS caching made the issue far less visible until the rollouts had begun fleet-wide.

In this case, the effect was spread out over time because of the nature of DNS caching. But often we intentionally spread out a change over time because we want to reduce the blast radius if the change we are rolling out turns out to be a breaking change. This works well if we detect the problem during the rollout. However, this can also make it harder to detect the problem, because the error signal is smaller (by design!). And if we only detect the problem after the rollout is complete, it can be harder to correlate the change with the effect, because the change was smeared out over time.
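Here’s a toy illustration of that smearing effect, with my own numbers rather than OpenAI’s: the breaking change lands at t=0, but each client only notices once its cached DNS record expires, so the failures trickle in over the whole TTL window.

```python
# A toy illustration of how caching smears out the effect of a change (my own
# numbers, not OpenAI's): the breaking change lands at t=0, but each client only
# sees it once its cached DNS record expires, so failures trickle in over the TTL.

TTL = 300                                        # seconds a cached record stays valid
last_lookup = [-t for t in range(0, 300, 10)]    # each client last resolved 0-290s ago

for now in range(0, 361, 60):
    failing = sum(1 for t in last_lookup if t + TTL <= now)
    print(f"t={now:3d}s: {failing:2d}/{len(last_lookup)} clients see the failure")
```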

Failure mode makes remediation more difficult

 In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Sometimes the failure mode that breaks systems that production depends upon also breaks systems that operators depend on to do their work. I think James Mickens said it best when he wrote:

I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS

Facebook encountered similar problems when they experienced a major outage back in 2021:

And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this. 

This type of problem often requires that operators improvise a solution in the moment. The OpenAI engineers pursued multiple strategies to get the system healthy again.

We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:

  1. Scaling down cluster size: Reduced the aggregate Kubernetes API load.
  2. Blocking network access to Kubernetes admin APIs: Prevented new expensive requests, giving the API servers time to recover.
  3. Scaling up Kubernetes API servers: Increased available resources to handle pending requests, allowing us to apply the fix.

By pursuing all three in parallel, we eventually restored enough control to remove the offending service.

Their interventions were successful, but it’s easy to imagine scenarios where one of these interventions accidentally made things even worse. As Richard Cook noted: all practitioner actions are gambles. Incidents always involve uncertainty in the moment, and it’s easy to overlook this when we look back with perfect knowledge of how the events unfolded.

A change intended to improve reliability

As part of a push to improve reliability across the organization, we’ve been working to improve our cluster-wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.

This is a great example of unexpected behavior of a subsystem whose primary purpose was to improve reliability. This is another data point for my conjecture on why reliable systems fail.

Your lying virtual eyes

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup

In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, we interact through representations. In particular, we view representations of the internal state of the system, and we manipulate these representations in order to effect changes, to control the system. Cook used the term line of representation to describe the split between the world of the technical (software) system and the world of the people who work with the technical system. The people are above the line of representation, and the technical system is below the line.

Above the line of representation are the people, organizations, and processes that shape, direct, and restore the technical artifacts that lie below that line. People who work above the line routinely describe what is below the line using concrete, realistic language.

Yet, remarkably, nothing below the line can be seen or acted upon directly. The displays, keyboards, and mice that constitute the line of representation are the only tangible evidence that anything at all lies below the line. All understandings of what lies below the line are constructed in the sense proposed by Bruno Latour and Steve Woolgar. What we “know”—what we can know—about what lies below the line depends on inferences made from representations that appear on the screens and displays.

In short, we can never actually see or change the system directly; all of our interactions are mediated through software interfaces.

René Magritte would have appreciated Cook’s article

In this post, I want to talk about how this fact can manifest as incidents, and that our solutions rarely consider this problem. Let’s start off, as we so often do in the safety world, with the Three Mile Island accident.

Three Mile Island and the indicator light

I assume the reader has some familiarity with the partial meltdown that occurred at the Three Mile Island nuclear plant back in 1979. As it happens, there’s a great series of lectures by Cook on accidents. His first lecture is about how Three Mile Island changed the way safety specialists thought about the nature of accidents.

Here I want to focus on just one aspect of this incident: a particular indicator light in the Three Mile Island control room. During this incident, there was a type of pressure relief valve called a pilot-operated relief valve (PORV) that was stuck open. However, the indicator light for the state of this valve was off, which the operators interpreted (incorrectly, alas) as the valve being closed. Here I’ll quote the wikipedia article:

A light on a control panel, installed after the PORV had stuck open during startup testing, came on when the PORV opened. When that light—labeled Light on – RC-RV2 open—went out, the operators believed that the valve was closed. In fact, the light, when on, only indicated that the PORV pilot valve’s solenoid was powered, not the actual status of the PORV. While the main relief valve was stuck open, the operators believed the unlighted lamp meant the valve was shut. As a result, they did not correctly diagnose the problem for several hours.

What I found notable was the article’s comment about lack of operator training to handle this specific scenario, a common trope in incident analysis.

The operators had not been trained to understand the ambiguous nature of the PORV indicator and to look for alternative confirmation that the main relief valve was closed. A downstream temperature indicator, the sensor for which was located in the tail pipe between the pilot-operated relief valve and the pressurizer relief tank, could have hinted at a stuck valve had operators noticed its higher-than-normal reading. It was not, however, part of the “safety grade” suite of indicators designed to be used after an incident, and personnel had not been trained to use it. Its location behind the seven-foot-high instrument panel also meant that it was effectively out of sight.

Now, consider what happens if the agent acting on these sensors is an automated control system instead of a human operator.

Sensors, automation, and accidents: cases from aviation

In the aviation world, we have a combination of automation and human operators (pilots) who work together in real-time. The assumption is that if something goes wrong with the automation, the human can quickly take over and deal with the problem. But automation can make things too difficult for a human to compensate for, and automation can be particularly vulnerable to sensor problems, as we can see in the following accidents:

Bombardier Learjet 60 accident, 2008

On September 19, 2008, in Columbia, South Carolina, a Bombardier Learjet 60 overran the runway during a rejected takeoff. As a consequence, four people aboard the plane, including the captain and first officer, were killed. In this case, the sensor issues were due to damage to electronics in the wheel well area after underinflated tires on the landing gear exploded.

The pilots reversed thrust to slow down the plane. However, the tires on the plane were under-inflated, and they exploded. As a result of the tire explosion, sensors in the wheel well area of the plane were damaged.

The thrust reverse system relies on sensor data to determine whether reversing thrust is a safe operation. Because of the sensor damage, the system determined that it was not safe to reverse thrust, and instead increased forward thrust. From the NTSB report:

In this situation, the EECs would transition from the reverse thrust power schedule to the forward thrust power schedule during about a 2-second transition through idle power. During the entire sequence, the thrust reverser levers in the cockpit would remain in the reverse thrust idle position (as selected by the pilot) while the engines produced forward thrust. Because both the thrust reverser levers and the forward thrust levers share common RVDTs (one for the left engine and one for the right engine), the EECs, which receive TLA information from the RVDTs, would signal the engines to produce a level of forward thrust that generally corresponds with the level of reverse thrust commanded; that is, a pilot commanding full reverse thrust (for maximum deceleration of the airplane) would instead receive high levels of forward thrust (accelerating the airplane) according to the forward thrust power schedule

(My initial source for this was John Thomas’s slides.)

Air France 447, 2009

On June 1, 2009, Air France 447 crashed, killing all passengers and crew. The plane was an Airbus A330-200. In this accident, the sensor problem is believed to have been caused by ice crystals that accumulated inside the pitot tube sensors, creating a blockage which led to erroneous readings. Here’s a quote from an excellent Vanity Fair article on the crash:

Just after 11:10 P.M., as a result of the blockage, all three of the cockpit’s airspeed indications failed, dropping to impossibly low values. Also as a result of the blockage, the indications of altitude blipped down by an unimportant 360 feet. Neither pilot had time to notice these readings before the autopilot, reacting to the loss of valid airspeed data, disengaged from the control system and sounded the first of many alarms—an electronic “cavalry charge.” For similar reasons, the automatic throttles shifted modes, locking onto the current thrust, and the fly-by-wire control system, which needs airspeed data to function at full capacity, reconfigured itself from Normal Law into a reduced regime called Alternate Law, which eliminated stall protection and changed the nature of roll control so that in this one sense the A330 now handled like a conventional airplane. All of this was necessary, minimal, and a logical response by the machine.

This is what the safety researcher David Woods refers to as bumpy transfer of control, where the humans must suddenly and unexpectedly take over control of an automated system, which can lead to disastrous consequences.

Boeing 737 MAX 8 (2018, 2019)

On October 29, 2018, Lion Air Flight 610 crashed thirteen minutes after takeoff, killing everyone on board. Five months later, on March 10, 2019, Ethiopian Airlines Flight 302 crashed six minutes after takeoff, also killing everyone on board. Both planes were Boeing 737 MAX 8s. In both cases, the sensor problem was related to the angle-of-attack (AOA) sensor.

Lion Air Flight 610 investigation report:

The replacement AOA sensor that was installed on the accident aircraft had been mis-calibrated during an earlier repair. This mis-calibration was not detected during the repair.

Ethiopian Airline Flight 302 investigation report:

Shortly after liftoff, the left Angle of Attack sensor recorded value became erroneous and the left stick shaker activated and remained active until near the end of the recording.

An automation subsystem in the 737 MAX called Maneuvering Characteristics Augmentation System (MCAS) automatically pushed the nose down in response to the AOA sensor data.

What should we take away from these?

Here I’ve given examples from aviation, but sensor-automation problems are not specific to that domain. Here are a few of my own takeaways.

We designers can’t assume sensor data will be correct

The kinds of safety automation subsystems we build in tech are pretty much always closed-loop control systems. When designing such systems in the tech world, how often have you heard someone ask, “what happens if there’s a problem with the sensor data that the system is reacting to?”

This goes back to the line of representation problem: no agent ever gets access to the true state of the system; it only gets access to some sort of representation. The irony here is that this doesn’t just apply to humans (above the line) making sense of signals, it also applies to technical system components (below the line!) making sense of signals from other technical components.
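As a sketch of what taking that question seriously might look like (entirely hypothetical, not drawn from any of the systems above): cross-check redundant readings, and when they disagree, have the automation stand down and hand the decision to a human rather than silently trusting a single signal.

```python
import statistics

# A hypothetical sketch, not drawn from any of the systems above: a closed-loop
# controller that treats sensor data as fallible. It cross-checks redundant
# readings and stands down (alerting a human) when they disagree, instead of
# silently trusting a single signal.

DISAGREEMENT_THRESHOLD = 5.0   # max spread we're willing to trust (made-up units)
ACTION_LIMIT = 100.0           # reading above which we'd normally take corrective action

def decide(readings: list[float]) -> str:
    spread = max(readings) - min(readings)
    if spread > DISAGREEMENT_THRESHOLD:
        # Don't act on data we can't trust; surface the conflict to the operator.
        return "alert operator: sensors disagree, automation standing down"
    if statistics.median(readings) > ACTION_LIMIT:
        return "take corrective action"
    return "no action needed"

print(decide([98.0, 99.0, 97.5]))    # consistent readings: trust the data
print(decide([98.0, 99.0, 140.0]))   # one sensor way off: hand off to the human
```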

Designing a system that is safe in the face of sensor problems is hard

Again, from the NTSB report of the Learjet 60 crash:

Learjet engineering personnel indicated that the uncommanded stowage of the thrust reversers in the event of any system loss or malfunction is part of a fail-safe design that ensures that a system anomaly cannot result in a thrust reverser deployment in flight, which could adversely affect the airplane’s controllability. The design is intended to reduce the pilot’s emergency procedures workload and prevent potential mistakes that could exacerbate an abnormal situation.

The thrust reverser system behavior was designed by aerospace engineers to increase safety, and ended up making things worse! Good luck imagining all of these sorts of scenarios when you design your systems to increase safety.

Even humans struggle in the face of sensor problems

People are better equipped to handle sensor problems than automation, because we don’t seem to be able to build automation that can handle all of the possible kinds of sensor problems that we might throw at it.

But even for humans, sensor problems are difficult. While we’ll eventually figure out what’s going on, we’ll still struggle in the face of conflicting signals, as anyone who has responded to an incident can tell you. And in high-tempo situations, where we need to respond quickly enough or something terrible will happen (like in the Air France 447 case), we simply might not be able to respond quickly enough.

Instead of focusing on building the perfect fail-safe system to prevent this next time, I wish we’d spend more time thinking about: “how can we help the human figure out what the heck is happening when the input signals don’t seem to make sense?”