SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the background of SRE here, but there are many other sources (including the Google books) for those who want to know more about the background.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, where researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focused expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on behalf of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", what I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and where we try to stay away from.

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the system boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that holds across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering , which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.

I’ve also started a CSE reading list.

How did software get so reliable without proof?

In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper with the title How Did Software Get So Reliable Without Proof? In this paper, Hoare grapples with the observation that software seems to be more reliable than computer science researchers expected was possible without the use of mathematical proofs for verification (emphasis added):

Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …

So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever has been subjected to the rigours of formal proof?

Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.

Looking back on this paper, what strikes me is the absence of acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added)

The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.

The problem with the question “How did software get so reliable without proof?” is that it’s asking the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you are doomed to not even be able to ask the right questions.

Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

John Graham-Cumming, Cloudflare’s CTO, wrote a detailed writeup of a Cloudflare incident that happened on 2019-07-02. Here’s a categorization similar to the one I did for the Stripe outage.

Note that Graham-Cumming has a “What went wrong” section in his writeup where he explicitly enumerates 11 different contributing factors; I’ve sliced things a little differently here: I’ve taken some of those verbatim, reworded some of them, and left out some others.

All quotes from the original writeup are in italics.

Contributing factors

Remember not to think of these as “causes” or “mistakes”. They are merely all of the things that had to be true for the incident to manifest, or for it to be as severe as it was.

Regular expression lead to catastrophic backtracking

A regular expression used in a firewall engine rule resulted in catastrophic backtracking:


Simulated rules run on same nodes as enforced rules

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

Failure mode prevented access to internal services

But getting to the global WAF [web application firewall] kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel … And we couldn’t get to other internal services like Jira or the build system.

Security feature disables credentials for infrequent use for an operator interface

[O]nce we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently

Bypass mechanisms not frequently used

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

WAF changes are deployed globally

The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats … Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

The fact that WAF changes can only be done globally exacerbated the incident by increasing the size of the blast radius.

WAF implemented in Lua, which uses PCRE

Cloudflare makes use of Lua extensively in production … The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression.

The regular expression engine being used didn’t have complexity guarantees.

Based on the writeup, it sounds like they used the PCRE regular expression library because PCRE is the regex library that ships with Lua, and Lua is the language they use to implement the WAF.

Protection accidentally removed by a performance improvement refactor

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

Cloudflare dashboard and API are fronted by the WAF

Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.


Paging alert quickly identified a problem

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic processThree minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Engineers recognized high severity based on alert pattern

This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

Existence of a kill switch

At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.


Declarative program performance is hard to reason about

Regular expressions are examples of declarative programs (SQL is another good example of a declarative programming language). Declarative programs are elegant because you can specify what the computation should do without needing to specify how the computation can be done.

The downside is that it’s impossible to look at a declarative program and understand the performance implications, because there isn’t enough information in a declarative program to let you know how it will be executed! You have to be familiar with how the interpreter/compiler works to understand the performance implications of a declarative program. Most programmers probably don’t know how regex libraries are implemented.

Simulating in production environment

For rule-based systems, it’s enormously valuable for an engineer to be able to simulate what effect the rules will have before they’re put into effect, as it is generally impossible to reason about their impacts without doing simulation.

The more realistic the simulation is, the more confidence we have that the results of the simulation will correspond to the actual results when the rules are enabled in production.

However, there is always a risk of doing the simulation in the production environment, because the simulation is a type of change, and all types of change carry some risk.

Severe outages happen infrequently

[S]ome members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently.

To get to [internal services] we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

The irony is that when we only encounter severe outages infrequently, we don’t have the opportunity to exercise the muscles we need to use when these outages do happen.

Large blast radius

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

In the future, it sounds like non-emergency rule changes will be staged at Cloudflare. But the functionality will still exist for global changes, because it needs to be there for emergency rule changes. They can reduce the amount of changes that need to get pushed globally, but they can’t drive it down to zero. This is an inevitable risk tradeoff.


Why hasn’t this happened before?

You’re not generally supposed to ask “why” questions, but I can’t resist this one. Why did it take so long for this failure mode to manifest? Hadn’t any of the engineers at Cloudflare previously written a rule that used a regex with pathological backtracking behavior? Or was it that refactor that removed their protection from excessive CPU load in the case of regex backtracking?

What was the motivation for the refactor?

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

What was the reason they were trying to make the WAF use less CPU? Were they trying to reduce the cost by running on fewer nodes? Were they just trying to run cooler to reduce the risk of running out of CPU? Was there some other rationale?

What’s the history of WAF rule deployment implementation?

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout. [emphasis added]

The WAF rule system is designed to support quickly pushing out rules globally to protect against new attacks. However, not all of the rules require quick global deployment. Yet, this was the only mechanism that the WAF system supported, even though code changes support staged rollout.

The writeup simply mentions this as a contributing factor, but I’m curious as to how the system came to be that way. For example, was it originally designed with only quick rule deployment in mind? Were staged code deploys introduced into Cloudflare only after the WAF system was built?

Other interesting notes

The WAF rule update was normal work

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. 

Based on the writeup, it sounds like this was a routine change. It’s important to keep in mind that incidents often occur as a result of normal work.

Multiple sources of evidence to diagnose CPU issue

The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble.

It was interesting to read how they triangulated on high CPU usage using multiple data sources.

Normative language in the writeup

Emphasis added in bold.

We know how much this hurt our customers. We’re ashamed it happened.

The rollback plan required running the complete WAF build twice taking too long.

The first alert for the global traffic drop took too long to fire.

We didn’t update our status page quickly enough.

Normative language is one of the three analytical traps in accident investigation. If this was an internal writeup, I would avoid the language criticizing the rollback plan, the alert configuration, and the status page decision, and instead I’d ask questions about how these came to be, such as:

Was this the first time the rollback plan was carried out? (If so, that may explain the reason why it wasn’t known how long it would take).

Is the global traffic drop alert configuration typical of the other alerts, or different? If it’s different (i.e., other alerts fire faster?) what led to it being different? If it’s similar to other alert configurations, that would explain why it was configured to be “too long”.

Work to reduce CPU usage contributed to excessive CPU usage

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

These sorts of unintended consequences are endemic when making changes within complex systems. It’s an important reminder that the interventions we implement to prevent yesterday’s incidents from recurring may contribute to completely new failure modes tomorrow.

Contributors, mitigators & risks: Stripe 2019-07-10 outage

Stripe’s CTO, David Singleton, did a detailed narrative writeup of the incident they had on 2019-07-10. I love narrative descriptions of incidents, and there’s a ton of great detail here.

As an exercise, using the writeup, I collected some aspects of the incident into the following sections:

Contributing factors: what were all of the conditions that had to be present for the outage to happen, or for it to be as severe as it was? It’s important not to think of these as causes, or mistakes, or even bad things.

Mitigators: What kept the incident from being worse that it was?

Risks: What are the more general risks that this incident reveals?

In an ideal world, I’d talk to the people involved directly to get more details, but we work with what we have. I don’t summarize the incident here, so I recommend reading the Stripe writeup first.

Text in italics is copied verbatim from the writeup.

Contributing factors

Minor database version upgrade

Three months ago, we upgraded our databases to a new minor version.the new version … introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes.

One shard had multiple stalled nodes

Two nodes became stalled for yet-to-be-determined reasons. …  a subtle fault in the database’s failover system … only manifested in the presence of multiple stalled nodesOn the day of the events, one shard was in the specific state that triggered this fault, and the shard was unable to elect a new primary.

Stalled nodes reported as healthy

 These [stalled] nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The database nodes support health checks, but these health checks did not detect the problem. We aren’t provided with any more details about the health check failure mode.

Database writes time out when shard has no elected primary

Without a primary, the shard was unable to accept writes. Applications that write to the shard began to time out. 

Problem manifested on a critical shard

Stripe splits data by kind into different database clusters and by quantity into different shards. Because of widespread use of this shard across applications, including the API, the unavailability of this shard … cascaded into a severe API degradation.

Based on this description, it sounds like this incident would have been much less severe if the problem had manifested on a shard other than this one.

We don’t know how it was that this particular shard was the one where the nodes stalled. It might just be bad luck. Sometimes, that’s the only difference between an incident and a surprise.

Timeouts lead to compute resource starvation

Applications that write to the shard began to time out. Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API

Novel, complex failure mode

  • [2019-07-10 16:36 UTC] Our team was alerted and we began incident response.
  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.

Because this was a complex failure mode that we had not previously experienced, we needed to diagnose the underlying cause and determine the steps to remediate.

The language in the writeup suggests that the complexity and novelty of the failure mode made it more difficult for them to diagnose the problem. However, the timeline suggests that it took them about 14 minutes to figure out that the database was in a bad state. That sounds pretty good to me(!).

Remediation required restarting database cluster

  • [2019-07-10 16:50 UTC] We determined the cluster was unable to elect a primary.
  • [2019-07-10 17:00 UTC] We restarted all nodes in the database cluster, resulting in a successful election.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.

Our team identified forcing the election of a new primary as the fastest remediation available, but this required restarting the database cluster. 

The “but” in the sentence above suggests that restarting the database cluster was not an ideal remediation strategy, but there isn’t a rationale given for why it isn’t. It’s not clear from the timeline how long it took to reboot all of the database nodes: it looks like it could be 2 minutes, which sounds pretty quick to me.

Rolling back database version as remediation strategy

  • [2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

After mitigating user impact, we investigated the root cause and identified a likely code path in a new version of the database’s election protocol. We decided to revert to the previous known stable version for all shards of the impacted cluster.  We deployed this change within four minutes, and until 21:14 UTC the cluster was healthy.

In the moment, rolling back the database version was clearly the rational action to take. Unfortunately for Stripe …. well, see the next contributing factor.

(Also, 4 minutes sounds pretty quick for reverting that database version!)

Recent configuration change to affected shards

 [T]he second period of degradation had a different cause: our revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. This interaction resulted in CPU starvation on all affected shards.

We don’t have additional information about this configuration change: presumably it happened after the new version of the database had been deployed.

Second, novel failure mode had same symptoms as the first failure mode

We initially assumed that the same issue had reoccurred on multiple shards, as the symptoms appeared the same as the earlier event. We therefore followed the same mitigation playbook that succeeded earlier.


Unfortunately, there’s not much detail in the writeup to identify the mitigating factors that were in play here, which is a shame, because it sounds like Stripe was able to employ a lot of expertise in order to effectively diagnose and remediate the problems they encountered. And there’s just as much that we can learn from what went right.

Monitoring quickly detected a database problem

Automated monitoring detected the failed election within a minute.

Quick engagement

We began incident response within two minutes.


Gray failure (sensor problem)

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

The stalled database nodes passed their health checks. This is a classic example of a gray failure, where there’s some internal failure but it isn’t detected by the internal failure detector.

Gray failures are pernicious because it’s very difficult (perhaps impossible?) to design a system that can handle failures that it cannot detect. These can also be hard to diagnose, because some of the sensors we are depending on to tell us about the state of the world are not giving us the complete story. We have to depend on integrating multiple sources of data, none of which are every completely reliable.

Service with many dependents goes latent

Because of widespread use of this shard across applications, including the API, the unavailability of this shard starved compute resources for the API and cascaded into a severe API degradation.

It’s very difficult to reason about how a distributed system behaves when one of the services goes latent (that’s one of the value propositions of chaos engineering approaches, like ChAP). In circumstances when a service has multiple dependencies, a latency increase can ripple across the system with dire consequences.

Interaction vulnerabilities that involve rare events

As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers. However, the new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. 

Phased rollouts are a great way to build confidence when rolling out new changes. However, in some cases the condition that will trigger a failure mode doesn’t occur often enough to be caught during a phased rollout process.

In this case, the triggering event was when multiple nodes were stalled. That was an uncommon enough event that it didn’t happen during the phased deployment.

Remediations can introduce new failure modes

Incident reviews generally produce action items that are intended to ensure that the same problem doesn’t recur. The risk with these remediations is that they introduce entirely new problems. In this particular case, the database rollback, a remediation action item, introduced a new failure mode.

We should remediate known problems! But we should also always be mindful that focusing too much on “let’s make sure this failure mode can never happen again” can crowd out questions like “how might these proposed remediations lead to new failure modes?”

(I don’t fault Stripe for their actions in this case: I’m quite certain I would have taken the same action as they did in rolling back the database version to the last known good one).

Stripe outlined the following remediation actions going forward.

We are also introducing several changes to prevent failures of individual shards from cascading across large fractions of API traffic. This includes additional circuit-breaking on failed operations to particular clusters, including the one implicated in these events. We will also pursue additional fault isolation techniques to contain the impact of a single failed shard and limit resource consumption by clients attempting repeated retries of failed requests.

It’s not hard to imagine that these new circuit breaker, fault isolation, and resource consumption limiting strategies may new and even more complex failure modes.

Same symptoms, different problem

  • [2019-07-10 16:35 UTC] The first period of degradation started when the primary node for the database cluster failed.
  • [2019-07-10 17:02 UTC] The Stripe API fully recovered.
  • [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.
  • [2019-07-10 21:14 UTC] We observed high CPU usage in the database cluster. The Stripe API started returning errors for users, marking the start of a second period of severe degradation.

Anyone who has done operations work before will tell you that if you get paged the same day with the same symptoms, you are going to assume it is a reoccurrence of the issue you just remediated. And, usually, it is. But sometimes it isn’t, and that’s what happened in this case.

Rollback leads to unexpected interaction

You can never really roll back a distributed system to a previous state. A rollback, like any other kind of change, can have unexpected consequences due to interactions with other parts of the system that have since changed. It’s easy to forget this, especially since rollbacks are usually effective as a remediation strategy!

Unanswered questions

Here are some questions I had that aren’t addressed in the writeup.

What was the rationale for the original migration?

The writeup doesn’t describe the rationale for upgrading the databases to a new minor version in the first place. Was it to fix an ongoing issue? To leverage a new feature? Good hygiene in keeping versions up to date?

How did they identify that nodes were stalled?

Did the identification of stalled nodes happen in-the-moment, or was this part of post-incident investigation? How did they diagnose that nodes were stalled?

How did they diagnose the failure mode?

How did they figure out that the failure mode was an interaction between stalled nodes and the new version of the database?

Were there risks associated with restarting the entire database cluster?

In hindsight, restarting the database cluster was the right thing to do. How did things look in the moment?

What database are they using?

The writeup doesn’t say which database they were using, and which versions they were running and upgraded to.

What was the problem with the database node health checks?

These nodes stopped emitting metrics reporting their replication lag but continued to respond as healthy to active checks.

What led to the database node health checks reporting healthy for stalled nodes?

How did the two database nodes stall?

Two nodes became stalled for yet-to-be-determined reasons.

Stripe didn’t know the answer to this question at the time of the writeup.

What was the nature of the configuration change?

[O]ur revert to a known stable version interacted poorly with a recently-introduced configuration change to the production shards. 

Stripe doesn’t provide any details on the nature of this configuration change . What was changed? What was the rationale for the change? When it did it happen?

How did the configuration change interact with the database version rollback?

This interaction resulted in CPU starvation on all affected shards.

The writeup doesn’t provide any details into the nature of the CPU starvation other than what is written above.

How did they diagnose the configuration change as a contributing factor?

 Once we observed the CPU starvation, we were able to investigate and identify the root cause. 

How did they investigate this? Where did they look? How did they trace it back to the config change?

Final thought

The database upgrade was normal work

Three months ago, we upgraded our databases to a new minor version. As part of the upgrade, we performed thorough testing in our quality assurance environment, and executed a phased production rollout, starting with less critical clusters and moving on to increasingly critical ones. The new version operated properly in production for the past three months, including many successful failovers.

It doesn’t sound like there was anything atypical about the way they rolled out the new database version. Incidents often happen as a result of normal work! Even more often, though, incidents don’t happen as a result of normal work.

For the rollback, my impression from the writeup is that it was rolled out more quickly than normal database version changes, as a remediation for a known problem for a critical database shard.

Why incidents can’t be monocausal

When an incident happens, the temptation is strong to identify a single cause. It’s as if the system is a chain, and we’re looking for the weak link that was responsible for the chain breaking. But, in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system, how many opportunities there are for problems (bugs, typo in config, bad data, …). If any one of these was enough to take down the system, then it would be like a house of cards, falling down all of the time.

What happens in successful organizations as that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.

Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “what was the root cause for the success of this feature?”

So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.

For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.

Elegance: UI vs implementation

If you ask the question, “What is a Docker container?”, it turns out that the Linux operating system doesn’t actually have a notion of a container at all. Instead, a Docker container refers to a cobbled together set of Linux technologies, such as cgroups, network namespaces, and union filesystems. However, from the point of view of the end-user, a container is very much a real thing. In particular, it exposes to the user images as an entity, and a command-line tool for pulling down images from a repository and running them.

The Docker container implementation may be built (in the Unix tradition!) with duct table and baling wire, but the user interface is elegant. It’s easy for a new user to get started with Docker once they’ve installed it. Bryan Cantrill points out that the advantage of Docker over container technologies developed in the BSD world is Docker’s notion of images as effectively static binaries that allow developers to think operationally and move faster.

Contrast that with git. The implementation is quite elegant: git represents the data under version control as a hash tree, with pointers into nodes in the tree. Git commands are tree manipulations: adding leaf nodes, moving branches from one part of the tree to another, smooshing nodes together, and so on. (I used Subversion for years and had no idea what was going on under the hood).

On the other hand, the command-line interface that git exposes is a nightmare. It’s so hard to use that trying to build a more usable command-line interface is a full-blown academic research project.

The elegance of a user interface and the elegance of an implementation are orthogonal. One doesn’t necessarily lead to the other.

Inductive invariants

Note: to understand this post, it helps to have some familiarity with TLA+. 

Leslie Lamport is a big advocate of inductive invariants. In Teaching Concurrency, he introduces the following simple concurrent algorithm:

Consider N processes numbered from 0 through N−1 in which each process i executes:

x[i] := 1;
y[i] := x[(i−1) mod N]

and stops, where each x[i] initially equals 0.

Lamport continues:

 This algorithm satisfies the following property: after every process has stopped, y[i] equals 1 for at least one process i.

The algorithm satisfies this property because it maintains an inductive invariant [emphasis added]. Do you know what that invariant is? If not, then you do not completely understand why the algorithm satisfies this property. How can a computer engineer design a correct concurrent system without understanding it?

Before I read up on TLA+, I was familiar with the idea of an invariant, a property that is always true throughout the execution of a program. But I didn’t know what an inductive invariant is.

Mathematical induction

An inductive invariant of a specification is an invariant you can prove using mathematical induction. You probably had to do proofs by induction in college or high school math.

To refresh your memory, let’s say you were asked to prove the following formula:

The way this is proved this by induction is:

  1. Base case: Prove it for n=1
  2. Inductive case: Assume it is true for n, and prove that it holds for n+1

In TLA+, you typically specify the allowed behaviors of a specification by writing two predicates:

  1. The initial conditions, conventionally written as Init
  2. The allowed state transitions (which are called steps in TLA+), conventionally written as Next

Let’s say we have a state predicate which we believe is an inductive invariant, which I’ll call Inv. To prove that Inv is an inductive invariant, you need to:

  1. Base case: Prove that Inv is true when Init is true
  2. Inductive case: Assume that Inv holds for an arbitrary state s, prove that Inv holds for an arbitrary state t where the step s→t is one of the state transitions permitted by Next.

In TLA+ syntax, the top-level structured proof of an inductive invariant looks like this:

THEOREM Spec=>[]Inv
<1>1. Init => Inv
<1>2. Inv /\ [Next]_vars => Inv'
<1>3. QED
BY <1>1,<1>2

In practice, you want to prove some invariant that is not an inductive invariant. The proof is typically structured like this:

Correct = ...        \* The invariant you really want to prove
Inv = ... /\ Correct \* the inductive invariant

THEOREM Spec=>[]Correct
<1>1. Init => Inv
<1>2. Inv /\ [Next]_vars => Inv'
<1>3. Inv => Correct
<1>4. QED
BY <1>1, <1>2, <1>3

Not all invariants are inductive

If you’re using TLA+ to check the behavior of a system, you’ll have an invariant that you want to check, often referred to as a correctness property or a safety property. However, not all such invariants are inductive invariants. They usually aren’t.

As an example, here’s a simple algorithm that starts a counter at either 1 or 2, and then decrements it until it reaches 0. The algorithm is written in PlusCal notation.

--algorithm Decrement
variable i \in {1, 2};
while i>0 do
i := i-1;
end while
end algorithm

It should be obvious that i≥0 is an invariant of this specification. Let’s call it Inv. But Inv isn’t an inductive invariant. To understand why, consider the following program state:


Here, Inv  is true, but Inv’ is false, because on the next step of the behavior, i will be decremented to -1.

Consider the above diagram, which is associated associated with some TLA+ specification (not the one described above). A bubble represents a state in TLA+. A bubble is colored gray if it is an initial state in the specification (i.e. if the Init predicate is true of that state). An arrow represents a step, where the Next action is true for the pair of states associated with the arrow.

The inner region shows all of the states that are in all allowed behaviors of
a specification. The outer region represents an invariant: it contains all states where the invariant holds. The middle region represents an inductive invariant: it contains all states where an inductive invariant holds.

Note how there is an arrow from a state where the invariant holds to a state
where the invariant doesn’t hold. That’s because invariants are not inductive in general.

In contrast, for states where the inductive invariant holds, all arrows that
start in those states terminate in states where the inductive invariant holds.

Finding an inductive invariant

The hardest part of proof by inductive invariance is finding the inductive invariant for your specification. If the invariant you come up with isn’t inductive, you won’t be able to prove it by induction.

You can use TLC to help find an inductive invariant. See Using TLC to Check Inductive Invariance for more details.

I want to learn more!

The TLA+ Hyperbook has a section called The TLA+ Proof Track about how to write proofs in TLA+ that can be checked mechanically using TLAPS.

I’ve only dabbled in writing proofs. Here are two that I’ve written that were checked by TLAPS, if you’re looking for simple examples:

TLA+ is hard to learn

I’m a fan of the formal specification language TLA+. With TLA+, you can build models of programs or systems, which helps to reason about their behavior.

TLA+ is particularly useful for reasoning about the behavior of multithread programs and distributed systems. By requiring you to specify behavior explicitly, it forces you to think about interleaving of events that you might not otherwise have considered.

The user base of TLA+ is quite small. I think one of the reasons that TLA+ isn’t very popular is that it’s difficult to learn. I think there are at least three concepts you need for TLA+ that give new users trouble:

  • The universe as a state machine
  • Modeling programs and systems with math
  • Mathematical syntax

The universe as state machine

TLA+ uses a state machine model. It treats the universe as a collection of variables whose values vary over time.

A state machine in sense that TLA+ uses it is similar, but not exactly the same, as the finite state machines that software engineers are used to. In particular:

  • A state machine in the TLA+ sense can have an infinite number of states.
  • When software engineers think about state machines, they think about a specific object or component being implemented as a finite state machine. In TLA+, everything is modeled as a state machine.

The state machine view of systems will feel familiar if you have a background in physics, because physicists use the same approach for system modeling: they define a state variable that evolves over time. If you squint, a TLA+ specification looks identical to a system of first-order differential equations, and associated boundary conditions. But, for the average software engineer, the notion of an entire system as an evolving state variable is a new way of thinking.

The state machine approach requires a set of concepts that you need to understand. In particular, you need to understand behaviors, which requires that you understand statessteps, and actions. Steps can stutter, and actions may or may not be enabled. For example, here’s the definition of “enabled” (I’m writing this from memory):

An action a is enabled for a state s if there exists a state t such that a is true for the step s→t.

It took me a long time to internalize these concepts to the point where I could just write that out without consulting a source. For a newcomer, who wants to get up and running as quickly as possible, each new concept that requires effort to understand decreases the likelihood of adoption.

Modeling programs and systems with math

One of the commonalities across engineering disciplines is that they all work with mathematical models. These models are abstractions, objects that are simplified versions of the artifacts that we intend to build. That’s one of the thing that attracts me about TLA+, it’s modeling for software engineering.

A mechanical engineer is never going to confuse the finite element model they have constructed on a computer with the physical artifact that they are building. Unfortunately, we software engineers aren’t so lucky. Our models superficially resemble the artifacts we build (a TLA+ model and a computer program both look like source code!). But models aren’t programs: a model is a completely different beast, and that trips people up.

Here’s a metaphor: You can think of writing a program as akin to painting, in that both are additive work: You start with nothing and you do work by adding content (statements to your program, paint to a canvas).

The simplest program, equivalent to an empty canvas, is one that doesn’t do anything at all. On Unix systems, there’s a program called true which does nothing but terminate successfully. You can implement this in shell as an empty file. (Incidentally, AT&T has copyrighted this implementation).

By contrast, when you implement a model, you do the work by adding constraints on the behavior of the state variables. It’s more like sculpting, where you start with everything, and then you chip away at it until you end up with what you want.

The simplest model, the one with no constraints at all, allows all possible behaviors. Where the simplest computer program does nothing, the simplest model does (really allows) everything. The work of modeling is adding constraints to the possible behaviors such that the model only describes the behaviors we are interested in.

When we write ordinary programs, the only kind of mistake we can really make is a bug, writing a program that doesn’t do what it’s supposed to. When we write a model of a program, we can also make that kind of mistake. But, we can make another kind of mistake, where our model allows some behavior that would never actually happen in the real world, or isn’t even physically possible in the real world.

Engineers and physicists understand this kind of mistake, where a mathematical model permits a behavior that isn’t possible in the real world. For example, electrical engineers talk about causal filters, which are filters whose outputs only depend on the past and present, not the future. You might ask why you even need a word for this, since it’s not possible to build a non-causal physical device. But it’s possible, and even useful, to describe non-causal filters mathematically. And, indeed, it turns out that filters that perfectly block out a range of frequencies, are not causal.

For a new TLA+ user who doesn’t understand the distinction between models and programs, this kind of mistake is inconceivable, since it can’t happen when writing a regular program. Creating non-causal specifications (the software folks use the term “machine-closed” instead of “causal”) is not a typical error for new users, but underspecifying the behavior some variable of interest is very common.

Mathematical syntax

Many elements of TLA+ are taken directly from mathematics and logic. For software engineers used to programming language syntax, these can be confusing at first. If you haven’t studied predicate logic before, the universal (∀) and extensional (∃) quantifiers will be new.

I don’t think TLA+’s syntax, by itself, is a significant obstacle to adoption: software engineers pick up new languages with unfamiliar syntax all of the time. The real difficulty is in understanding TLA+’s notion of a state machine, and that modeling is describing a computer program as permitted behaviors of a state machine. The new syntax is just one more hurdle.

Why we will forever suffer from missing timeouts, TTLs, and queue size bounds

If you’ve operated a software service, you will have inevitably hit one of the following problems:

A network call with a missing timeout.  Some kind of remote procedure call or other networking call is blocked waiting … forever, because there’s no timeout configured on the call.

Missing time-to-live (TTL). Some data that was intended to be ephemeral did not explicitly have a TTL set on it, and it didn’t get removed by normal means, and so its unexpected presence bit you.

A queue with no explicit size limit. A queue somewhere doesn’t have an explicitly configured upper bound on its size, and somehow the producers are consistently outnumbering the consumers, and then your queue eventually grows to some size that you never expected to happen.

Unfortunately, the only good solution to these problems is diligence. We have to remember to explicitly set timeouts, TTLs, and queue sizes. there are two reasons:

It’s impossible for a library author to define a reasonable default for these values. Appropriate timeouts, TTLs, and queue sizes will vary enormously from one use case to another, there simply isn’t a “reasonable” value to pick without picking one so large that it’s basically the same as being unbounded.

Forcing users to always specify values is a lousy user experience for new users. Library authors could make these values required rather than optional. However, this makes it more annoying for new users of these libraries, it’s an extra step that forces them to make a decision they don’t really want to think about. They probably don’t even know what a reasonable value is when they’re first setting out.

I think forcing users to specify these limits would lead to more robust software, but I can see many users complaining about being forced to set these limits rather than defaulting to infinity. At least, that’s my guess about why library authors don’t do it. 

The Tortoise and the Hare in TLA+

Problem: Determine if a linked list contains a cycle, using O(1) space.

Robert Floyd invented an algorithm for solving this problem in the late 60s, which is called “The Tortoise and the Hare”[1].

(This is supposedly a popular question to ask in technical interviews. I’m not a fan of expecting interviewees to re-invent the algorithms of famous computer scientists on the spot).

As an exercise, I implemented the algorithm in TLA+, using PlusCal.

The algorithm itself is very simple: the hardest part was deciding how to model linked lists in TLA+. We want to model the set of all linked lists, to express that algorithm should work for any element of the set. However, the model checker can only work with finite sets. The typical approach is to do something like “the set of all linked lists which contain up to N nodes”, and then run the checker against different values of N.

What I ended up doing was generating N nodes, giving each node a successor

(a next element, which could be the terminal value NIL), and then selecting

the start of the list from the set of nodes:


ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

start \in Nodes
succ \in [Nodes -> Nodes \union {NIL}]

(Note: I originally defined NIL as a constant, but this is incorrect, see the comment by Ron Pressler for more details).
This is not the most efficient way from a model checker point of view, because the model checker will generate nodes that are irrelevant because they aren’t in the list. However, it does generate all possible linked lists up to length N.

Superficially, this doesn’t look like the pointer-and-structure linked list you’d see in C, but it behaves the same way at a high level. It’s possible to model a memory address space and pointers in TLA+, but I chose not to do so.

In addition, the nodes of a linked list typically have an associated value, but Floyd’s algorithm doesn’t use this value, so I didn’t model it.

Here’s my implementation of the algorithm:

EXTENDS Naturals


ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

--fair algorithm TortoiseAndHare

    start \in Nodes,
    succ \in [Nodes -> Nodes \union {NIL}],
    cycle, tortoise, hare, done;
h0: tortoise := start;
    hare := start;
    done := FALSE;
h1: while ~done do
        h2: tortoise := succ[tortoise];
            hare := LET hare1 == succ[hare] IN
                    IF hare1 \in DOMAIN succ THEN succ[hare1] ELSE NIL;
        h3: if tortoise = NIL \/ hare = NIL then
                cycle := FALSE;
                done := TRUE;
            elsif tortoise = hare then
                cycle := TRUE;
                done := TRUE;
            end if;
    end while;

end algorithm

I wanted to use the model checker to verify the the implementation was correct:

PartialCorrectness == pc="Done" => (cycle <=> HasCycle(start))

(See the Correctness wikipedia page for why this is called “partial correctness”).

To check correctness, I needed to implement my HasCycle operator (without resorting to Floyd’s algorithm). I used the transitive closure of the successor function for this, which is called TC here. If the transitive closure contains the pair (node, NIL), then the list that starts with node does not contain a cycle:

HasCycle(node) == LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
                  IN <<node, NIL>> \notin TC(R)

The above definition is “cheating” a bit, since we’ve defined a cycle by what it isn’t. It also wouldn’t work if we allowed infinite lists, since those don’t have cycles nor do they have NIL nodes.

Here’s a better definition: a linked list that starts with node contains a cycle if there exists some node n that is reachable from node and from itself.

HasCycle2(node) ==
  LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
  IN \E n \in Nodes : /\ <<node, n>> \in TC(R) 
                      /\ <<n, n>> \in TC(R)

To implement the transitive closure in TLA+, I used an existing implementation

from the TLA+ repository itself:

TC(R) ==
  LET Support(X) == {r[1] : r \in X} \cup {r[2] : r \in X}
      S == Support(R)
      TCR(T) == IF T = {} 
                  THEN R
                  ELSE LET r == CHOOSE s \in T : TRUE
                           RR == TCR(T \ {r})
                       IN  RR \cup {<<s, t>> \in S \X S : 
                                      <<s, r>> \in RR /\ <<r, t>> \in RR}
  IN  TCR(S)

The full model is the lorin/tla-tortoise-hare repo on GitHub.

  1. Thanks to Reginald Braithwaite for the reference in his excellent book Javascript Allongé.  ↩