Making sense of what happened is hard

Scott Nasello recently introduced me to Dr. Hannah Harvey’s The Art of Storytelling. I’m about halfway through her course, I absolutely love it, and I keep thinking about it in the context of learning from incidents. While I have long been an advocate of using narrative structure when writing up incidents, Harvey’s course focuses on oral storytelling, which is a very different sort of format.

With that in mind, I went back to an operational surprise that happened on my team a few months ago, intending to use it as raw material for an oral story. But as I reflected on it and re-read my own (lengthy) writeup, I realized that there was one thing I didn’t fully understand about what happened.

During the operational surprise, when we attempted to remediate the problem by deploying a potential fix into production, we hit a latent bug that had been merged into the main branch ten days earlier. That was the thing I couldn’t explain: how did it come to be that we went ten days without promoting that code from the main branch of our repo to the production environment?

To help me make sense of what happened, I drew a diagram of the development events that led up to the surprise. Fortunately, I had documented those events thoroughly in the original writeup. Here’s the diagram I created. I used this diagram to get some insight into how bug T2, which was merged into our repo on day 0, did not manifest in production until day 10.

This diagram will take some explanation, so bear with me.

There are four bugs in this story, denoted T1, T2, A1, and A2. The letters indicate the functionality associated with the PR that introduced them:

  • T1 and T2 were both introduced in a pull request (PR) that refactored some functionality related to how our service interacts with Titus.
  • A1 and A2 were both introduced in a PR that added functionality around artifact metadata.

Note that bug T1 masked T2, and bug A1 masked A2.

There are three vertical lines, which show how the bugs propagated to different environments.

  • main (repo) represents code in the main branch of our repository.
  • staging represents code that has been deployed to our staging environment.
  • prod represents code that has been deployed to our production environment.

Here’s how the colors work:

  • gray indicates that the bug is present in an environment, but hasn’t been detected
  • red indicates that the effect of a bug has been observed in an environment. Note that if we detect a bug in the prod environment, that also tells us that the bug is in staging and the repo.
  • green indicates the bug has been fixed

If an environment’s line is red, that means there’s a known bug in that environment. For example, when we detect bug T1 in prod on day 1, all three lines go red, since we know we have a bug.

An environment line that is purple means that we’ve pinned that environment to a specific version. We unpinned prod on day 10 before we deployed.

The thing I want to call out in this diagram is the color of the staging line. Once the staging line turns red on day 2, it doesn’t turn black again until day 5, which is the Saturday of a long weekend, and then it turns red again on the Monday of the long weekend. (Yes, some people were doing development on the Saturday and testing in staging on the Monday, even though it was a long weekend. We don’t commonly work on weekends; that’s a different part of the story.)

During this ten day period, there was only a brief time when staging was in a state we thought was good, and that was over a weekend. Since we don’t deploy on weekends unless prod is in a bad state, it makes sense that we never deployed from staging to prod until day 10.

The larger point I want to make here is that getting this type of insight from an operational surprise is hard, in the sense that it takes a lot of effort. Even though I put in the initial effort to capture the development activity leading up to the surprise when I first did the writeup, I didn’t gain the above insight until months later, when I tried to understand this particular aspect of it. I had to ask a specific question (how did that bug stay latent for so long?), take the raw material of my original writeup, and do some diagramming to visualize the pattern of activity so I could understand it. In retrospect, it was worth it. I got a lot more insight here than “root cause: latent bug”.

Now I just need to figure out how to tell this as a story without the benefit of a diagram.

Conscription devices, boundary objects, and GDocs

In the paper Flexible Sketches and Inflexible Data Bases: Visual Communication, Conscription Devices, and Boundary Objects in Design Engineering, Kathryn Henderson writes about the role that sketches and drawings play in the work of mechanical engineers.

A conscription device is something that can be used to help recruit other people to get involved in a task. Mechanical engineers collaborate using diagrams, and these diagrams play such a strong role that the engineers find they can’t work effectively without them. From the paper:

If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.

A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which means something different to the designer (support structure) than it does to someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor):

Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop. If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.

Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.

A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.

This sort of Google doc does function as both a conscription device and a boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.

I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system would be like, and so I got feedback from different types of stakeholders.

My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. We probably don’t think of Google Docs as a software engineering tool. But it can be an extremely powerful one.

Coding as a tool of thought

With apologies to Ken Iverson.

Architects draw detailed plans before a brick is laid or a nail is hammered. Programmers and software engineers don’t. Can this be why houses seldom collapse and programs often crash?

Blueprints help architects ensure that what they are planning to build will work. “Working” means more than not collapsing; it means serving the required purpose. Architects and their clients use blueprints to understand what they are going to build before they start building it.

But few programmers write even a rough sketch of what their programs will do before they start coding.

Leslie Lamport, Why We Should Build Software Like We Build Houses

“My instinct is to go right to the board. I’m very graphic oriented. I can’t talk more than ten minutes without I [sic] start drawing pictures when we’re talking about the things I do. Even if I’m talking sports, I invariably start diagramming what’s going on. I feel comfortable with it or find it very effective.”

This designer, like the newly promoted engineer at Selco who fought to get her drafting board back, is pointing out his dependence on the visual process, of drawing both to communicate and to think out the initial design. He also states that the visual and manual thought process of drawing precedes the formulation of the written specifications for the project. Like the Selco designers and those at other sites, he emphasized the importance of drawing processes to work out ideas. [emphasis added]

Kathryn Henderson, On Line and On Paper: Visual Representations, Visual Culture, and Computer Graphics in Design Engineering

The quote above by Kathryn Henderson illustrates how mechanical engineers use the act of drawing to help them work on the design problem. By generating sketches and drawings, they develop a better understanding of the problem they are trying to solve. They use drawing as a tool to help them think, to work through the problem.

As software engineers, we don’t work in a visual medium in the way that mechanical engineers do. And yet, we also use tools to help us think through the problem. It just so happens that the tool we use is code. I can’t speak for other developers, but I certainly use the process of writing code to develop a deeper understanding of the problem I’m trying to solve. As I solve parts of the problem with code, my mental model of the problem space and solution space develops throughout the process.

I think we use coding this way (I certainly do), because it feels to me like the fastest way to evolve this knowledge. If I had some sort of equivalent mechanism for sketching that was faster than coding for developing my understanding, I’d use it. But I know of no mechanism that’s actually faster than coding that will let me develop my understanding of the solution I’m working on. It just so happens that the quickest solution, code, is the same medium as the artifact that will ultimately end up in production. A mechanical engineer can never ship their sketches, but we can ship our code.

And this is a point that I think Leslie Lamport misses. I’m personally familiar with a number of different techniques for modeling in software, including TLA+, Alloy, statecharts, and decision tables. I’ve used them all; they are excellent tools for reasoning about the complexity of the system. But none of these tools really fulfill the role that sketching does for mechanical engineers (although Alloy’s fast feedback for incremental model building is a nice step in this direction).

TLA+ in particular is a wonderful tool. I’ve used it for things like understanding the linearizability paper, beating the CAP theorem, finding cycles in linked lists, solving the river crossing problem, and proving leftpad. But if you’re looking for an analogy in mechanical engineering, TLA+ is much closer to finite element analysis than it is to sketches.

Developers jump to coding not because they are sloppy, but because they have found it to be the most effective tool for sketching, for thinking about the problem and getting quick feedback as they construct their solution. And constructing a representation to develop a better understanding, using the best tools available to get quick feedback, is exactly what engineers do.

Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.

And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.

The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.

However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!

So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, doing this directly would require a huge code change because of an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script that relinks the intermediate object files, avoiding the need to change the code. And it works!

But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.

For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on wifi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!

There are further workarounds they can do to save space, like replacing structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit: the risk is catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.

The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment. They remove the old version of the app. They call this the “Yolo” release strategy.

They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this decision will have a dollar cost in the eight figures. With only a week to go before release, they drop iOS 8 and eat the cost in order to get under the cellular download limit.

The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the projected slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.

And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective C.

An Uber engineer in the Amsterdam office comes up with an innovative workaround: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.

And yet, the risk of decompensation is ever-present: the strike team worries that their space-saving wins will not be able to keep pace with the growth of the application.

Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.


The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread; I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.

I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:

Software as a limited medium

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

Fred Brooks, The Mythical Man-Month

We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.

And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.

Software is discrete in a way that the world isn’t

We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.

Software as a medium, and data storage in particular, encourages over-simplification of the world, because we need to categorize our data: figure out which tables to store it in and what values its columns should have. So many things in the world just aren’t easy to model well like that.

As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often "it depends: why do you want to know?" But that’s not what software as a medium encourages. Instead, we pick a definition of "up", implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.
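To make that concrete, here’s a minimal sketch of what baking in a definition of "up" might look like. The cluster shape and the 80% threshold here are invented for illustration, not taken from any real deployment system:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    desired_instances: int
    healthy_instances: int

def is_up(cluster: Cluster) -> bool:
    """One baked-in definition of "up": at least 80% of desired instances are healthy.

    A deploy tool, a billing system, and an on-call engineer might each want a
    different definition, but the code has to commit to one.
    """
    if cluster.desired_instances == 0:
        return False
    return cluster.healthy_instances / cluster.desired_instances >= 0.8

# "Up" by this definition, even though two instances are unhealthy.
print(is_up(Cluster(desired_instances=10, healthy_instances=8)))  # True
```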

And so, just like all engineers, given our time and resource constraints, we have to make over-simplifications to get our work done. William Kent wrote a whole book on this topic called Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World (h/t Hillel Wayne).

Software systems are limited in how they integrate inputs

In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when the cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road and counting the cars as they go by.

As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.
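Here’s a toy sketch of Jackson’s car-counting problem to make the point concrete. The event format (a time-ordered stream of "front" and "rear" sensor pulses) is my own simplification, not something from the book:

```python
# The program's entire view of the world is two pulse streams. A human at the
# roadside could also notice a bicycle, a car reversing over the sensor, or a
# truck towing a trailer; this code cannot.

def count_cars(events):
    """Count a car each time a front-sensor pulse is followed by a rear-sensor pulse."""
    cars = 0
    waiting_for_rear = False
    for sensor in events:
        if sensor == "front":
            waiting_for_rear = True
        elif sensor == "rear" and waiting_for_rear:
            cars += 1
            waiting_for_rear = False
    return cars

print(count_cars(["front", "rear", "front", "rear"]))  # 2
```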

A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”

Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.
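A minimal sketch of what that hard-coding looks like for alerting (the signal names and thresholds are hypothetical): the rule can only consult the inputs we wired in ahead of time, and it has no way to notice that the operator is already investigating.

```python
def should_page(error_rate: float, duration_s: int) -> bool:
    """Page the on-call engineer if the error rate exceeds 5% for at least five minutes."""
    return error_rate > 0.05 and duration_s >= 300

# Pages even if the on-call engineer is already discussing the problem in Slack;
# that signal simply isn't an input to the decision.
print(should_page(error_rate=0.08, duration_s=600))  # True
```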

Recognizing the limits of software

The lesson here is that there are limits to how well a software system can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate large input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need to have humans-in-the-loop to make up for these sorts of shortcomings.

The goal is not just to build better software systems, but to build better joint cognitive systems that are made up of humans and software together.

Top-down code reviews

I’ve long been frustrated by the task of code reviews. Often, the pull request (PR) I’m reviewing involves a part of the codebase I’m not intimately familiar with. I read it, not quite understanding it, looking to see if I can offer some sort of useful feedback, and typically that feedback ends up being on the micro level (e.g., “you can simplify this function by calling this other library function instead”).

I recently started experimenting with a new review approach that I’m going to call top-down code review. Here’s how it works: I start by understanding the code well enough that I can write my own version of the pull request message, describing the PR in my own words. Only after I’ve done this do I provide feedback.

I call this approach “top-down” because the review that I end up generating starts with a “top-down” description of the PR: the problem it’s trying to solve, and the solution approach, before diving into describing notable implementation details. Here are the reviews I’ve done in this style so far:

I’ve been finding this approach useful because it forces me to come to terms with how well I really understand the PR. If I can’t explain the PR in my own words, then I don’t really understand it. It also helps me figure out what questions to ask the original author to help clarify things for me.

I also get more of a sense of closure after doing the review. Even if I had no feedback to give, I understand the changes in a way that I didn’t before.

Battleshorts, exaptations, and the limits of STAMP

A couple of threads got me thinking about the limits of STAMP.

The first thread was sparked by a link to a Hacker News comment, sent to me by a colleague of mine, Danny Thomas. It introduced me to a concept I hadn’t heard of before: a battleshort. There’s even an official definition in a NATO document:

The capability to bypass certain safety features in a system to ensure completion of the mission without interruption due to the safety feature

AOP-38, Allied Ordnance Publication 38, Edition 3, Glossary of terms and definitions concerning the safety and suitability for service of munitions, explosives and related products, April 2002.

The second thread was sparked by a Twitter exchange between a UK Southern Railway train driver and the official UK Southern Railway twitter account:

This is a great example of exaptation, a concept introduced by the paleontologists Stephen Jay Gould and Elisabeth Vrba. Exaptation is a solution to the following problem in evolutionary biology: what good is a partially functional wing? Either an animal can fly or it can’t, and a fully functional wing can’t evolve in a single generation, so how do the initial evolutionary stages of a wing confer advantage on the organism?

The answer is that while a partially functional wing might be useless for flight, it might still be useful as a fin. And so, if wings evolved from fins, then the appendage could confer an advantage at each evolutionary stage. The fin is exapted into a wing; it is repurposed to serve a new function. In the Twitter example above, the railway driver repurposed a social media service for communicating with his own organization.

Which brings us back to STAMP. One of the central assumptions of STAMP is that it is possible to construct an accurate enough control model of the system at the design stage to identify all of the hazards and unsafe control actions. You can see this assumption in action in the CAST handbook (CAST is STAMP’s accident analysis process) in the example questions from page 40 of the handbook (emphasis mine), which use counterfactual reasoning to try to identify flaws in the original hazard analysis.

Did the design account for the possibility of this increased pressure? If not, why not? Was this risk assessed at the design stage?

This seems like a predictable design flaw. Was the unsafe interaction between the two requirements (preventing liquid from entering the flare and the need to discharge gases to the flare) identified in the design or hazard analysis efforts? If so, why was it not handled in the design or in operational procedures? If it was not identified, why not?

Why wasn’t the increasing pressure detected and handled? If there were alerts, why did they not result in effective action to handle the increasing pressure? If there were automatic overpressurization control devices (e.g., relief valves), why were they not effective? If there were not automatic devices, then why not? Was it not feasible to provide them?

Was this type of pressure increase anticipated? If it was anticipated, then why was it not handled in the design or operational procedures? If it was not anticipated, why not?

Was there any way to contain the contents within some controlled area (barrier), at least the catalyst pellets?

Why was the area around the reactor not isolated during a potentially hazardous operation? Why was there no protection against catalyst pellets flying around?

This line of reasoning assumes that all hazards are, in principle, identifiable at the design stage. I think that phenomena like battleshorts and exaptations make this goal unattainable.

Now, in principle, nothing prevents an engineer using STPA (STAMP’s hazard analysis technique) from identifying scenarios that involve battleshorts and exaptations. After all, STPA is an exploratory technique. But I suspect that many of these kinds of adaptations are literally unimaginable to the designers.

SRE, CSE, and the safety boundary

Site reliability engineering (SRE) and cognitive systems engineering (CSE) are two fields seeking the same goal: helping to design, build, and operate complex, software-intensive systems that stay up and running. They both worry about incidents and human workload, and they both reason about systems in terms of models. But their approaches are very different, and this post is about exploring one of those differences.

Caveat: I believe that you can’t really understand a field unless you either have direct working experience, or you have observed people doing work in the field. I’m not a site reliability engineer or a cognitive systems engineer, nor have I directly observed SREs or CSEs at work. This post is an outsider’s perspective on both of these fields. But I think it holds true to the philosophies that these approaches espouse publicly. Whether it corresponds to the actual day-to-day work of SREs and CSEs, I will leave to the judgment of the folks on the ground who actually do SRE or CSE work.

A bit of background

Site reliability engineering was popularized by Google, and continues to be strongly associated with the company. Google has published three O’Reilly books, the first one in 2016. I won’t say any more about the background of SRE here, but there are many other sources (including the Google books) for those who want to know more about the background.

Cognitive systems engineering is much older, tracing its roots back to the early eighties. If SRE is, as Ben Treynor described it, what happens when you ask a software engineer to design an operations function, then CSE is what happens when you ask a psychologist how to prevent nuclear meltdowns.

CSE emerged in the wake of the Three Mile Island accident of 1979, where researchers were trying to make sense of how the accident happened. Before Three Mile Island, research on "human factors" aspects of work had focused on human physiology (for example, designing airplane cockpits), but after TMI the focus expanded to include cognitive aspects of work. The two researchers most closely associated with CSE, Erik Hollnagel and David Woods, were both trained as psychology researchers: their paper Cognitive Systems Engineering: New wine in new bottles marks the birth of the field (Thai Wood covered this paper in his excellent Resilience Roundup newsletter).

CSE has been applied in many different domains, but I think it would be unknown in the "tech" community were it not for the tireless efforts of John Allspaw to popularize the results of CSE research that has been done in the past four decades.

A useful metaphor: Rasmussen’s dynamic safety model

Jens Rasmussen was a Danish safety researcher whose work remains deeply influential in CSE. In 1997 he published a paper titled Risk management in a dynamic society: a modelling problem. This paper introduced the metaphor of the safety boundary, as illustrated in the following visual model, which I’ve reproduced from this paper:

Rasmussen viewed a safety-critical system as a point that moves inside of a space enclosed by three boundaries.

At the top right is what Rasmussen called the "boundary to economic failure". If the system crosses this boundary, then the system will fail due to poor economic performance. We know that if we try to work too quickly, we sacrifice safety. But we can’t work arbitrarily slowly to increase safety, because then we won’t get anything done. Management naturally puts pressure on the system to move away from this boundary.

At the bottom right is what Rasmussen called the "boundary of unacceptable work load". Management can apply pressure on the workforce to work both safely and quickly, but increasing safety and increasing productivity both require effort on the part of practitioners, and there are limits to the amount of work that people can do. Practitioners naturally put pressure on the system to move away from this boundary.

At the left, the diagram has two boundaries. The outer boundary is what Rasmussen called the "boundary of functionally acceptable performance", what I’ll call the safety boundary. If the system crosses this boundary, an incident happens. We can never know exactly where this boundary is. The inner boundary is labelled "resulting perceived boundary of acceptable performance". That’s where we think the boundary is, and it’s what we try to stay away from.

SRE vs CSE in context of the dynamic safety model

I find the dynamic safety model useful because I think it illustrates the difference in focus between SRE and CSE.

SRE focuses on two questions:

  1. How do we keep the system away from the safety boundary?
  2. What do we do once we’ve crossed the boundary?

To deal with the first question, SRE thinks about issues such as how to design systems and how to introduce changes safely. The second question is the realm of incident response.

CSE, on the other hand, focuses on the following questions:

  1. How will the system behave near the system boundary?
  2. How should we take this boundary behavior into account in our design?

CSE focuses on the space near the boundary, both to learn how work is actually done, and to inform how we should design tools to better support this work. In the words of Woods and Hollnagel:

> Discovery is aided by looking at situations that are near the margins of practice and when resource saturation is threatened (attention, workload, etc.). These are the circumstances when one can see how the system stretches to accommodate new demands, and the sources of resilience that usually bridge gaps. – Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, p37

Fascinatingly, CSE has also identified common patterns of system behavior at the boundary that hold across multiple domains. But that will have to wait for a different post.

Reading more about CSE

I’m still a novice in the field of cognitive systems engineering. I’m actually using these posts to help learn through explaining the concepts to others.

The source I’ve found most useful so far is the book Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, which is referenced in this post. If you prefer videos, Cook’s Lectures on the study of cognitive work is excellent.

I’ve also started a CSE reading list.

How did software get so reliable without proof?

In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper with the title How Did Software Get So Reliable Without Proof? In this paper, Hoare grapples with the observation that software seems to be more reliable than computer science researchers expected was possible without the use of mathematical proofs for verification (emphasis added):

Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …

So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever been subjected to the rigours of formal proof?

Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.

Looking back on this paper, what strikes me is the absence of acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added):

The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.

The problem with the question “How did software get so reliable without proof?” is that it’s asking the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you are doomed to not even be able to ask the right questions.

Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

John Graham-Cumming, Cloudflare’s CTO, wrote a detailed writeup of a Cloudflare incident that happened on 2019-07-02. Here’s a categorization similar to the one I did for the Stripe outage.

Note that Graham-Cumming has a “What went wrong” section in his writeup where he explicitly enumerates 11 different contributing factors; I’ve sliced things a little differently here: I’ve taken some of those verbatim, reworded some of them, and left out some others.

All quotes from the original writeup are in italics.

Contributing factors

Remember not to think of these as “causes” or “mistakes”. They are merely all of the things that had to be true for the incident to manifest, or for it to be as severe as it was.

Regular expression led to catastrophic backtracking

A regular expression used in a firewall engine rule resulted in catastrophic backtracking:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Simulated rules run on same nodes as enforced rules

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

Failure mode prevented access to internal services

But getting to the global WAF [web application firewall] kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel … And we couldn’t get to other internal services like Jira or the build system.

Security feature disables credentials when an internal operator interface isn’t used frequently

[O]nce we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently

Bypass mechanisms not frequently used

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

WAF changes are deployed globally

The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats … Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

The fact that WAF changes can only be done globally exacerbated the incident by increasing the size of the blast radius.

WAF implemented in Lua, which uses PCRE

Cloudflare makes use of Lua extensively in production … The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression.

The regular expression engine being used didn’t have complexity guarantees.

Based on the writeup, it sounds like they used the PCRE regular expression library because PCRE is the regex library that ships with Lua, and Lua is the language they use to implement the WAF.

Protection accidentally removed by a performance improvement refactor

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

Cloudflare dashboard and API are fronted by the WAF

Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

Mitigators

Paging alert quickly identified a problem

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Engineers recognized high severity based on alert pattern

This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

Existence of a kill switch

At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

Risks

Declarative program performance is hard to reason about

Regular expressions are examples of declarative programs (SQL is another good example of a declarative programming language). Declarative programs are elegant because you can specify what the computation should do without needing to specify how the computation can be done.

The downside is that it’s impossible to look at a declarative program and understand the performance implications, because there isn’t enough information in a declarative program to let you know how it will be executed! You have to be familiar with how the interpreter/compiler works to understand the performance implications of a declarative program. Most programmers probably don’t know how regex libraries are implemented.
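As a toy illustration (not Cloudflare’s actual rule), here’s the classic pathological pattern on a backtracking engine like Python’s re module. Nothing in the pattern’s appearance hints at the exponential blowup:

```python
import re
import time

# Nested quantifiers force a backtracking engine to try roughly 2^n ways of
# splitting the 'a's before concluding there is no match.
pattern = re.compile(r"(a+)+$")

for n in (10, 20, 24):
    text = "a" * n + "!"   # almost matches, which is the worst case
    start = time.perf_counter()
    pattern.match(text)
    print(f"n={n}: {time.perf_counter() - start:.4f}s")

# Each additional 4 characters multiplies the running time by roughly 16; a
# linear-time engine (e.g. RE2) would handle the same input in microseconds.
```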

Simulating in production environment

For rule-based systems, it’s enormously valuable for an engineer to be able to simulate what effect the rules will have before they’re put into effect, as it is generally impossible to reason about their impacts without doing simulation.

The more realistic the simulation is, the more confidence we have that the results of the simulation will correspond to the actual results when the rules are enabled in production.

However, there is always a risk of doing the simulation in the production environment, because the simulation is a type of change, and all types of change carry some risk.

Severe outages happen infrequently

[S]ome members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently.

To get to [internal services] we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

The irony is that when we only encounter severe outages infrequently, we don’t have the opportunity to exercise the muscles we need to use when these outages do happen.

Large blast radius

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

In the future, it sounds like non-emergency rule changes will be staged at Cloudflare. But the functionality will still exist for global changes, because it needs to be there for emergency rule changes. They can reduce the number of changes that need to be pushed globally, but they can’t drive it down to zero. This is an inevitable risk tradeoff.

Questions

Why hasn’t this happened before?

You’re not generally supposed to ask “why” questions, but I can’t resist this one. Why did it take so long for this failure mode to manifest? Hadn’t any of the engineers at Cloudflare previously written a rule that used a regex with pathological backtracking behavior? Or was it that, until the refactor removed it, the protection against excessive CPU load had shielded them from this failure mode?

What was the motivation for the refactor?

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

What was the reason they were trying to make the WAF use less CPU? Were they trying to reduce the cost by running on fewer nodes? Were they just trying to run cooler to reduce the risk of running out of CPU? Was there some other rationale?

What’s the history of WAF rule deployment implementation?

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout. [emphasis added]

The WAF rule system is designed to support quickly pushing out rules globally to protect against new attacks. However, not all of the rules require quick global deployment. Yet, this was the only mechanism that the WAF system supported, even though code changes support staged rollout.

The writeup simply mentions this as a contributing factor, but I’m curious as to how the system came to be that way. For example, was it originally designed with only quick rule deployment in mind? Were staged code deploys introduced into Cloudflare only after the WAF system was built?

Other interesting notes

The WAF rule update was normal work

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. 

Based on the writeup, it sounds like this was a routine change. It’s important to keep in mind that incidents often occur as a result of normal work.

Multiple sources of evidence to diagnose CPU issue

The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble.

It was interesting to read how they triangulated on high CPU usage using multiple data sources.

Normative language in the writeup

Emphasis added in bold.

We know how much this hurt our customers. We’re ashamed it happened.

The rollback plan required running the complete WAF build twice taking too long.

The first alert for the global traffic drop took too long to fire.

We didn’t update our status page quickly enough.

Normative language is one of the three analytical traps in accident investigation. If this was an internal writeup, I would avoid the language criticizing the rollback plan, the alert configuration, and the status page decision, and instead I’d ask questions about how these came to be, such as:

  • Was this the first time the rollback plan was carried out? (If so, that may explain why it wasn’t known how long it would take.)
  • Is the global traffic drop alert configuration typical of the other alerts, or different? If it’s different (i.e., do other alerts fire faster?), what led to it being different? If it’s similar to other alert configurations, that would explain why it was configured to be “too long”.

Work to reduce CPU usage contributed to excessive CPU usage

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

These sorts of unintended consequences are endemic when making changes within complex systems. It’s an important reminder that the interventions we implement to prevent yesterday’s incidents from recurring may contribute to completely new failure modes tomorrow.