The carefulness knob

A play in one act

Dramatis personae

  • EM, an engineering manager
  • TL, the tech lead for the team
  • X, an engineering manager from a different team

Scene 1: A meeting room in an office. The walls are adorned with whiteboards with boxes and arrows.

EM: So, do you think the team will be able to finish all of these features by the end of Q2?

TL: Well, it might be a bit tight, but I think it should be possible, depending on where we set the carefulness knob.

EM: What’s the carefulness knob?

TL: You know, the carefulness knob! This thing.

TL leans over and picks a small box off of the floor and places it on the table. The box has a knob on it with numerical markings.

EM: I’ve never seen that before. I have no idea what it is.

TL: As the team does development, we have to make decisions about how much effort to spend on testing, how closely to hew to explicitly documented processes, that sort of thing.

EM: Wait, aren’t you, like, careful all of the time? You’re responsible professionals, aren’t you?

TL: Well, we try our best to allocate our effort based on what we estimate the risk to be. I mean, we’re a lot more careful when we do a database migration than when we fix a typo in the readme file!

EM: So… um… how good are you at actually estimating risk? Wasn’t that incident a few weeks ago related to a change that was considered low risk at the time?

TL: I mean, we’re pretty good. But we’re definitely not perfect. It certainly happens that we misjudge the risk sometimes. I mean, isn’t every incident in some sense a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.

EM: OK, so let’s turn that carefulness knob up to the max, to make sure that the team is as careful as possible. I don’t want any incidents!

TL: Sounds good to me! Of course, this means that we almost certainly won’t have these features done by the end of Q2, but I’m sure that the team will be happy to hear…

EM: What, why???

TL picks up a marker off of the table and walks up to the whiteboard. She draws an x-axis and a y-axis. She labels the x-axis “carefulness” and the y-axis “estimated completion time”.

TL: Here’s our starting point: the carefulness knob is currently set at 5, and we can probably hit the end of Q2 if we keep it at this setting.

EM: What happens if we turn up the knob?

TL draws an exponential curve.

EM: Woah! That’s no good. Wait, if we turn the carefulness knob down, does that mean that we can go even faster?

TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. That would significantly increase the probability of incidents, and incidents end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.

EM: But won’t we also have more incidents at a carefulness setting of 5 than at higher carefulness settings?

TL: Yes, there’s definitely more of a risk that a change that we incorrectly assess as low risk ends up biting us at our default carefulness level. It’s a tradeoff we have to make.

EM: OK, let’s just leave the carefulness knob at the default setting.


Scene 2: An incident review meeting, two and a half months later.

X: We need to be more careful when we make these sorts of changes in the future!

Fin


Coda

It’s easy to forget that there is a fundamental tradeoff between how careful we can be and how much time it will take us to perform a task. This is known as the efficiency-thoroughness trade-off, or ETTO principle.

You’ve probably hit a situation where it’s particularly difficult to automate the test for something, and the manual testing is time-intensive: you developed the feature and tested it, but then there was a small issue that you needed to resolve, and now, do you go through all of the manual testing again? We make these sorts of time tradeoffs in the small. They’re individual decisions, but they add up, and we’re always under schedule pressure to deliver.

As a result, we try our best to adapt to the perceived level of risk in our work. The Human and Organizational Performance folks are fond of the visual image of the black line versus the blue line to depict the difference between how the work is supposed to be done and how workers adapt to get their work done.

But sometimes these adaptations fail. And when this happens, inevitably someone says “we need to be more careful”. But imagine if you explicitly asked that person at the beginning of a project where they wanted to set that carefulness knob, and they had to accept that increasing the setting would increase the schedule significantly. If an incident happened, you could then say to them, “well, clearly you set the carefulness knob too low at the beginning of this project”. Nobody wants to explicitly make the tradeoff between being less careful and having a time estimate that’s seen as excessive. And so the tradeoff gets made implicitly. We adapt as best we can to the risk. And we do a pretty good job at that… most of the time.

If you don’t examine what worked, how will you know what works?

This is one of my favorite bits from fellow anglophone Québécois Norm Macdonald:

[Video: Norm Macdonald, not a lung expert]

One of the goals I believe that we all share for post-incident work is to improve the system. For example, when I wrote the post Why I don’t like discussing action items during incident reviews, I understood why people would want to focus on action items: precisely because they share this goal of wanting to improve the system. (As a side note, Chris Evans of incident.io wrote a response: Why I like discussing action items in incident reviews.) However, what I want to write about here is not the discussion of action items, but focusing on what went wrong versus what went right.

“How did things go right?”

“How did things go right?” is a question originally posed by the safety researcher Erik Hollnagel, in the safety paradigm that he calls Safety-II. The central idea is that things actually go right most of the time, and if you want to actually improve the system, you need to get a better understanding of how the system functions, which means you need to broaden your focus beyond the things that broke.

You can find an approachable introduction to Safety-II concepts in the EUROCONTROL white paper From Safety-I to Safety-II. Hollnagel’s ideas have been very influential in the resilience engineering community. As an example, check out my former colleague Ryan Kitchens’s talk at SREcon Americas 2019: How Did Things Go Right? Learning More from Incidents.

It’s with this how did things go right lens that I want to talk a little bit about incident review.

Beyond “what went well”

Now, in most incident writeups that I’ve read, there is a “what went well” section. However, it’s typically the smallest section in the writeup, with maybe a few bullet points: there’s never any real detail there.

Personally, I’m looking for details like how an experienced engineer recognized the symptoms enough to get a hunch about where to look next, reducing the diagnostic time by hours. Or how engineers leveraged an operational knob that was originally designed for a different purpose. I want to understand how experts are able to do the work of effectively diagnosing problems, mitigating impact, and remediating the problem.

Narrowly, I want to learn this because I want to get this sort of working knowledge into other people’s heads. More broadly, I want to bring to light the actual work that gets done.

We don’t know how the system works

Safety researchers make a distinction between work-as-imagined and work-as-done. We think we understand how the day-to-day work gets done, but we actually don’t. Not really. To take an example from software, we don’t actually know how people really use the tooling to get their work done, and I can confirm this from having been on-call for internal development-tools support in previous jobs. (“You’re using our tool to do what?” is not an uncommon reaction from the on-call person). People do things we never imagined, in both wonderful and horrifying ways (sometimes at the same time!).

We also don’t see all of the ways that people coordinate to get their work done. There are the meetings, the Slack messages, the comments on the pull requests, but there’s also the shared understanding, the common knowledge, the stuff that everybody knows that everybody else knows, that enables people to get this work done, while reducing the amount of explicit communication that has to happen.

What’s remarkable is that these work patterns, well, they work. These people in your org are able to get their stuff done, almost all of the time. Some of them may exhibit mastery of the tooling, and others may use the tooling in ways it was never intended, in ways that are fundamentally unsafe. But we’re never going to know unless we actually look at how they’re doing their work.

Because how people do their work is how the system works. And if we’re going to propose and implement interventions, it’s very likely that the outcomes of the interventions will surprise us, because these changes might disrupt effective ways of doing work, and people will adapt to those interventions in ways we never anticipated, and in ways we may never even know if we don’t take a look.

Then why use incidents to look at things that go right?

At first glance, it does seem odd to use incidents as the place to examine where work goes well, given that incidents are times when something unquestionably went wrong. It would be wonderful if we could study how work happens when things are going well. Heck, I’d love to see companies have sociologists or anthropologists on staff to study how the work happens at the company. Regrettably, though, incidents are one of the only times when the organization is actually willing to devote resources (specifically, time) to examining work in fine-grained detail.

We can use incidents to study how things go well, but we have to keep a couple of things in mind. First, we need to recognize that the adaptations that failed and led to an incident are usually successful, which is why people developed those adaptations in the first place. Note that just because an adaptation usually works doesn’t mean that it’s a good thing to keep doing: an adaptation could be a dangerous workaround to a constraint, like a third-party system that can’t be changed directly and so must be awkwardly worked around.

Second, we need to look in more detail at, and remark on, incident response that is remarkable. When incident response goes well, there is impressive diagnostic, coordination, and improvisation work involved in getting the system back to healthy. These are the kinds of skills you want to foster across your organization. If you want to build tools to make this work even better, you should take the time to understand just how this work is done today. Keep this in mind when you’re proposing new interventions. After all, if you don’t examine what worked, how will you know what works?

Why I don’t like discussing action items during incident reviews

I’m not a fan of talking about action items during incident reviews.

Judging from the incident review meetings I’ve attended throughout my career, this is a minority view, and I wanted to elaborate here on why I think this way. For more on this topic, I encourage readers to check out John Allspaw’s 2016 blog post entitled Etsy’s Debriefing Facilitation Guide for Blameless Postmortems, as well as the Etsy Debrief Facilitation Guide itself. Another starting point I will shamelessly recommend is Resilience engineering: where do I start?

Incident reviews

First, let’s talk about what an incident review is. It’s a meeting that takes place not too long after an incident has occurred, to discuss the incident. In many organizations, these meetings are open to any employee interested in attending, which means that these can have potentially large and varied audiences.

I was going to write “the goal of an incident review is…” in the paragraph above, but the whole purpose of this post is to articulate how my goals differ from other people’s goals.

My claims

Nobody fully understands how the system works. Once a company reaches a certain size, the software needs to get broken up across different teams. Ideally, the division is such that the teams are able to work relatively independently of each other, with the well-defined abstractions and low coupling that we all prize in large-scale systems. As a consequence, there’s no single person who actually fully understands how the whole system works. It’s just too large and complex. And this actually understates the problem, given the complexity of the platforms we build on top of. Even if I’m the sole developer of a Java application, there’s a good chance that I don’t understand the details of the garbage collection behavior of the JVM I’m using.

The gaps in our understanding of how the system works contribute to incidents. Because we don’t have a full understanding of how the system works, we can’t ever fully reason about the impact of every single change that we make. I’d go so far as to say that, in every single incident, there’s something important that somebody didn’t know. That means that gaps in our understanding are dangerous in addition to being omnipresent.

The way that work is done profoundly affects incidents, both positively and negatively, but that work is mostly invisible. Software systems are socio-technical systems, and the work that the people in your organization do every day is part of how the system works. This day-to-day work enables, triggers, exacerbates, prevents, lessens, and remediates incidents. And sometimes the exact same work in one context will prevent an incident and in another context will enable an incident! However, we generally don’t see what the real work is like. I’m lucky if my teammates have any sense of what my day-to-day work looks like, including how I use the internal tools to accomplish this work. The likelihood that people on other teams know how I do this work is close to zero. Even the teams that maintain the internal tooling have few opportunities to see this work directly.

Incident reviews are an opportunity for many people to gain insight into how the system works. An incident review is an opportunity to examine an aspect of the socio-technical system in detail. It’s really the only meeting of its kind where you can potentially have such a varied cross-section of the company getting into the nitty-gritty details of how things work. Incident reviews give us a flashlight that we get to shine on a dark corner of the system.

The best way to get a better understanding of how the system behaves is to look at how the system actually behaved. This phrasing should sound obvious, but it’s the most provocative of these claims. Every minute you spend discussing action items is a minute you are not spending learning more about how the system behaved. I feel similarly about discussing counterfactuals (if there had been an alert…). These discussions take the focus away from how the system actually behaved, and enter a speculative world about how the system might behave under a different set of circumstances.

We don’t know what other people don’t know. We all have incomplete, out-of-date models of how the system works, and that includes our models of other people’s models! That means that, in general, we don’t know what other people don’t know about the system. We don’t know in advance what people are going to learn that they didn’t know before!

There are tight constraints on incident review meetings. There is a fixed amount of time in an incident review meeting, which means that every minute spent on topic X means one less minute to spend discussing topic Y. Once that meeting is over, the opportunity to bring this group of people together to update their mental models is gone.

Action item discussions are likely to be of interest to a smaller fraction of the audience. This is a very subjective observation, but my theory is that people tend to find that incident reviews don’t have a lot of value precisely because they focus too much of the time on discussing action items, and the details of the proposed action items are of potential interest to only a very small subset of the audience.

Teams are already highly incentivized to implement action items that prevent recurrence. Often I’ll go to an incident review, and there will be mention of multiple action items that have already been completed. As an observer, I’ve never learned anything from hearing about these.

A learning meeting will never happen later, but an action item discussion will. There’s no harm in having an action item discussion in a future meeting. In fact, teams are likely to have to do this anyway when they do their planning work for the next quarter. However, once the incident review meeting is over, the opportunity for having a learning-style meeting is gone, because the org’s attention has moved on to the next thing.

More learning up-front will improve the quality of action items. The more you learn about the system, the better your proposed action items are likely to be. But the reverse isn’t true.

Why not do both learning and action items during an incident review?

Hopefully the claims above address the question of why not do both activities. There’s a finite amount of time in an incident review meeting, which means there’s a fundamental tradeoff between time spent learning and time spent discussing action items, and I believe that devoting the entire time to learning will maximize the return-on-investment of the meeting. I also believe that additional action item discussions are much more likely to be able to happen after the incident review meeting, but that learning discussions won’t.

Why I think people emphasize action items

Here’s my mental model of why people are so keen on emphasizing action items as the outcome of a meeting.

Learning is fuzzy, actions are concrete. An incident review meeting is an expensive meeting for an organization. Action items are a legible outcome of a meeting; they are an indicator to the organization that the meeting had value. The value of learning, of updated mental models, is invisible.

Incidents make orgs uncomfortable and action items reassure them. Incidents are evidence that we are not fully in control of our system, and action items make us feel like this uncomfortable uncertainty has been addressed.

Safety first!

I’m sure you’ve heard the slogan “safety first”. It is a statement of values for an organization, but let’s think about how to define what it should mean explicitly. Here’s how I propose to define safety first, in the context of a company. I’ll assume the company is in the tech (software) industry, since that’s the one I know best. So, in this context, you can think of “safety” as being about avoiding system outages, rather than about, say, avoiding injuries on a work site.

Here we go:


A tech company is a safety first company if any engineer has the ability to extend a project deadline, provided that the engineer judges in the moment that they need additional time in order to accomplish the work more safely (e.g., by following an onerous procedure for making a change, or doing additional validation work that is particularly time-intensive).

This ability to extend the deadline must be:

  1. automatic
  2. unquestioned
  3. consequence-free

Automatic. The engineer does not have to explicitly ask someone else for permission before extending the deadline.

Unquestioned. Nobody is permitted to ask the engineer “why did you extend the deadline?” after-the-fact.

Consequence-free. This action cannot be held against the engineer. For example, it cannot be a factor in a performance review.


Now, anyone who has worked in management would say to me, “Lorin, this is ridiculous. If you give people the ability to extend deadlines without consequence, then they’re just going to use this constantly, even if there isn’t any benefit to safety. It’s going to drastically harm the organization’s ability to actually get anything done”.

And, the truth is, they’re absolutely right. We all work under deadlines, and we all know that if there was a magical “extend deadline” button that anyone could press, that button would be pressed a lot, and not always for the purpose of improving safety. Organizations need to execute, and if anybody could introduce delays, this would cripple execution.

But this response is exactly the reason why safety first will always be a lie. Production pressure is an unavoidable reality for all organizations. Because of this, the system will always push back against delays, and that includes delays for the benefit of safety. This means engineers will always face double binds, where they will feel pressure to execute on schedule, but will be punished if they make decisions that facilitate execution but reduce safety.

Safety is never first in an organization: it’s always one of a number of factors that trade off against each other. And those sorts of tradeoff decisions happen day-to-day and moment-to-moment.

Remember that the next time someone is criticized for “not being careful enough” after a change brings down production.

The “CrowdStrike” approach to reliability work

There’s a lot we simply don’t know about how reliability work was prioritized inside of CrowdStrike, but I’m going to propose a little thought experiment about the incident where I make some assumptions.

First, let’s assume that the CrowdStrike incident was the first time they had an incident that was triggered by a Rapid Response Content update, which is a config update. We’ll assume that previous sensor issues that led to Linux crashes were related to a sensor release, which is a code update.

Next, let’s assume that CrowdStrike focuses their reliability work on addressing the identified root cause of previous incidents.

Finally, let’s assume that none of the mitigations documented in their RCA were identified as action items that addressed the root cause of any incidents they experienced before this big one.

If these three assumptions are true, then it explains why these mitigations weren’t done previously: they didn’t address the root causes of previous incidents, and the company focused its post-incident work on addressing those root causes. Now, I have no idea if any of these assumptions are actually true, but they sound plausible enough for this thought experiment to hold.

This thought experiment demonstrates the danger of focusing post-incident work on addressing the root causes of previous incidents: it acts to obscure other risks in the system that don’t happen to fit into the root cause analysis. After all, issues around validation of the channel files or staging of deploys were not really the root cause of any of the incidents before this one. The risks that are still in your system don’t care about what you have labeled “the real root cause” of the previous incident, and there’s no reason to believe that whatever gets this label is the thing that is most likely to bite you in the future.

I propose (cheekily) to refer to this prioritize-identifying-and-addressing-the-root-cause-of-previous-incidents thinking as the “CrowdStrike” approach to reliability work.

I put “CrowdStrike” in quotes because, in a sense, this really isn’t about them at all: I have no idea if the assumptions in this thought experiment are true. But my motivation for using this phrase is more about using CrowdStrike as a symbol that’s become salient to our industry than about the particular details of that company.

Are you on the lookout for the many different signals of risk in your system, or are you taking the “CrowdStrike” approach to reliability work?

CrowdStrike: how did we get here?

CrowdStrike has released their final (sigh) External Root Cause Analysis doc. The writeup contains some more data on the specific failure mode. I’m not going to summarize it here, mostly because I don’t think I’d add any value in doing so: my knowledge of this system is no better than that of anyone else reading the report. I must admit, though, that I couldn’t help thinking of number eleven of Alan Perlis’s Epigrams on Programming:

If you have a procedure with ten parameters, you probably missed some.

What I wanted to do instead with this blog is call out the last two of the “findings and mitigations” in the doc:

  • Template Instance validation should expand to include testing within the Content Interpreter
  • Template Instances should have staged deployment

This echoes the chorus of responses I heard online in the aftermath of the outage: “Why didn’t they test these configs before deployment? How could they not stage their deploys?”

And this is my biggest disappointment with this writeup: it doesn’t provide us with insight into how the system got to this point.

Here are the types of questions I like to ask to try to get at this.

Had a rapid response content update ever triggered a crash before in the history of the company? If not, why do you think this type of failure (crash related to rapid response content) has never bitten the company before? If so, what happened last time?

Was there something novel about the IPC template type? (e.g., was this the first time the reading of one field was controlled by the value of another?)

How is generation of the test template instances typically done? Was the test template instance generation here a typical case or an exceptional one? If exceptional, what was different? If typical, how come it has never led to problems before?

Before the incident, had customers ever asked for the ability to do staged rollouts? If so, how was that ask prioritized relative to other work?

Was there any planned work to improve reliability before the incident happened? What type of work was planned? How far along was it? How did you prioritize this work?

I know I’m a broken record here, but I’ll say it again. Systems reach the current state that they’re in because, in the past, people within the system made rational decisions based on the information they had at the time and the constraints that they were operating under. The only way to understand how incidents happen is to try to reconstruct the path that the system took to get here, and that means trying, as best as you can, to recreate the context that people were operating under when they made those decisions.

In particular, availability work tends to go to the areas where there was previously evidence of problems. That tends to be where I try to pick at things. Did we see problems in this area before? If we never had problems in this area before, what was different this time?

If we did see problems in the past, and those problems weren’t addressed, then that leads to a different set of questions. There are always more problems than resources, which means that orgs have to figure out what they’re going to prioritize (say “quarterly planning” to any software engineer and watch the light fade from their eyes). How does prioritization happen at the org?

It’s too much to hope for a public writeup to ever give that sort of insight, but I was hoping for something more about the story of “How we got here” in their final writeup. Unfortunately, it looks like this is all we get.

Incidents as keys to a lot of value

One of the workhorses of the modern software world is the key-value store. There are key-value services such as Redis or Dynamo, and some languages build key-value data structures right into the language (examples include Go, Python, and Clojure). Even relational databases, which are not themselves key-value stores, are frequently built on top of a data structure that exposes a key-value interface.

Key-value data structures are often referred to as maps or dictionaries, but here I want to call attention to the less-frequently-used term associative array. This term evokes the associative nature of the store: the value that we store is associated with the key.
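
To make the association concrete, here is a minimal sketch in Python (one of the languages mentioned above); the keys and values are invented for illustration.

```python
# A minimal sketch of an associative array using Python's built-in dict.
store = {}

# Associate values with keys.
store["user:42"] = {"name": "Ada", "plan": "pro"}
store["user:43"] = {"name": "Grace", "plan": "free"}

# Retrieve a value via its associated key.
print(store["user:42"]["name"])  # -> Ada

# A key that was never associated with a value has nothing to return,
# so lookups often supply a default.
print(store.get("user:99", "not found"))  # -> not found
```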

When I work on an incident writeup, I try to capture details and links to the various artifacts that the incident responders used in order to diagnose and mitigate the incident. Examples of such details include:

  • dashboards (with screenshots that illustrate the problem and links to the dashboard)
  • names of feature flags that were used or considered for remediation
  • relevant Slack channels where there was coordination (other than the incident channel)

Why bother with these details? I see it as a strategy to tackle the problem that Woods and Cook refer to in Perspectives on Human Error: Hindsight Biases and Local Rationality as the problem of inert knowledge. There may be information that we learned at some point, but we can’t bring it to bear when an incident is happening. However, we humans are good at remembering other incidents! And so, my hope is that when an operational surprise happens, someone will remember “Oh yeah, I remember reading about something like this when incident XYZ happened”, and then they can go look up the writeup for incident XYZ and see the details that they need to help them respond.

In other words, the previous incidents act as keys, and the content of the incident write-ups act as the value. If you make the incident write-ups memorable, then people may just remember enough about them to look up the write-ups and page in details about the relevant tools right when they need them.
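
Here’s a hedged sketch of that analogy in Python; the incident name, links, and flag names are invented for illustration, not taken from any real writeup.

```python
# Memorable incidents act as keys; the operational details captured in
# the writeup act as the value. All identifiers here are hypothetical.
writeups = {
    "2023-03 db failover incident": {
        "dashboards": ["https://example.internal/dash/db-health"],
        "feature_flags": ["enable_read_replica_fallback"],
        "slack_channels": ["#db-oncall"],
    },
}

# "Oh yeah, I remember reading about something like this..."
details = writeups["2023-03 db failover incident"]
print(details["feature_flags"])  # the knob responders reached for last time
```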

Second-class interactions are a first-class risk

Below is a screenshot of Vizceral, a tool that was built by a former teammate of mine at Netflix. It provides a visualization of the interactions between the various microservices.

Vizceral uses moving dots to depict how requests are currently flowing through the Netflix microservice architecture. Vizceral is able to do its thing because of the platform tooling, which provides support for generating a visualization like this by exporting a standard set of inter-process communication (IPC) metrics.

What you don’t see depicted here are the interactions between those microservices and the telemetry platform that ingests these metrics. There’s also logging and tracing data, and those get shipped off-box via different channels, but none of those channels show up in this diagram.

In fact, this visualization doesn’t represent interactions with any of the platform services. You won’t see bubbles that represent the compute platform or the CI/CD platform represented in a diagram like this, even though those platform services all interact with these application services in important ways.

I call the first category of interactions, the ones between the application services, first-class, and the second category, the ones that involve platform services, second-class. It’s those second-class interactions that I want to say more about.

These second-class interactions tend to have a large blast radius, because successful platforms by their nature have a large blast radius. There’s a reason why there’s so much havoc out in the world when AWS’s us-east-1 region has a problem: because so many services out there are using us-east-1 as a platform. Similarly, if you have a successful platform within your organization, then by definition it’s going to see a lot of use, which means that if it experiences a problem, it can do a lot of damage.

These platforms are generally more reliable than the applications that run atop them, because they have to be: platforms naturally have higher reliability requirements than the applications that run atop them. They have these requirements because they have a large blast radius. A flaky platform is a platform that contributes to multiple high-severity outages, and systems that contribute to multiple high-severity outages are the systems where reliability work gets prioritized.

And a reliable system is a system whose details you aren’t aware of, because you don’t need to be. If my car is very reliable, then I’m not going to build an accurate mental model of how my car works, because I don’t need to: it just works. In her book Human-Machine Reconfigurations: Plans and Situated Actions, the anthropologist Lucy Suchman used the term representation to describe the activity of explicitly constructing a mental model of how a piece of technology works, and she noted that this type of cognitive work only happens when we run into trouble. As Suchman puts it:

[R]epresentation occurs when otherwise transparent activity becomes in some way problematic

Hence the irony: these second-class interactions tend not to be represented in our system models when we talk about reliability, because they are generally not problematic.

And so we are lulled into a false sense of security. We don’t think about how the plumbing works, because the plumbing just works. Until the plumbing breaks. And then we’re in big trouble.

Expect it most when you expect it least

[Image: Homer Simpson saying “It’s probably the person you least suspect.” Caption: Homer Simpson, philosopher]

Yesterday, CrowdStrike released a Preliminary Post-Incident Review of the major outage that happened last week. I’m going to wait until the full post-incident review is released before I do any significant commentary, and I expect we’ll have to wait at least a month for that. But I did want to highlight one passage from the section titled What Happened on July 19, 2024, emphasis mine:

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

And now, let’s reach way back into the distant past of three weeks ago, when the Canadian Radio-television and Telecommunications Commission (CRTC) posted an executive summary of a major outage, which I blogged about at the time. Here’s the part I want to call attention to, once again, emphasis mine:

Rogers had initially assessed the risk of this seven-phased process as “High.” However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.

In both cases, the engineers had built up confidence over time that the types of production changes they were making were low risk.

When we’re doing something new with a technology, we tend to be much more careful with it: it’s riskier, and we’re still shaking things out. But, over time, after there haven’t been any issues, we start to gain more trust in the tech, confident that it’s a reliable technology. Our internal perception of the risk adjusts based on our experience, and we come to believe that the risks of these sorts of changes are low. Any organization that concentrates its reliability efforts on action items in the wake of an incident, rather than on the normal work that doesn’t result in incidents, is implicitly making this type of claim. The squeaky incident gets the reliability grease. And, indeed, it’s rational to allocate your reliability effort based on your perception of risk. Any change can break us, but we can’t treat every change the same. How could it be otherwise?

The challenge for us is that large incidents are not always preceded by smaller ones, which means that there may be risks in your system that haven’t manifested as minor outages. I think these types of risks are the most dangerous ones of all, precisely because they’re much harder for the organization to see. How are you going to prioritize doing the availability work for a problem that hasn’t bitten you yet, when your smaller incidents demonstrate that you have been bitten by so many other problems?

This means that someone has to hunt for weak signals of risk and advocate for doing the kind of reliability work where there isn’t a pattern of incidents you can point to as justification. The big ones often don’t look like the small ones, and sometimes the only signal you get in advance is a still, small sound.

Quick takes on Rogers Network outage executive summary

The Canadian Radio-television and Telecommunications Commission (CRTC) has posted an executive summary of a report on a major telecom outage that happened in 2022 to Rogers Communications, which is one of the major Canadian telecom companies.

The full report doesn’t seem to be available yet, and I’m not sure if it ever will be publicly released (edit: the full report was released in November 2024). I recommend you read the executive summary, but here are some quick impressions of mine.

Note that I’m not a network engineer (I’ve only managed a single rack of servers in my time), so I don’t have any domain expertise here.

Migration!

When you hear “large-scale outage”, a good bet is that it involved a migration. The language of the report describes it as an upgrade, but I suspect this qualifies as a migration.

In the weeks leading to the day of the outage on 8 July 2022, Rogers was executing on a seven-phase process to upgrade its IP core network. The outage occurred during the sixth phase of this upgrade process.

I don’t know anything about what’s involved in a telecom upgrading its IP core network, but I do have a lot of general opinions about migrations, and I’m willing to bet they apply here as well.

I think of migrations as high-impact, bespoke changes that the system was not originally designed to accommodate.

They’re high-impact because things can go quite badly if something goes wrong. If you’ve worked at a larger company, you’ve probably experienced migrations that seem to take forever, and this is one of the reasons why: there’s a lot of downside risk in doing migration work (and often not much immediate upside benefit for the people who have to do the work, but that’s a story for another day).

Migrations are bespoke in the sense that each migration is a one-off. This makes migrations even more dangerous because:

  • The organization doesn’t have any operational muscles around doing any particular migration, because each one is new.
  • Because each migration is unique, it’s not worth the effort to build tooling to support doing the migration. And even if you build tools, those tools will always be new, which means they haven’t been hardened through production use.

There’s a reason why you hear about continuous integration and continuous delivery but not continuous migration, even though every org past a certain age will have multiple migrations in flight.

Finally, migrations are changes that the system was not originally designed to accommodate. In my entire career, during the design of a new system, I have never heard anyone ask, “How are we going to migrate off of this new system at the end of its life?” We just don’t design for migrating off of things. I don’t even know if it’s possible to do so.

Saturation!

Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration. When the core network routers crashed, user traffic could no longer be routed to the appropriate destination. Consequently, services such as mobile, home phone, Internet, business wireline connectivity, and 9-1-1 calling ceased functioning.

Saturation is a term from resilience engineering that refers to a system being pushed to the limit of the amount of load that it can handle. It’s remarkable how many outages in distributed systems are related to some part of the system being overloaded, or hitting a rate limit, or exceeding some other limit. (For example, see Slack’s Jan 2021 outage). This incident is another textbook example of a brittle system, which falls over when it becomes saturated.
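
As a toy illustration of saturation (a deliberately simplified sketch, not a model of the Rogers routers), consider a consumer with a fixed capacity: everything flows until the limit is reached, and then the system fails abruptly rather than degrading gracefully.

```python
# A toy sketch of saturation: a bounded buffer absorbs load up to its
# capacity, then fails abruptly. The capacity is an arbitrary number.
from collections import deque

CAPACITY = 1000
buffer = deque()

def receive(update: str) -> None:
    if len(buffer) >= CAPACITY:
        # A brittle system has no plan for this branch; in the Rogers
        # outage, the core routers simply crashed when overloaded.
        raise RuntimeError("saturated: cannot keep up with incoming load")
    buffer.append(update)

# Fine under normal load...
for i in range(CAPACITY):
    receive(f"route-update-{i}")

# ...but the flood past the limit is what brings the system down.
try:
    receive("one-update-too-many")
except RuntimeError as err:
    print(err)
```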

Perception of risk

I mentioned earlier that migrations are risky, and everyone knows migrations are risky. Rogers engineers knew that as well:

Rogers had initially assessed the risk of this seven-phased process as “High.”

Ironically, the fact that the migration had gone smoothly up until that point led them to revise their risk assessment downwards.

However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk, including the change that caused the July 2022 outage.

I wrote about this phenomenon in a previous post, Any change can break us, but we can’t treat every change the same. The engineers gained confidence as they progressed through the migration, and things went well. Which is perfectly natural. In fact, this is one of the strengths of the continuous delivery approach: you build enough confidence that you don’t have to babysit every single deploy anymore.

But the problem is that we can never perfectly assess the risk in the system. And no matter how much confidence we build up, that one change that we believe is safe can end up taking down the whole system.
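
We don’t know how Rogers’ risk assessment algorithm actually worked, but here’s a hedged sketch of the general failure mode: a scorer that downgrades assessed risk after a streak of successes, even though success in earlier phases tells us little about the next change.

```python
# A hypothetical risk scorer, NOT Rogers' actual algorithm: assessed
# risk drops as successful phases accumulate, even though the inherent
# risk of the next change hasn't actually gone down.
def assessed_risk(initial_risk: str, consecutive_successes: int) -> str:
    levels = ["Low", "Medium", "High"]
    idx = levels.index(initial_risk)
    # Every two smooth phases nudge the assessment down one level.
    idx = max(0, idx - consecutive_successes // 2)
    return levels[idx]

for phase in range(1, 7):
    print(f"phase {phase}: {assessed_risk('High', phase - 1)}")
# phases 1-2: High, phases 3-4: Medium, phases 5-6: Low.
# The sixth phase is rated "Low" purely because phases 1-5 went well.
```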

I should note that the report is pretty blame-y when it comes to this part:

 Downgrading the risk assessment to “Low” for changing the Access Control List filter in a routing policy contravenes industry norms, which require high scrutiny for such configuration changes, including laboratory testing before deploying in the production network.

I wish I had more context here. How did it make sense to them at the time? What sorts of constraints or pressures were they under? Hopefully the full report reveals more details.

Cleanup

Rogers staff deleted the policy filter that prevented IP route flooding in an effort to clean up the configuration files of the distribution routers. 

Cleanup work has many of the same risks as migration work: it’s high-impact and bespoke. Say “cleanup script” to an SRE and watch the reaction on their face.

But not cleaning up is also a risk! The solution can’t be “never do cleanup” in the same way it can’t be “never do migrations”. Rather, we need to recognize that this work always involves risk trade-offs. There’s no safe path here.

Failure mode makes incident response harder

At the time of the July 2022 outage, Rogers had a management network that relied on the Rogers IP core network. When the IP core network failed during the outage, remote Rogers employees were unable to access the management network. …

Rogers staff relied on the company’s own mobile and Internet services for connectivity to communicate among themselves. When both the wireless and wireline networks failed, Rogers staff, especially critical incident management staff, were not able to communicate effectively during the early hours of the outage. 

When an outage affects not just your customers but also your engineers doing incident response, life gets a whole lot harder.

This brings to mind the Facebook outage from 2021:

[A]s our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this. 

Component substitution fallacy

The authors point out that the system was not designed to handle this sort of overload:

Absence of router overload protection.  The July 2022 outage exposed the absence of overload protection on the core network routers. The network failure could have been prevented had the core network routers been configured with an overload limit that specifies the maximum acceptable number of IP routing data the router can support. However, the Rogers core network routers were not configured with such overload protection mechanisms. Hence, when the policy filter was removed from the distribution router, an excessive amount of routing data flooded the core routers, which led them to crash.

This is a great example of the component substitution fallacy: it fails to acknowledge the explicit trade-offs that organizations make about which parts of the system to work on. Note that the Rogers engineers will certainly build in router overload protection now, but that means engineering effort that won’t be spent building protections against other failure modes that haven’t happened yet.

Acknowledging trade-offs

To the authors’ credit, they explicitly acknowledge the tradeoffs involved in the overall design of the system.

The Rogers network is a national Tier 1 network and is architecturally designed for reliability; it is typical of what would be expected of such a Tier 1 service provider network. The July 2022 outage was not the result of a design flaw in the Rogers core network architecture. However, with both the wireless and wireline networks sharing a common IP core network, the scope of the outage was extreme in that it resulted in a catastrophic loss of all services. Such a network architecture is common to many service providers and is an example of the trend of converged wireline and wireless telecom networks. It is a design choice by service providers, including Rogers, that seeks to balance cost with performance.

I really hope the CRTC eventually releases the full report; I’m looking forward to reading it.