Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk:

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context is a description of the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

A small mistake does not a complex systems failure make

I’ve been trying to take a break from Twitter lately, but today I popped back in, only to be trolled by a colleague of mine:

Here’s a quote from the story:

The source of the problem was reportedly a single engineer who made a small mistake with a file transfer.

Here’s what I’d like you to ponder, dear reader. Think about all of the small mistakes that happen every single day, at every workplace, on the face of the planet. If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time. And, yet, that clearly isn’t the case. For example, before this event, when was the last time the FAA suffered a catastrophic outage?

Now, it might be the case that no FAA employees have ever made a small mistake until now. Or, more likely, the FAA system works in such a way that small mistakes are not typically sufficient for taking down the entire system.

To understand this failure mode, you need to understand how it is that the FAA system is able to stay up and running on a day-to-day basis, despite the system being populated by fallible human beings who are capable of making small mistakes. You need to understand how the system actually works in order to make sense of how a large-scale failure can happen.

Now, I’ve never worked in the aviation industry, and consequently I don’t have domain knowledge about the FAA system. But I can tell you one thing: a small mistake with a file transfer is a hopelessly incomplete explanation for how the FAA system actually failed.

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident have to process a large number of signals that are coming in. One example of such signals is the large number of Slack messages appearing in the incident channel as incident responders and others provide updates and ask questions. Another example would be additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to mediate through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication would be officially discouraged (“all incident-related communication should happen in the incident response channel!”), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.

Southwest airlines: a case study in brittleness

What happens to a complex system when it gets pushed just past its limits?

In some cases, the system in question is able to change its limits so that it can handle the new stressors thrown at it. When a system is pushed beyond its design limit, it has to change the way that it works: it needs to adapt its own processes and work in a different way.

We use the term resilience to describe the ability of a system to adapt how it does its work, and this is what resilience engineering researchers study. These researchers have identified multiple factors that foster resilience. For example, people on the front lines of the system need autonomy to be able to change the way they work, and they also need to coordinate effectively with others in the system. A system under stress inevitably needs access to additional resources, which means that there needs to be extra capacity held in reserve. People need to be able to anticipate trouble ahead, so that they can prepare to change how they work and deploy that extra capacity.

However, there are cases when systems fail to adapt effectively when pushed just beyond their limits. These systems face what Woods and Branlat call decompensation: they exhaust their ability to keep up with the demands placed on them, and their performance falls sharply off of a cliff. This behavior is the opposite of resilience, and researchers call it brittleness.

The ongoing problems facing Southwest Airlines provide us with a clear example of brittleness. External factors such as the large winter storm pushed the system past its limit, and it was not able to compensate effectively in the face of these stressors.

There are many reports in the media now about different factors that contributed to Southwest’s brittleness. I think it’s too early to treat these as definitive. A proper investigation will likely take weeks if not months, and when it is finally complete, I’m sure it will identify additional factors that haven’t been reported on yet.

But one thing we can be sure of at this point is that Southwest Airlines fell over when pushed beyond its limits. It was brittle.

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure

All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.

Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”

Goal conflicts

Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.

If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”

Workarounds

How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)

Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.

If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools they have to use?”

Automation surprises

Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).

If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”


Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.

Cache invalidation really is one of the hardest problems in computer science

My colleagues recently wrote a great post on the Netflix tech blog about a tough performance issue they wrestled with. They ultimately diagnosed the problem as false sharing, which is a performance problem that involves caching.

I’m going to take that post and write a simplified version of part of it here, as an exercise to help me understand what happened. After all, the best way to understand something is to try to explain it to someone else.

But note that the topic I’m writing about here is outside of my personal area of expertise, so caveat lector!

The problem: two bands of CPU performance

The post includes a graph that illustrates the problem: it shows CPU utilization for different virtual machine instances (nodes) inside of a cluster. Note that all of the nodes are configured identically, including running the same application logic and taking the same traffic.

There are two “bands”: a low band at around 15-20% CPU utilization, and a high band that varies a lot, from about 25% to 90%.

Caching and multiple cores

Computer programs keep the data that they need in main memory. The problem with main memory is that accessing it is slow in computer time. According to this site, a CPU instruction cycle is about 400ps, and accessing main memory (DRAM access) is 50-100ns, which means it takes ~ 125 – 250 cycles. To improve performance, CPUs keep some of the memory in a faster, local cache.
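To spell out the arithmetic behind that cycle count, using the numbers above (roughly 400 ps, or 0.4 ns, per cycle):

50 ns  ÷ 0.4 ns/cycle ≈ 125 cycles
100 ns ÷ 0.4 ns/cycle ≈ 250 cycles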

There’s a tradeoff between the size of the cache and its speed, and so computer architects use a hierarchical cache design where they have multiple caches of different sizes and speeds. It was an interaction pattern with the fastest on-core cache (the L1 cache) that led to the problem described here, so that’s the cache we’ll focus on in this post.

If you’re a computer engineer designing a multi-core system where each core has on-core cache, your system has to implement a solution for the problem known as cache coherency.

Cache coherency

Imagine a multi-threaded program where each thread is running on a different core:

  • thread T1 runs on CPU 1
  • thread T2 runs on CPU 2

There’s a variable used by the program, which we’ll call x.

Let’s also assume that both threads have previously read x, so the memory associated with x is loaded into both of their caches.

Now imagine thread T1 modifies x, and then T2 reads x.


// assume x == 0 before this point

T1                         T2
--                         --
x = x + 1
                           if (x == 0) {
                             // without cache coherency, T2's stale
                             // cached copy of x would still be 0,
                             // so this branch would run -- it shouldn't!
                           }

The problem is that T2’s local cache has become stale, and so it reads a value that is no longer valid.

The term cache coherency refers to the problem of ensuring that local caches in a multi-core (or, more generally, distributed) system stay in sync.

This problem is solved by a hardware component called a cache controller. The cache controller can detect when values in a cache have been modified on one core and whether another core has cached the same data; in that case, it invalidates the stale cache entry. In the example above, the cache controller would invalidate the entry for x in CPU 2’s cache, so when T2 went to read the variable x, it would have to read the data from main memory into the core again.

Cache coherency ensures that the behavior is correct, but every time a cache line is invalidated, the core that needs that data again pays the performance penalty of reading it from main memory.

Note that a cache stores both the data and the addresses in main memory that the data came from: we only need to invalidate cache entries that correspond to the same range of memory.

Data gets brought into cache in chunks

Let’s say a program needs to read data from main memory. For example, let’s say it needs to read the variable named x. Let’s assume x is implemented as a 32-bit (4 byte) integer. When the CPU reads from main memory, the memory that holds the variable x will be brought into the cache.

But the CPU won’t just read the variable x into cache. It will read a contiguous chunk of memory that includes the variable x. On x86 systems, the size of this chunk is 64 bytes. This means that accessing the 4 bytes that encode the variable x actually ends up bringing 64 bytes along for the ride.

These chunks of memory stored in the cache are referred to as cache lines.
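To make cache lines a little more concrete, here’s a small C++ sketch (my own illustration, not code from the post) that prints which 64-byte line a variable’s address falls on, assuming the x86 line size described above:

#include <cstdint>
#include <cstdio>

int main() {
    // Assume the 64-byte x86 cache line size described above.
    constexpr std::uintptr_t kLineSize = 64;

    std::int32_t x = 42;  // a 4-byte integer, like the x in the example

    // A cache line boundary is the address rounded down to a multiple of
    // the line size: reading x pulls that entire 64-byte line into cache.
    auto addr = reinterpret_cast<std::uintptr_t>(&x);
    std::printf("x lives at %p, on the cache line starting at %p\n",
                reinterpret_cast<void*>(addr),
                reinterpret_cast<void*>(addr & ~(kLineSize - 1)));
}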

False sharing

We now almost have enough context to explain the failure mode. Here’s a C++ code snippet from the OpenJDK repository (from src/hotspot/share/oops/klass.hpp):

class Klass : public Metadata {
  ...
  // Cache of last observed secondary supertype
  Klass*      _secondary_super_cache;
  // Array of all secondary supertypes
  Array<Klass*>* _secondary_supers;

This declares two pointer variables inside of the Klass class: _secondary_super_cache, and _secondary_supers. Because these two variables are declared one after the other, they will get laid out next to each other in memory.

The two variables are adjacent in main memory.

The _secondary_super_cache is, itself, a cache. It’s a very small cache, one that holds a single value. It’s used in a code path for dynamically checking whether a particular Java class is a subtype of another class. This code path isn’t commonly exercised, but it does get used by programs that dynamically create classes at runtime.

Now imagine the following scenario:

  1. There are two threads: T1 on CPU 1, T2 on CPU 2
  2. T1 wants to write the _secondary_super_cache variable and already has the memory associated with the _secondary_super_cache variable loaded in its L1 cache
  3. T2 wants to read from the _secondary_supers variable and already has the memory associated with the _secondary_supers variable loaded in its L1 cache.

When T1 (CPU 1) writes to _secondary_super_cache, if CPU 2 has the same block of memory loaded in its cache, then the cache controller will invalidate that cache line in CPU 2.

But if that cache line contained the _secondary_supers variable, then CPU 2 will have to reload that data from memory to do its read, which is slow.


This phenomenon, where the cache controller invalidates a core’s cached data that is not stale and that the core still needs, simply because that data happens to sit on the same cache line as stale data, is called false sharing.
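If you want to see false sharing for yourself, here’s a minimal C++ sketch (my own illustration, not code from the Netflix post or from OpenJDK) that increments two independent counters from two threads: once with the counters adjacent in memory, and once with them padded onto separate cache lines via alignas(64):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters that sit next to each other in memory, so they will
// typically share a cache line (analogous to _secondary_super_cache
// and _secondary_supers being adjacent).
struct Adjacent {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same two counters, each aligned to its own 64-byte cache line.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long run_ms() {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    // Each thread touches only its own counter; there is no logical sharing.
    std::thread t1([&c] { for (int i = 0; i < 10000000; ++i) c.a++; });
    std::thread t2([&c] { for (int i = 0; i < 10000000; ++i) c.b++; });
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    std::printf("adjacent counters: %lld ms\n", run_ms<Adjacent>());
    std::printf("padded counters:   %lld ms\n", run_ms<Padded>());
}

On a typical multi-core machine the adjacent version runs noticeably slower, even though the two threads never read or write each other’s counter: every increment on one core invalidates the shared line in the other core’s cache.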

What’s the probability of false sharing in this scenario?

In this case, the two variables are both pointers. On this particular CPU architecture, pointers are 64 bits, or 8 bytes. The L1 cache line size is 64 bytes. That means a cache line can store 8 pointers. Or, put another way, a pointer can occupy one of 8 positions in the cache line.

There’s only one scenario where the two variables don’t end up on the same cache line: when _secondary_super_cache occupies position 8 (the last slot of its cache line) and _secondary_supers occupies position 1 of the next line. In all of the other scenarios, the two variables will occupy the same cache line, and hence will be vulnerable to false sharing.

1 / 8 = 12.5%, and that’s roughly the fraction of nodes that were observed in the low band in this scenario.
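As a sanity check on that 1-in-8 figure, here’s a tiny C++ sketch (again, my own illustration) that enumerates the eight 8-byte-aligned positions the first pointer could occupy within a 64-byte line and counts how many of them leave the two adjacent pointers on the same line:

#include <cstdio>

int main() {
    const int kLineSize = 64;  // bytes per L1 cache line
    const int kPtrSize  = 8;   // bytes per pointer

    int same_line = 0, total = 0;
    // Try each 8-byte-aligned offset the first pointer could start at.
    for (int offset = 0; offset < kLineSize; offset += kPtrSize) {
        int first_line  = offset / kLineSize;              // always 0
        int second_line = (offset + kPtrSize) / kLineSize; // 1 only when offset == 56
        ++total;
        if (first_line == second_line) ++same_line;
    }
    std::printf("%d of %d layouts share a cache line (%.1f%% do not)\n",
                same_line, total, 100.0 * (total - same_line) / total);
}

Running it prints 7 of 8 layouts sharing a line, i.e., 12.5% do not.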

And now I recommend you take another look at the original blog post, which has a lot more details, including how they solved this problem, as well as a new problem that emerged once they fixed this one.

Writing docs well: why should a software engineer care?

Recently I gave a guest lecture in a graduate level software engineering course on the value of technical writing for software engineers. This post is a sort of rough transcript of my talk.

I live-sketched my slides as I went.

I talked about three goals of doing technical writing.

The first one is about building shared understanding among the stakeholders of a document. One of the hardest problems in software engineering is getting multiple people to have a sufficient understanding of some technical aspect, like the actual problem being solved, or a proposed solution. This is ostensibly the only real goal of technical writing.

Shared understanding is related to the idea of common ground that you’ll sometimes hear the safety folks talk about.

If you’re a programmer who works completely alone, then this is a problem you generally don’t have to solve, because there’s only one person involved in the software project.

But as soon as you are working in a team, then you have to address the problem of shared understanding.

When we work on something technical, like software, we develop a much deeper understanding because we’re immersed in it. This can make communication hard when we’re talking to someone who hasn’t been working in the same area and so doesn’t have the same level of technical understanding of that particular bit.

If you’re working only with a small, co-located group (e.g., in a co-located startup), then having a discussion in front of a whiteboard is a very effective mechanism for building shared understanding. In this type of environment, writing effective technical docs is much less important.

The problem with the discuss-in-front-of-the-whiteboard approach is that it doesn’t scale up, and it also doesn’t work for distributed environments.

And this is where technical documents come in.

I like to say that the hardest problem in software engineering is getting the appropriate information into the heads of the people who need to have that information in order to do their work effectively.

In large organizations, a lot of the work is interconnected, which means that some work that somebody else is doing can affect your work. If you’re not aware of that, you can end up working at cross-purposes.

The challenge is that there’s so much potential information that might be useful. Everyone could potentially spend all of their working hours reading docs, and still not read everything that might be relevant.

Writing a doc well means getting people to a sufficient level of understanding so that you can coordinate work effectively.

The second goal of writing I talked about was using writing to help with your own thinking.

The cartoonist Richard Guindon has a famous quote: “writing is nature’s way of letting you know how sloppy your thinking is.” You might have an impression that you understand something well, but that sense of clarity is often an illusion, and when you go to explicitly capture your understanding in a document, you discover that you didn’t understand things as well as you thought. There’s nowhere to hide in your own document.

When writing technical docs, I always try hard to work explicitly through examples to demonstrate the concepts. One of the biggest weaknesses I see in practice is that the author has not described a scenario from start to finish. Conceptually, you want your doc to have something like the storyboards used in the film industry, to tell the story. Writing out a complete example will force you to confront the gaps in your understanding.

The third goal is a bit subversive: it’s how to use effective technical writing to have influence in a larger organization when you’re at the bottom of the hierarchy.

If you want influence, you likely have some sort of vision of where you want the broader organization to go, and the challenge is to persuade people of influence to move things closer to your vision.

Because senior leadership, like everyone else in the organization, only has a finite amount of time and attention, their view of reality is shaped by the interactions they do have, which are largely meetings and documents. Effective technical documents shape the view of reality that leadership has, but only if they’re written well.

If you frame things right, you can make it seem as if your view is reality rather than simply your opinion. But this requires skill.

Software engineers often struggle to write effective docs. And that’s understandable, because writing effective technical docs is very difficult.

Anyone who has sat down at a computer to write a doc and stared at the blinking cursor in an empty document knows how difficult it can be to just get started.

Even the best-written technical docs aren’t necessarily easy to read.

Poorly written docs are hard to read. However, just because a doc is hard to read, doesn’t mean it’s poorly written!

This talk is about technical writing, but technical reading is also a skill. Often, we can’t understand a paragraph in a technical document without having a good grasp of the surrounding context. But we also can’t understand the context without reading the individual paragraphs, not only of this document, but of other documents as well!

This means we often can’t understand a technical document by reading from beginning to end. We need to move back and forth between working to understand the text itself and working to understand the wider context. This pattern is known as the hermeneutic circle, and it is used in Biblical studies.

Finally, some pieces of advice on how to improve your technical writing.

Know explicitly in advance what your goal is in doing the writing. Writing to improve your own understanding is different from writing to improve someone else’s understanding, or to persuade someone else.

Make sure your technical document has concrete examples. These are the hardest to write, but they are most likely to help achieve your goals in your document.

Get feedback on your drafts from people that you trust. Even the best writers in the world benefit from having good editors.

Time keeps on slippin’: A review of “Four thousand weeks”

Four Thousand Weeks: Time Management for Mortals by Oliver Burkeman

Burkeman isn’t interested in helping you get more done. The problem, he says, is that attempting to be more productive is a trap. Instead, what he advocates is that you change your perspective to use your time *well*, rather than trying to get as much done as possible.

This is really an anti-productivity book, and a fantastic one at that. Burkeman urges us to embrace the fact that we only have a limited amount of time (“four thousand weeks” is an allusion to the average lifespan), and that we should embrace this limit rather than try to fight against it.

Holding yourself to impossible standards is a recipe for misery, he reminds us, whether it’s trying to complete all of the items on our todo lists or trying to be the person we ought to be rather than looking at who we actually are: what our actual strengths and weaknesses are, and what we genuinely enjoy doing.

The time management skills that Burkeman encourages are the ones that will reduce the amount of time pressure that we experience. Learn how to say “no” to the stuff that you want to do, but that you want to do less than the other stuff. Learn to make peace with the fact that you will always feel overwhelmed.

“Let go”, Burkeman urges us. After all, in the grand scheme of things, the work that we do doesn’t matter nearly as much as we think.

(Cross-posted to goodreads)