Does any of this sound familiar?

The other day, David Woods responded to one of my tweets with a link to a talk:

I had planned to write a post on the component substitution fallacy that he referenced, but I didn’t even make it to minute 25 of that video before I heard something else that I had to post first, at the 7:54 mark. The context is a description of the state of NASA as an organization at the time of the Space Shuttle Columbia accident.

And here’s what Woods says in the talk:

Now, does this describe your organization?

Are you in a changing environment under resource pressures and new performance demands?

Are you being pressured to drive down costs, work with shorter, more aggressive schedules?

Are you working with new partners and new relationships where there are new roles coming into play, often as new capabilities come to bear?

Do we see changing skillsets, skill erosion in some areas, new forms of expertise that are needed?

Is there heightened stakeholder interest?

Finally, he asks:

How are you navigating these seas of complexity that NASA confronted?

How are you doing with that?

A small mistake does not a complex systems failure make

I’ve been trying to take a break from Twitter lately, but today I popped back in, only to be trolled by a colleague of mine:

Here’s a quote from the story:

The source of the problem was reportedly a single engineer who made a small mistake with a file transfer.

Here’s what I’d like you to ponder, dear reader. Think about all of the small mistakes that happen every single day, at every workplace, on the face of the planet. If a small mistake was sufficient to take down a complex system, then our systems would be crashing all of the time. And, yet, that clearly isn’t the case. For example, before this event, when was the last time the FAA suffered a catastrophic outage?

Now, it might be the case that no FAA employees have ever made a small mistake until now. Or, more likely, the FAA system works in such a way that small mistakes are not typically sufficient for taking down the entire system.

To understand this failure mode, you need to understand how it is that the FAA system is able to stay up and running on a day-to-day basis, despite the system being populated by fallible human beings who are capable of making small mistakes. You need to understand how the system actually works in order to make sense of how a large-scale failure can happen.

Now, I’ve never worked in the aviation industry, and consequently I don’t have domain knowledge about the FAA system. But I can tell you one thing: a small mistake with a file transfer is a hopelessly incomplete explanation for how the FAA system actually failed.

The Allspaw-Collins effect

While reading Laura Maguire’s PhD dissertation, Controlling the Costs of Coordination in Large-scale Distributed Software Systems, I came across a term I hadn’t heard before: the Allspaw-Collins effect:

An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel (p.81)”

Here’s my understanding:

A group of people responding to an incident have to process a large number of signals that are coming in. One example of such signals is the large number of Slack messages appearing in the incident channel as incident responders and others provide updates and ask questions. Another example would be additional alerts firing.

If there’s a designated incident commander (IC) who is responsible for coordination, the IC can become a bottleneck if they can’t keep up with the work of processing all of these incoming signals.

The effect captures how incident responders will sometimes work around this bottleneck by forming alternate communication channels so they can coordinate directly with each other, without having to mediate through the IC. For example, instead of sending messages in the main incident channel, they might DM each other or communicate in a separate (possibly private) channel.

I can imagine how this sort of side-channel communication would be officially discouraged (“all incident-related communication should happen in the incident response channel!”), and also how it can be adaptive.

Maguire doesn’t give the first names of the people the effect is named for, but I strongly suspect they are John Allspaw and Morgan Collins.

Southwest airlines: a case study in brittleness

What happens to a complex system when it gets pushed just past its limits?

In some cases, the system in question is able to change its limits so that it can handle the new stressors thrown at it. When a system is pushed beyond its design limit, it has to change the way that it works: it needs to adapt its own processes and work in a different way.

We use the term resilience to describe the ability of a system to adapt how it does its work, and this is what resilience engineering researchers study. These researchers have identified multiple factors that foster resilience. For example, people on the front lines of the system need autonomy to be able to change the way they work, and they also need to coordinate effectively with others in the system. A system under stress inevitably needs access to additional resources, which means that there needs to be extra capacity held in reserve. People need to be able to anticipate trouble ahead, so that they can prepare to change how they work and deploy that extra capacity.

However, there are cases when systems fail to adapt effectively when pushed just beyond their limits. These systems face what Woods and Branlat call decompensation: they exhaust their ability to keep up with the demands placed on them, and their performance falls sharply off of a cliff. This behavior is the opposite of resilience, and researchers call it brittleness.

The ongoing problems facing Southwest Airlines provide us with a clear example of brittleness. External factors such as the large winter storm pushed the system past its limit, and it was not able to compensate effectively in the face of these stressors.

There are many reports in the media now about different factors that contributed to Southwest’s brittleness. I think it’s too early to treat these as definitive. A proper investigation will likely take weeks if not months, and when it is finally complete, I’m sure it will identify additional factors that haven’t been reported on yet.

But one thing we can be sure of at this point is that Southwest Airlines fell over when pushed beyond its limits. It was brittle.

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure

All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.

Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”

Goal conflicts

Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.

If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”

Workarounds

How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)

Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.

If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools they have to use?”

Automation surprises

Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).

If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”


Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.

Cache invalidation really is one of the hardest problems in computer science

My colleagues recently wrote a great post on the Netflix tech blog about a tough performance issue they wrestled with. They ultimately diagnosed the problem as false sharing, which is a performance problem that involves caching.

I’m going to take that post and write a simplified version of part of it here, as an exercise to help me understand what happened. After all, the best way to understand something is to try to explain it to someone else.

But note that the topic I’m writing about here is outside of my personal area of expertise, so caveat lector!

The problem: two bands of CPU performance

The post includes a graph that illustrates the problem: it shows CPU utilization for different virtual machine instances (nodes) inside of a cluster. Note that all of the nodes are configured identically, including running the same application logic and taking the same traffic.

There are two “bands”: a low band at around 15-20% CPU utilization, and a high band that varies a lot, from about 25% to 90%.

Caching and multiple cores

Computer programs keep the data that they need in main memory. The problem with main memory is that accessing it is slow in computer time. According to this site, a CPU instruction cycle is about 400ps, and accessing main memory (DRAM access) is 50-100ns, which means it takes ~ 125 – 250 cycles. To improve performance, CPUs keep some of the memory in a faster, local cache.
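To spell out the arithmetic behind that cycle count, using the numbers above (roughly 400 ps, or 0.4 ns, per cycle):

50 ns  ÷ 0.4 ns/cycle ≈ 125 cycles
100 ns ÷ 0.4 ns/cycle ≈ 250 cycles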

There’s a tradeoff between the size of the cache and its speed, and so computer architects use a hierarchical cache design where they have multiple caches of different sizes and speeds. It was an interaction pattern with the fastest on-core cache (the L1 cache) that led to the problem described here, so that’s the cache we’ll focus on in this post.

If you’re a computer engineer designing a multi-core system where each core has on-core cache, your system has to implement a solution for the problem known as cache coherency.

Cache coherency

Imagine a multi-threaded program where each thread is running on a different core:

  • thread T1 runs on CPU 1
  • thread T2 runs on CPU 2

There’s a variable used by the program, which we’ll call x.

Let’s also assume that both threads have previously read x, so the memory associated with x is loaded into both of their caches.

Now imagine thread T1 modifies x, and then T2 reads x.


// assume x == 0 before this point

T1                         T2
--                         --
x = x + 1
                           if (x == 0) {
                             // without cache coherency, T2's stale
                             // cached copy of x would still be 0,
                             // so this branch would run -- it shouldn't!
                           }

The problem is that T2’s local cache has become stale, and so it reads a value that is no longer valid.

The term cache coherency refers to the problem of ensuring that local caches in a multi-core (or, more generally, distributed) system stay in sync.

This problem is solved by a hardware component called a cache controller. The cache controller can detect when values in a cache have been modified on one core and whether another core has cached the same data; in that case, it invalidates the stale cache entry. In the example above, the cache controller would invalidate the entry for x in CPU 2’s cache, so when T2 went to read the variable x, it would have to read the data from main memory into the core again.

Cache coherency ensures that the behavior is correct, but every time a cache line is invalidated, the core that needs that data again pays the performance penalty of reading it from main memory.

Note that a cache stores both the data and the addresses in main memory that the data came from: we only need to invalidate cache entries that correspond to the same range of memory.

Data gets brought into cache in chunks

Let’s say a program needs to read data from main memory. For example, let’s say it needs to read the variable named x. Let’s assume x is implemented as a 32-bit (4 byte) integer. When the CPU reads from main memory, the memory that holds the variable x will be brought into the cache.

But the CPU won’t just read the variable x into cache. It will read a contiguous chunk of memory that includes the variable x. On x86 systems, the size of this chunk is 64 bytes. This means that accessing the 4 bytes that encode the variable x actually ends up bringing 64 bytes along for the ride.

These chunks of memory stored in the cache are referred to as cache lines.
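To make cache lines a little more concrete, here’s a small C++ sketch (my own illustration, not code from the post) that prints which 64-byte line a variable’s address falls on, assuming the x86 line size described above:

#include <cstdint>
#include <cstdio>

int main() {
    // Assume the 64-byte x86 cache line size described above.
    constexpr std::uintptr_t kLineSize = 64;

    std::int32_t x = 42;  // a 4-byte integer, like the x in the example

    // A cache line boundary is the address rounded down to a multiple of
    // the line size: reading x pulls that entire 64-byte line into cache.
    auto addr = reinterpret_cast<std::uintptr_t>(&x);
    std::printf("x lives at %p, on the cache line starting at %p\n",
                reinterpret_cast<void*>(addr),
                reinterpret_cast<void*>(addr & ~(kLineSize - 1)));
}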

False sharing

We now almost have enough context to explain the failure mode. Here’s a C++ code snippet from the OpenJDK repository (from src/hotspot/share/oops/klass.hpp):

class Klass : public Metadata {
  ...
  // Cache of last observed secondary supertype
  Klass*      _secondary_super_cache;
  // Array of all secondary supertypes
  Array<Klass*>* _secondary_supers;

This declares two pointer variables inside of the Klass class: _secondary_super_cache, and _secondary_supers. Because these two variables are declared one after the other, they will get laid out next to each other in memory.

The two variables are adjacent in main memory.

The _secondary_super_cache is, itself, a cache. It’s a very small cache, one that holds a single value. It’s used in a code path for dynamically checking whether a particular Java class is a subtype of another class. This code path isn’t commonly exercised, but it does get used by programs that dynamically create classes at runtime.

Now imagine the following scenario:

  1. There are two threads: T1 on CPU 1, T2 on CPU 2
  2. T1 wants to write the _secondary_super_cache variable and already has the memory associated with the _secondary_super_cache variable loaded in its L1 cache
  3. T2 wants to read from the _secondary_supers variable and already has the memory associated with the _secondary_supers variable loaded in its L1 cache.

When T1 (CPU 1) writes to _secondary_super_cache, if CPU 2 has the same block of memory loaded in its cache, then the cache controller will invalidate that cache line in CPU 2.

But if that cache line contained the _secondary_supers variable, then CPU 2 will have to reload that data from memory to do its read, which is slow.


This phenomenon, where the cache controller invalidates a core’s cached data that is not stale and that the core still needs, simply because that data happens to sit on the same cache line as stale data, is called false sharing.
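If you want to see false sharing for yourself, here’s a minimal C++ sketch (my own illustration, not code from the Netflix post or from OpenJDK) that increments two independent counters from two threads: once with the counters adjacent in memory, and once with them padded onto separate cache lines via alignas(64):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters that sit next to each other in memory, so they will
// typically share a cache line (analogous to _secondary_super_cache
// and _secondary_supers being adjacent).
struct Adjacent {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same two counters, each aligned to its own 64-byte cache line.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
long long run_ms() {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    // Each thread touches only its own counter; there is no logical sharing.
    std::thread t1([&c] { for (int i = 0; i < 10000000; ++i) c.a++; });
    std::thread t2([&c] { for (int i = 0; i < 10000000; ++i) c.b++; });
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    std::printf("adjacent counters: %lld ms\n", run_ms<Adjacent>());
    std::printf("padded counters:   %lld ms\n", run_ms<Padded>());
}

On a typical multi-core machine the adjacent version runs noticeably slower, even though the two threads never read or write each other’s counter: every increment on one core invalidates the shared line in the other core’s cache.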

What’s the probability of false sharing in this scenario?

In this case, the two variables are both pointers. On this particular CPU architecture, pointers are 64 bits, or 8 bytes. The L1 cache line size is 64 bytes. That means a cache line can store 8 pointers. Or, put another way, a pointer can occupy one of 8 positions in the cache line.

There’s only one scenario where the two variables don’t end up on the same cache line: when _secondary_super_cache occupies position 8 (the last slot of its cache line) and _secondary_supers occupies position 1 of the next line. In all of the other scenarios, the two variables will occupy the same cache line, and hence will be vulnerable to false sharing.

1 / 8 = 12.5%, and that’s roughly the fraction of nodes that were observed in the low band in this scenario.
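As a sanity check on that 1-in-8 figure, here’s a tiny C++ sketch (again, my own illustration) that enumerates the eight 8-byte-aligned positions the first pointer could occupy within a 64-byte line and counts how many of them leave the two adjacent pointers on the same line:

#include <cstdio>

int main() {
    const int kLineSize = 64;  // bytes per L1 cache line
    const int kPtrSize  = 8;   // bytes per pointer

    int same_line = 0, total = 0;
    // Try each 8-byte-aligned offset the first pointer could start at.
    for (int offset = 0; offset < kLineSize; offset += kPtrSize) {
        int first_line  = offset / kLineSize;              // always 0
        int second_line = (offset + kPtrSize) / kLineSize; // 1 only when offset == 56
        ++total;
        if (first_line == second_line) ++same_line;
    }
    std::printf("%d of %d layouts share a cache line (%.1f%% do not)\n",
                same_line, total, 100.0 * (total - same_line) / total);
}

Running it prints 7 of 8 layouts sharing a line, i.e., 12.5% do not.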

And now I recommend you take another look at the original blog post, which has a lot more details, including how they solved this problem, as well as a new problem that emerged once they fixed this one.

Writing docs well: why should a software engineer care?

Recently I gave a guest lecture in a graduate level software engineering course on the value of technical writing for software engineers. This post is a sort of rough transcript of my talk.

I live-sketched my slides as I went.

I talked about three goals of doing technical writing.

The first one is about building shared understanding among the stakeholders of a document. One of the hardest problems in software engineering is getting multiple people to have a sufficient understanding of some technical aspect, like the actual problem being solved, or a proposed solution. This is ostensibly the only real goal of technical writing.

Shared understanding is related to the idea of common ground that you’ll sometimes hear the safety folks talk about.

If you’re a programmer who works completely alone, then this is a problem you generally don’t have to solve, because there’s only one person involved in the software project.

But as soon as you are working in a team, then you have to address the problem of shared understanding.

When we work on something technical, like software, we develop a much deeper understanding because we’re immersed in it. This can make communication hard when we’re talking to someone who hasn’t been working in the same area and so doesn’t have the same level of technical understanding of that particular bit.

If you’re working only with a small, co-located group (e.g., in a co-located startup), then having a discussion in front of a whiteboard is a very effective mechanism for building shared understanding. In this type of environment, writing effective technical docs is much less important.

The problem with the discuss-in-front-of-the-whiteboard approach is that it doesn’t scale up, and it also doesn’t work for distributed environments.

And this is where technical documents come in.

I like to say that the hardest problem in software engineering is getting the appropriate information into the heads of the people who need to have that information in order to do their work effectively.

In large organizations, a lot of the work is interconnected, which means that some work that somebody else is doing can affect your work. If you’re not aware of that, you can end up working at cross-purposes.

The challenge is that there’s so much potential information that might be useful. Everyone could potentially spend all of their working hours reading docs, and still not read everything that might be relevant.

Writing a doc well means getting people to a sufficient level of understanding so that you can coordinate work effectively.

The second goal of writing I talked about was using writing to help with your own thinking.

The cartoonist Richard Guindon has a famous quote: “writing is nature’s way of letting you know how sloppy your thinking is.” You might have an impression that you understand something well, but that sense of clarity is often an illusion, and when you go to explicitly capture your understanding in a document, you discover that you didn’t understand things as well as you thought. There’s nowhere to hide in your own document.

When writing technical docs, I always try hard to work explicitly through examples to demonstrate the concepts. One of the biggest weaknesses I see in practice is that the author has not described a scenario from start to finish. Conceptually, you want your doc to have something like the storyboards used in the film industry, to tell the story. Writing out a complete example will force you to confront the gaps in your understanding.

The third goal is a bit subversive: it’s how to use effective technical writing to have influence in a larger organization when you’re at the bottom of the hierarchy.

If you want influence, you likely have some sort of vision of where you want the broader organization to go, and the challenge is to persuade people of influence to move things closer to your vision.

Because senior leadership, like everyone else in the organization, only has a finite amount of time and attention, their view of reality is shaped by the interactions they do have, which are largely meetings and documents. Effective technical documents shape the view of reality that leadership has, but only if they’re written well.

If you frame things right, you can make it seem as if your view is reality rather than simply your opinion. But this requires skill.

Software engineers often struggle to write effective docs. And that’s understandable, because writing effective technical docs is very difficult.

Anyone who has sat down at a computer to write a doc and stared at the blinking cursor in an empty document knows how difficult it can be to just get started.

Even the best-written technical docs aren’t necessarily easy to read.

Poorly written docs are hard to read. However, just because a doc is hard to read, doesn’t mean it’s poorly written!

This talk is about technical writing, but technical reading is also a skill. Often, we can’t understand a paragraph in a technical document without having a good grasp of the surrounding context. But we also can’t understand the context without reading the individual paragraphs, not only of this document, but of other documents as well!

This means we often can’t understand a technical document by reading from beginning to end. We need to move back and forth between working to understand the text itself and working to understand the wider context. This pattern is known as the hermeneutic circle, and it is used in Biblical studies.

Finally, some pieces of advice on how to improve your technical writing.

Know explicitly in advance what your goal is in doing the writing. Writing to improve your own understanding is different from writing to improve someone else’s understanding, or to persuade someone else.

Make sure your technical document has concrete examples. These are the hardest to write, but they are most likely to help achieve your goals in your document.

Get feedback on your drafts from people that you trust. Even the best writers in the world benefit from having good editors.

Time keeps on slippin’: A review of “Four thousand weeks”

Four Thousand Weeks: Time Management for Mortals by Oliver Burkeman

Burkeman isn’t interested in helping you get more done. The problem, he says, is that attempting to be more productive is a trap. Instead, what he advocates is that you change your perspective to use your time *well*, rather than trying to get as much done as possible.

This is really an anti-productivity book, and a fantastic one at that. Burkeman urges us to embrace the fact that we only have a limited amount of time (“four thousand weeks” is an allusion to the average lifespan), and that we should embrace this limit rather than try to fight against it.

Holding yourself to impossible standards is a recipe for misery, he reminds us, whether it’s trying to complete all of the items on our todo lists or trying to be the person we ought to be rather than looking at who we actually are: what our actual strengths and weaknesses are, and what we genuinely enjoy doing.

The time management skills that Burkeman encourages are the ones that will reduce the amount of time pressure that we experience. Learn how to say “no” to the stuff that you want to do, but that you want to do less than the other stuff. Learn to make peace with the fact that you will always feel overwhelmed.

“Let go”, Burkeman urges us. After all, in the grand scheme of things, the work that we do doesn’t matter nearly as much as we think.

(Cross-posted to goodreads)