The danger of hidden functional roles

There’s a collection of friends that I have a standing videochat with every couple of weeks. We had been meeting at 8am, but several people developed conflicts at that time, including me. I have a teenager that starts school at 8am, and I’m responsible for getting them to school in the morning (I like to leave the house around 7:40am), which prevented me from participating.

As a group, we decided to reschedule the chat to 7am. This works well for me, because I get up at 6. Today was the first day meeting at the new scheduled time. I got up as I normally do, and was sure to be quiet so as not to wake my wife, Stacy; she sleeps later than I do, but she gets up early enough to rouse our kids for school. I even closed the bedroom door so that any noise I made from the videochat wouldn’t disturb her.

I was on the videochat, taking part in the conversation in hushed tones, when I looked over at the time. I saw it was 7:25am, which is about fifteen minutes before I start getting ready to leave the house. Usually, the rest of the household is up, showering, eating breakfast. But I hadn’t heard a peep from anyone. I went upstairs to discover that nobody else had gotten up yet.

It turns out that my typical morning routine was acting as a natural alarm clock for Stacy. My alarm goes off at 6am every weekday, and I get up, but Stacy stays in bed. However, the noises from my normal morning routine are what rouse her, typically around 7am. Today, I was careful to be very quiet, and so she didn’t wake up. I didn’t know that I was functioning as an alarm clock for her! That’s why I was careful to be quiet, and why I didn’t even think to mention the new videochat time to her.

I suspect this failure mode is more common than we realize: there is a process inside a system, and over time the process comes to fulfill some unintended, ancillary functional role, and there are people who participate in this process that aren’t even aware of this function.

As an example, consider Chaos Monkey. Chaos Monkey’s intended function is to ensure that engineers design their services to withstand a virtual machine instance failing unexpectedly, by increasing their exposure to this failure mode. But Chaos Monkey also has the unintended effect of recycling instances. For teams that deploy very infrequently, their service might exhibit problems with long-lived instances that they never notice because Chaos Monkey tends to terminate instances before they hit those problems. Now imagine declaring an extended period of time where, in the interest of reducing risk(!), no new code is deployed, and Chaos Monkey is disabled.
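Here is a toy simulation of that instance-recycling side effect. It is a sketch under made-up assumptions, not a model of how Chaos Monkey actually schedules terminations: the function name, the fleet size, and the per-day termination probability are all hypothetical. Each day every instance gets a day older, and then each one is terminated and replaced with some probability.

```python
import random


def max_instance_age(days, fleet_size, daily_kill_probability, seed=0):
    """Return the oldest instance age (in days) observed during the run.

    Hypothetical model: every instance ages one day, then each instance is
    independently terminated with `daily_kill_probability` and replaced by
    a fresh instance whose age is 0.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    ages = [0] * fleet_size
    oldest_seen = 0
    for _ in range(days):
        ages = [age + 1 for age in ages]
        oldest_seen = max(oldest_seen, max(ages))
        ages = [0 if rng.random() < daily_kill_probability else age for age in ages]
    return oldest_seen


# A 90-day deploy freeze, with and without random terminations.
with_chaos = max_instance_age(days=90, fleet_size=20, daily_kill_probability=0.1)
without_chaos = max_instance_age(days=90, fleet_size=20, daily_kill_probability=0.0)
print("oldest instance with terminations:   ", with_chaos)
print("oldest instance without terminations:", without_chaos)
```

With terminations switched off, the oldest instance is as old as the freeze itself (90 days here), so any problem that only shows up on long-lived instances finally has room to appear.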

When you turn something off, you never know what might break. In some cases, nobody in the system knows.

Plus c’est la même chose, plus ça change

I’m re-reading David Woods’s paper titled The theory of graceful extensibility: basic rules that govern adaptive systems. The paper proposes a theory to explain how certain types of systems are able to adapt over and over again to changes in their environment. He calls this phenomenon sustained adaptability, which he contrasts with systems that can initially adapt to an environment but later collapse when some feature of the environment changes and they fail to adapt to the new change.

Woods outlines six requirements that any explanatory theory of sustained adaptability must have. Here’s the fourth one (emphasis in the original):

Fourth, a candidate theory needs to provide a positive means for a unit at any scale to adjust how it adapts in the pursuit of improved fitness (how it is well matched to its environment), as changes and challenges continue apace. And this capability must be centered on the limits and perspective of that unit at that scale.

The phrase adjust how it adapts really struck me. Since adaptation is a type of change, this is referring to a second-order change process: these adaptive units have the ability to change the mechanism by which they change themselves! This notion reminded me of Chris Argyris’s idea of double-loop learning.

Woods’s goal is to determine what properties a system must have, what type of architecture it needs, in order to achieve this second-order change process. He outlines in the paper that any such system must be a layered architecture of units that can adapt themselves and coordinate with each other, which he calls a tangled, layered network.

Woods believes there are properties that are fundamental to systems that exhibit sustained adaptability, which implies that these fundamental properties don’t change! A tangled, layered network may reconfigure itself in all sorts of different ways over time, but it must still be tangled and layered (and maintain other properties as well).

The more such systems stay the same, the more they change.

Adapting to a crunch: the Mask Match story

I just got back from Strange Loop, and my favorite talk was Tech When the Sky is Falling: Tools for Crisis Response by Emma Ferguson and Colin Schimmelfing. I’m going to use this talk to illustrate one of the ideas in David Woods’s theory of graceful extensibility. The idea is that a system needs to deploy, mobilize, or generate capacity when it is at risk of saturation.

My silly doodle of the speakers

Back in March 2020, frontline hospital workers dealing with COVID-19 patients were running short on N95 face masks. Hospitals simply didn’t have enough masks to supply their workers. This shortage of masks is a great example of what Woods calls a crunch, where a system runs short on some resource that it needs. When a system is crunched like this, it needs to adapt. It has to make some sort of change in order to get more of that resource so that it can function properly.

Woods lists three methods for getting more of a resource. If you’ve prepared in advance by stockpiling resources, you can deploy those stockpiles. If you don’t have those extra resources on hand, but your larger network has resources to spare, you can mobilize your network to access those resources. Finally, if you can’t tap into your network to get those resources, your only option is to generate the resources you need. In order to generate resources, you need access to raw materials, and then you need to do work to transform those raw materials into the right resources.
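The three methods read like a fallback chain, which can be sketched as a small decision function. This is only an illustration of the ordering described above, not something taken from Woods's paper: the names `Response`, `choose_response`, and `reachable_network_spare` are hypothetical, and the numbers are invented.

```python
from enum import Enum


class Response(Enum):
    DEPLOY = "deploy your own stockpile"
    MOBILIZE = "mobilize spare resources from your network"
    GENERATE = "generate new resources from raw materials"


def choose_response(shortfall, stockpile, reachable_network_spare):
    """Return the first of the three methods that can cover the shortfall.

    All arguments are counts of the same resource. `reachable_network_spare`
    counts only what you can actually get at, not everything that exists
    somewhere in the network.
    """
    if stockpile >= shortfall:
        return Response.DEPLOY
    if stockpile + reachable_network_spare >= shortfall:
        return Response.MOBILIZE
    return Response.GENERATE


# Invented numbers, one case per branch.
print(choose_response(shortfall=100, stockpile=500, reachable_network_spare=0).value)
print(choose_response(shortfall=100, stockpile=10, reachable_network_spare=200).value)
print(choose_response(shortfall=100, stockpile=10, reachable_network_spare=0).value)
```

The distinction that matters is between resources that exist in the network and resources you can reach: if the spare capacity is out there but blocked, the function falls through to generating.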

In the case of the mask shortages, the hospitals did not have sufficient stockpiles of N95 masks on hand, so deploying wasn’t an option. It turns out that there were many American households that happened to have N95 masks sitting in storage, and many of those households were willing to donate these unused masks to healthcare workers. In theory, hospitals could mobilize this network of volunteers in order to get these masks to the frontline workers.

There was a problem, though: hospital administrators refused to accept donated N95 masks because of liability concerns. So, this wasn’t something the hospitals were going to do.

Workers wanted masks, and people wanted to donate, but hospital admins wouldn’t let them

Fortunately, there was a loophole: frontline workers could bring in their own masks. Now, the problem to be solved was: how do you get masks from donors who had masks to the workers who wanted them?

Emma and Colin needed to generate a new capability: a mechanism for matching up the donors with the healthcare workers. The raw materials that they initially used to generate this capability were Google Sheets and Gmail for coordinating among the volunteers.

And it worked! However, they quickly ran into a new risk of saturation. Google Sheets has a limit of 50 concurrent editors, and Gmail limits an email account to a maximum of 500 emails per day. And so, once again, the team had to generate a new capability that would scale beyond what Google Sheets and Gmail were capable of. They ended up building a system called Mask Match, by writing a Flask app that they deployed on Heroku, and using Mailgun for sending the emails.
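To make the saturation concrete, here is a small sketch that checks a workload against those two limits. The limits are the figures quoted above and may well have changed since; the function names and the example workloads are hypothetical, not numbers from the talk.

```python
import math

# Limits as quoted above; treat them as illustrative rather than current.
SHEETS_MAX_CONCURRENT_EDITORS = 50
GMAIL_MAX_EMAILS_PER_DAY = 500


def saturated_limits(concurrent_editors, emails_per_day):
    """Return the names of the limits that the given workload exceeds."""
    exceeded = []
    if concurrent_editors > SHEETS_MAX_CONCURRENT_EDITORS:
        exceeded.append("sheets_concurrent_editors")
    if emails_per_day > GMAIL_MAX_EMAILS_PER_DAY:
        exceeded.append("gmail_emails_per_day")
    return exceeded


def days_to_send(total_emails):
    """Days needed to send a backlog from a single account at the daily cap."""
    return math.ceil(total_emails / GMAIL_MAX_EMAILS_PER_DAY)


# Hypothetical workloads.
print(saturated_limits(concurrent_editors=12, emails_per_day=300))   # []
print(saturated_limits(concurrent_editors=80, emails_per_day=2000))  # both limits exceeded
print(days_to_send(2000))                                            # 4
```

Under these assumed numbers, 2,000 matching emails would take four days to send from one account, which is a long time in the middle of a crisis.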

My favorite part of this talk was when Emma Ferguson mentioned that they originally just wanted to pay Google in order to get the Google Sheets and Gmail limits increased (their GoFundMe campaign was quite successful, so getting access to money wasn’t a problem for them). However, they couldn’t figure out how to actually pay Google for a limit increase! This is a wonderful example of what Woods calls brittleness, where a system is unable to extend itself when it reaches its limits. Google is great at building robust systems, but their ethos of removing humans from the loop means that it’s more difficult for consumers of Google services to adapt them to unexpected, emergency scenarios.

The strange beauty of strange loop failure modes

As I’ve posted about previously, at my day job, I work on a project called Managed Delivery. When I first joined the team, I was a little horrified to learn that the service that powers Managed Delivery deploys itself using Managed Delivery.

“How dangerous!”, I thought. What if we push out a change that breaks Managed Delivery? How will we recover? However, after having been on the team for over a year now, I have a newfound appreciation for this approach.

Yes, sometimes there’s something that breaks, and that makes it harder to roll back, because Managed Delivery provides the main functionality for easy rollback. However, it also means that the team gets quite a bit of practice at bypassing Managed Delivery when something goes wrong. They know how to disable Managed Delivery and use the traditional Spinnaker UI to deploy an older version. They know how to poke and prod at the database if the Managed Delivery UI doesn’t respond properly.

These strange loop failure modes are real: if Managed Delivery breaks, we may lose out on the functionality of Managed Delivery to help us recover. But it also means that we’re more ready for handling the situation if something with Managed Delivery goes awry. Yes, Managed Delivery depends on itself, and that’s odd. But we have experience with how to handle things when this strange loop dependency creates a problem. And that is a valuable thing.

Live-drawing my slides during a talk

The other day, I gave an internal talk, and I tried an experiment. Using my iPad and the GoodNotes app, I drew all of my slides while I was talking (except the first slide, which I drew in advance).

“What font is that?” someone asked. It’s my handwriting

I’ve always been in awe of people who can draw; I’ve never been good at it.

“Where’s the bug”, it says. Not my best handwriting

Over the years, I’ve tried doodling more. I was influenced by Dan Roam’s books, Julia Evans’s zines, sketchnotes, and most recently, Christina Wodtke’s Pencil Me In.

The words have stink lines, so you know they’re bad

If you’ve read my blog before, you’ve seen some of my previous doodles (e.g., Root cause of failure, root cause of success or Taming complexity: from contract to compact).

We need to complete the action items so it never happens again

When I was asked to present to a team, I wanted to use my drawings rather than do traditional slides. I actually hate using tools like PowerPoint and Google Slides to do presentations. Typically I use Deckset, but in this case, I wanted to do them all drawn.

A different perspective on incidents

I started off by drawing out my slides in advance. But then I thought, “instead of showing pre-drawn slides, why don’t I draw the slides as I talk? That way, people will know where to look because they’ll look at where I’m drawing.”

I still had to prepare the presentation in advance. I drew all of the slides beforehand. And then I printed them out and had them in front of me so that I could re-draw them during the talk. Since it was done over Zoom, people couldn’t actually see that I was working from the print-outs (although they might have heard the paper rustling).

Contributing factors aren’t like root cause

One benefit of this technique was that it made it easier to answer questions, because I could draw out my answer. When I was writing the text at the top, somebody asked, “Is that something like a root cause chain?” I drew the boxes and arrows in response, to explain how this isn’t chain-like, but instead is more like a web.

The selected images above should give you a sense of what my slides looked like. I had fun doing the presentation, and I’d try this approach again. It was certainly more enjoyable than futzing with slide layout.

Useful knowledge and improvisation

Eric Dobbs recently retold a story on Twitter (a copy is on his wiki) about one of his former New Relic colleagues, Nicholas Valler.

At the time, Nicholas was new to the company. He had just discovered a security vulnerability, and then (unrelated to that security vulnerability), an incident happened and, well, I encourage you to read the whole story first, and then come back to this post.

In the end, the engineers were able to leverage the security vulnerability to help resolve the incident. As is my wont, I made a snarky comment.

But I did want to make a more serious comment about what this story illustrates. In a narrow sense, this security vulnerability helped the New Relic engineers remediate before there was severe impact. But in a broader sense, the following aspects helped them remediate:

  • they had useful knowledge of some aspect of the system (port 22 was open to the world)
  • they could leverage that knowledge to improvise a solution (they could use this security hole to log in and make changes to the Kafka configuration)

The irony here is that it was a new employee that had the useful knowledge. Typically, it’s the tenured engineers who have this sort of knowledge, as they’ve accumulated it with experience. In this case, the engineer discovered this knowledge right before it was needed. That’s what makes this such a great story!

I do think that how Nicholas found it, by “poking around”, is a behavior that comes with general experience, even though he didn’t have much experience at the company.

But being in possession of useful knowledge isn’t enough. You also need to be able to recognize when the knowledge is useful and bring it to bear.

These two attributes, having useful knowledge about the system and the ability to apply that knowledge to improvise a solution, are critical for being able to deal effectively with incidents. Applying them is resilience in action.

It’s not a focus of this particular story, but, in general, this sort of knowledge is distributed across individuals. This means that it’s the ad-hoc team that forms during an incident that needs to possess these attributes.

Remembering the important bits when you need them

I’m working my way through the Cambridge Handbook of Expertise and Expert Performance, which is a collection of essays from academic researchers who study expertise.

Chapter 6 discusses the ability of experts to recall information that’s relevant to the task at hand. This is one of the differences between experts and novices: a novice might answer questions about a subject correctly on a test, but when faced with a real problem that requires that knowledge, they aren’t able to retrieve it.

The researchers K. Anders Ericsson and Walter Kintsch had an interesting theory about how experts do better at this than novices. The theory goes like this: when an expert encounters some new bit of information, they have the ability to encode that information into their long-term memory in association with a collection of cues of when that information would be relevant.

In other words, experts are able to predict the context when that information might be relevant in the future, and are able to use that contextual information as a kind of key that they can use to retrieve the information later on.

Now, think about reading an incident write-up. You might learn about a novel failure mode in some subsystem your company uses (say, a database), as well as the details that led up to it happening, including some of the weird, anomalous signals that were seen earlier on. If you have expertise in operations, you’ll encode information about the failure mode into your long-term memory and associate it with the symptoms. So, the next time you see those symptoms in production, you’ll remember this failure mode.

This will only work if the incident write-up has enough detail to provide you with the cues that you need to encode in your memory. This is another reason to provide a rich description of the incident: the people reading it, if they’re good at operations, will encode the details of the failure mode into their memory. If it happens again, and they’ve read the write-up, they’ll remember.

Root cause of failure, root cause of success

Here are a couple of tweets from John Allspaw.

Succeeding at a project in an organization is like pushing a boulder up a hill that is too heavy for any single person to lift.

A team working together to successfully move a boulder to the top of the hill

It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.

Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.

The boulder made it to the top!

Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.

In comes the ball, down goes the boulder

This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.

In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete. Because what’s keeping the complex system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people that are constantly watching the boulder to see if it starts to slip, and applying force to keep it balanced.

A collection of people watching the boulder and pushing on it to keep it from falling

Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.

Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.
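A bit of arithmetic shows how much that redundancy buys. This is a sketch under simplifying assumptions of my own: a team of ten, each person independently attentive 90% of the time, and a boulder that stays up as long as some minimum number of them are attentive. Real teams aren't independent coin flips, so read the numbers as an illustration only. The function name and parameters are hypothetical.

```python
from math import comb


def prob_boulder_stays_up(team_size, needed, p_attentive):
    """Probability that at least `needed` of `team_size` people are attentive.

    Assumes each person is attentive independently with probability
    `p_attentive` (a binomial model, chosen purely for illustration).
    """
    return sum(
        comb(team_size, k) * p_attentive**k * (1 - p_attentive) ** (team_size - k)
        for k in range(needed, team_size + 1)
    )


everyone = prob_boulder_stays_up(team_size=10, needed=10, p_attentive=0.9)
with_slack = prob_boulder_stays_up(team_size=10, needed=7, p_attentive=0.9)
print(f"everyone required: {everyone:.2f}")
print(f"7 of 10 required:  {with_slack:.2f}")
```

Under these assumptions, requiring all ten people to be attentive at once works only about 35% of the time, while tolerating up to three lapses works about 99% of the time.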

But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.

When the boulder falls, it means that the collection of processes wasn’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keeps the boulder up.

This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and it failing alone couldn’t have brought down the system.

What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.

Burned by ‘let it burn’

Here are some excerpts from a story from the L.A. Times, with the headline: Forest Service changes ‘let it burn’ policy following criticism from western politicians (emphasis mine)

Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy this year to better prioritize resources and help prevent small blazes from growing into uncontrollable conflagrations.

The [Tamarack] fire began as a July 4 lightning strike on a single tree in the Mokelumne Wilderness, a rugged area southeast of Sacramento. Forest officials decided to monitor it rather than attempt to put it out, a decision a spokeswoman said was based on scant resources and the remote location. But the blaze continued to grow, eventually consuming nearly 69,000 acres, destroying homes and causing mass evacuations. It is now 82% contained.

Instead of letting some naturally caused small blazes burn, the agency’s priorities will shift this year, U.S. Forest Service Chief Randy Moore indicated to the staff in a letter Monday. The focus, he said, will be on firefighter and public safety.

The U.S. Forest Service had to make a call about whether to put out a fire or to monitor it and let it burn out. In this case, they decided to monitor it, and the fire grew out of control.

Now, imagine an alternate universe where the Forest Service spent some of its scant resources on putting out this fire, and then another fire popped up somewhere else, and they didn’t have the resources to fight that one effectively, and it went out of control. The news coverage would, undoubtedly, be equally unkind.

Practitioners often must make risk trade-offs in the moment, when there is a high amount of uncertainty. What was the risk that the fire would grow out of control? How does it stack up against the risk of being short-staffed if you send out firefighters to put out a small fire and a large one breaks out elsewhere?

Towards the middle of the piece, the article goes into some detail about the issue of limited resources.

[Agriculture Secretary Tom] Vilsack promised more federal aid and cooperation for California’s plight, acknowledging concerns about past practices while also stressing that, with dozens of fires burning across the West and months to go in a prolonged fire season, there are not enough resources to put them all out.

“Candidly I think it’s fair to say, over the generations, over the decades, we have tried to do this job on the cheap,” Vilsack said. “We’ve tried to get by, a little bit here, a little bit there, a little forest management over here, a little fire suppression over here. But the reality is this has caught up with us, which is why we have an extraordinary number of catastrophic fires and why we have to significantly beef up our capacity.”

Vilsack said that the bipartisan infrastructure bill working its way through Congress would provide some of those resources but that ultimately it would take “billions” of dollars and years of catch-up to create fire-resilient forests.

The U.S. Forest Service’s policy on allowing unplanned wildfires to burn differs from that of the California Department of Forestry and Fire Protection, and I’m not a domain expert, so I don’t have an informed opinion. But this isn’t just a story about policy, it’s a story about saturation. It’s also about what’s allowed (and not allowed) to count as a cause.

Controlling a process we don’t understand

I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist and has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.

As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I actually think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. These include spaceflight, aviation, maritime, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.

We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.

Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.