People like to put dollar values on outages. There’s no true answer to the question of how much an outage costs an organization. If your company is transaction-based, you can estimate how many transactions were missed, but there are all sorts of other factors that you could decide to model. (Are those transactions really lost, or will people come back later? Does it impact the public’s perception of the organization? What if your business isn’t transaction-based?). If you ask John Allspaw, he’ll tell you that incidents can provide benefits to an organization in addition to costs.
Putting all of that aside for now, one question around incident cost that I think is interesting is the perceived cost within the organization. How costly does leadership feel that this incident was?
Here’s a proposed approach to try to capture this. After an incident, go to different people in leadership, and ask them the following question:
Imagine I could wave a magic wand, and it would alter history to undo the incident: it will be as if the incident never happened. However, I’ll only do this if you pay me money out of your org’s budget. How much are you willing to pay me to wave the wand?
I think this would be an interesting way to convey to the organization how leadership perceived the incident.
There’s a collection of friends that I have a standing videochat with every couple of weeks. We had been meeting at 8am, but several people developed conflicts at that time, including me. I have a teenager that starts school at 8am, and I’m responsible for getting them to school in the morning (I like to leave the house around 7:40am), which prevented me from participating.
As a group, we decided to reschedule the chat to 7am. This works well for me, because I get up at 6. Today was the first day meeting at the new scheduled time. I got up as I normally do, and was sure to be quiet so as not to wake my wife, Stacy; she sleeps later than I do, but she gets up early enough to rouse our kids for school. I even closed the bedroom door so that any noise I made from the videochat wouldn’t disturb her.
I was on the videochat, taking part in the conversation in hushed tones, when I looked over at the time. I saw it was 7:25am, which is about fifteen minutes before I start getting ready to leave the house. Usually, the rest of the household is up, showering, eating breakfast. But I hadn’t heard a peep from anyone. I went upstairs to discover that nobody else had gotten up yet.
It turns out that my typical morning routine was acting as a natural alarm clock for Stacy. My alarm goes off at 6am every weekday, and I get up, but Stacy stays in bed. The noises from my normal morning routine are what rouse her, typically around 7am. Today, I was careful to be very quiet, and so she didn’t wake up. I didn’t know that I was functioning as an alarm clock for her! That’s why I was so careful to be quiet, and why it never even occurred to me to mention the new videochat time to her.
I suspect this failure mode is more common than we realize: a process inside a system comes, over time, to fulfill some unintended, ancillary function, and the people who participate in that process aren’t even aware of the function.
As an example, consider Chaos Monkey. Chaos Monkey’s intended function is to ensure that engineers design their services to withstand a virtual machine instance failing unexpectedly, by increasing their exposure to this failure mode. But Chaos Monkey also has the unintended effect of recycling instances. For teams that deploy very infrequently, their service might exhibit problems with long-lived instances that they never notice because Chaos Monkey tends to terminate instances before they hit those problems. Now imagine declaring an extended period of time where, in the interest of reducing risk(!), no new code is deployed, and Chaos Monkey is disabled.
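To make the instance-recycling side effect concrete, here’s a toy sketch of a Chaos-Monkey-style sweep. This is my own illustration, not Netflix’s actual implementation; the function and parameter names are hypothetical.

```python
import random

def chaos_monkey_sweep(instance_groups, terminate, probability=1.0, rng=random):
    """Pick one instance at random from each service group and terminate it.

    instance_groups: dict mapping group name -> list of instance ids.
    terminate: callback invoked with each chosen instance id.
    probability: chance that a given group gets a termination this sweep.
    """
    killed = []
    for group, instances in instance_groups.items():
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            terminate(victim)
            killed.append(victim)
    return killed
```

The intended function is the `terminate` callback forcing teams to handle failure; the unintended function is that, run regularly, no instance in a swept group lives very long. Disable the sweep, and instances quietly start aging past limits nobody knew existed.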
When you turn something off, you never know what might break. In some cases, nobody in the system knows.
I’m re-reading a David Woods paper titled the theory of graceful extensibility: basic rules that govern adaptive systems. The paper proposes a theory to explain how certain types of systems are able to adapt over and over again to changes in their environment. He calls this phenomenon sustained adaptability, which he contrasts with systems that can initially adapt to an environment but later collapse when some feature of the environment changes and they fail to adapt to the new change.
Woods outlines six requirements that any explanatory theory of sustained adaptability must have. Here’s the fourth one (emphasis in the original):
Fourth, a candidate theory needs to provide a positive means for a unit at any scale to adjust how it adapts in the pursuit of improved fitness (how it is well matched to its environment), as changes and challenges continue apace. And this capability must be centered on the limits and perspective of that unit at that scale.
The phrase “adjust how it adapts” really struck me. Since adaptation is a type of change, this is referring to a second-order change process: these adaptive units have the ability to change the mechanism by which they change themselves! This notion reminded me of Chris Argyris’s idea of double-loop learning.
Woods’s goal is to determine what properties a system must have, what type of architecture it needs, in order to achieve this second-order change process. He outlines in the paper that any such system must be a layered architecture of units that can adapt themselves and coordinate with each other, which he calls a tangled, layered network.
Woods believes there are properties that are fundamental to systems that exhibit sustained adaptability, which implies that these fundamental properties don’t change! A tangled, layered network may reconfigure itself in all sorts of different ways over time, but it must still be tangled and layered (and maintain other properties as well).
The more such systems stay the same, the more they change.
Back in March 2020, frontline hospital workers dealing with COVID-19 patients were running short on N95 face masks. Hospitals simply didn’t have enough masks to supply their workers. This shortage of masks is a great example of what Woods calls a crunch, where a system runs short on some resource that it needs. When a system is crunched like this, it needs to adapt. It has to make some sort of change in order to get more of that resource so that it can function properly.
Woods lists three methods for getting more of a resource. If you’ve prepared in advance by stockpiling resources, you can deploy those stockpiles. If you don’t have those extra resources on hand, but your larger network has resources to spare, you can mobilize your network to access those resources. Finally, if you can’t tap into your network to get those resources, your only option is to generate the resources you need. In order to generate resources, you need access to raw materials, and then you need to do work to transform those raw materials into the right resources.
In the case of the mask shortages, the hospitals did not have sufficient stockpiles of N95 masks on hand, so deploying wasn’t an option. It turns out that there were many American households that happened to have N95 masks sitting in storage, and many of those households were willing to donate these unused masks to healthcare workers. In theory, hospitals could mobilize this network of volunteers in order to get these masks to the frontline workers.
There was a problem, though: hospital administrators refused to accept donated N95 masks because of liability concerns. So, this wasn’t something the hospitals were going to do.
Fortunately, there was a loophole: frontline workers could bring in their own masks. Now, the problem to be solved was: how do you get masks from donors who had masks to the workers who wanted them?
Emma and Colin needed to generate a new capability: a mechanism for matching up the donors with the healthcare workers. The raw materials that they initially used to generate this capability were Google Sheets and Gmail for coordinating among the volunteers.
And it worked! However, they quickly ran into a new risk of saturation. Google Sheets has a limit of 50 concurrent editors, and Gmail limits an email account to a maximum of 500 emails per day. And so, once again, the team had to generate a new capability that would scale beyond what Google Sheets and Gmail were capable of. They ended up building a system called Mask Match, by writing a Flask app that they deployed on Heroku, and using Mailgun for sending the emails.
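The core of the capability they generated is a simple matching problem. Here’s a toy sketch of that idea in plain Python; the names and structure are mine, not Mask Match’s actual code, which isn’t described in detail in the talk.

```python
from collections import deque

def match_requests(donor_emails, worker_emails):
    """Pair each healthcare worker's request with the next available donor,
    first come, first served. Returns the matches made and the workers
    still waiting for a donor."""
    donors = deque(donor_emails)
    matched, waiting = [], []
    for worker in worker_emails:
        if donors:
            matched.append((donors.popleft(), worker))
        else:
            waiting.append(worker)
    return matched, waiting
```

The logic is trivial; the hard part, as the post describes, was everything around it: collecting the two lists, notifying the matched parties by email, and doing it all at a scale that Google Sheets and Gmail couldn’t sustain.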
My favorite part of this talk was when Emma Ferguson mentioned that they originally just wanted to pay Google in order to get the Google Sheets and Gmail limits increased (their GoFundMe campaign was quite successful, so getting access to money wasn’t a problem for them). However, they couldn’t figure out how to actually pay Google for a limit increase! This is a wonderful example of what Woods calls brittleness, where a system is unable to extend itself when it reaches its limits. Google is great at building robust systems, but their ethos of removing humans from the loop means that it’s more difficult for consumers of Google services to adapt them to unexpected, emergency scenarios.
As I’ve posted about previously, at my day job, I work on a project called Managed Delivery. When I first joined the team, I was a little horrified to learn that the service that powers Managed Delivery deploys itself using Managed Delivery.
“How dangerous!”, I thought. What if we push out a change that breaks Managed Delivery? How will we recover? However, after having been on the team for over a year now, I have a newfound appreciation for this approach.
Yes, sometimes there’s something that breaks, and that makes it harder to roll back, because Managed Delivery provides the main functionality for easy rollback. However, it also means that the team gets quite a bit of practice at bypassing Managed Delivery when something goes wrong. They know how to disable Managed Delivery and use the traditional Spinnaker UI to deploy an older version. They know how to poke and prod at the database if the Managed Delivery UI doesn’t respond properly.
These strange loop failure modes are real: if Managed Delivery breaks, we may lose out on the functionality of Managed Delivery to help us recover. But it also means that we’re more ready for handling the situation if something with Managed Delivery goes awry. Yes, Managed Delivery depends on itself, and that’s odd. But we have experience with how to handle things when this strange loop dependency creates a problem. And that is a valuable thing.
When I was asked to present to a team, I wanted to use my drawings rather than do traditional slides. I actually hate using tools like PowerPoint and Google Slides to do presentations. Typically I use Deckset, but in this case, I wanted to do them all drawn.
I started off by drawing out my slides in advance. But then I thought, “instead of showing pre-drawn slides, why don’t I draw the slides as I talk? That way, people will know where to look because they’ll look at where I’m drawing.”
I still had to prepare the presentation in advance. I drew all of the slides beforehand. And then I printed them out and had them in front of me so that I could re-draw them during the talk. Since it was done over Zoom, people couldn’t actually see that I was working from the print-outs (although they might have heard the paper rustling).
One benefit of this technique was that it made it easier to answer questions, because I could draw out my answer. When I was writing the text at the top, somebody asked, “Is that something like a root cause chain?” I drew the boxes and arrows in response, to explain how this isn’t chain-like, but instead is more like a web.
The selected images above should give you a sense of what my slides looked like. I had fun doing the presentation, and I’d try this approach again. It was certainly more enjoyable than futzing with slide layout.
The trigger was a change to a logging statement, in order to log an exception. During the incident, the engineers noticed that this change lined up with the time that the alerts fired. But, other than the timing, there wasn’t any evidence to suggest that the log statement change was problematic. The change didn’t have any apparent relationship to the symptoms they were seeing with the job runner, which was in a different part of the codebase. And so they assumed that the logging statement change was innocuous.
As it happened, there was a coupling between that log statement and the job runner. Unfortunately for the engineers, this coupling was effectively invisible to them. The connection between the logging statement and the job runner was Mailchimp’s log processing pipeline. Here’s an excerpt from the writeup (emphasis mine):
Our log processing pipeline does a bit of normalization to ensure that logs are formatted consistently; a quirk of this processing code meant that trying to log a PHP object that is Iterable would result in that object’s iterator methods being invoked (for example, to normalize the log format of an Array).
Normally, this is an innocuous behavior—but in our case, the harmless logging change that had shipped at the start of the incident was attempting to log PHP exception objects. Since they were occurring during job execution, these exceptions held a stacktrace that included the method the job runner uses to claim jobs for execution (“locking”)—meaning that each time one of these exceptions made it into the logs, the logging pipeline itself was invoking the job runner’s methods and locking jobs that would never be actually run!
Fortunately, there were engineers who had experience with this failure mode before:
Since the whole company had visibility into our progress on the incident, a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.
Having identified the cause, we quickly reverted the not-so-harmless logging change, and our systems very quickly returned to normal.
In the moment, the engineers could not conceive of how a change in the behavior of the job runner could be caused by the modification of a log statement in an unrelated part of the code. It was literally unthinkable to them.
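To make the mechanism concrete, here’s a toy sketch in Python (standing in for Mailchimp’s PHP; all class and function names here are hypothetical, not taken from the writeup) of how a log normalizer that eagerly expands iterables can end up invoking side-effecting code:

```python
class JobClaimingException(Exception):
    """An exception object that, like the PHP objects in the incident,
    is iterable -- and whose iterator has a side effect."""
    def __init__(self, job_ids, claim_job):
        super().__init__("job failed")
        self.job_ids = job_ids
        self._claim_job = claim_job

    def __iter__(self):
        # Iterating yields something log-friendly, but as a side effect it
        # claims (locks) jobs, just as the stacktrace methods did.
        for job_id in self.job_ids:
            self._claim_job(job_id)
            yield job_id

def normalize_for_log(value):
    """The pipeline's quirk: any iterable gets expanded into a list so
    that logs are formatted consistently."""
    if isinstance(value, str):
        return value
    try:
        return [normalize_for_log(v) for v in value]
    except TypeError:
        return repr(value)
```

Merely formatting the exception for the log claims the jobs: the locking happens in the logging pipeline, far from the job runner’s own code, which is exactly what made the coupling so hard to see.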
The clip below is from a lecture from 2008(?) that then-Google CEO Eric Schmidt gave to a Stanford class.
Here’s a transcript, emphasis mine.
When I was at Novell, I had learned that there were people who I call “glue people”. The glue people are incredibly nice people who sit at interstitial boundaries between groups, and they assist in activity. And they are very, very loyal, and people love them, and you don’t need them at all.
At Novell, I kept trying to get rid of these glue people, because they were getting in the way, because they slowed everything down. And every time I get rid of them in one group, they’d show up in another group, and they’d transfer, and get rehired and all that.
I was telling Larry [Page] and Sergey [Brin] this one day, and Larry said, “I don’t understand what your problem is. Why don’t we just review all of the hiring?” And I said, “What?” And Larry said, “Yeah, let’s just review all the hiring.” And I said, “Really?” He said, “Yes”.
So, guess what? From that moment on, we reviewed every offer packet, literally every one. And anybody who smelled or looked like a glue person, plus the people that Larry and Sergey thought had backgrounds that I liked that they didn’t, would all be fired.
I first watched this lecture years ago, but Schmidt’s expressed contempt for the nice and loyal but useless glue people just got lodged in my brain, and I’ve never forgotten it. For some reason, this tweet about Google’s various messaging services sparked my memory about it, hence this post.
I’m enjoying Marianne Bellotti’s book Kill It With Fire, which is a kind of guidebook for software modernization projects (think: migrating legacy systems). In Chapter Five, she talks about the importance of momentum for success, and how a crisis can be a valuable tool for creating a sense of urgency. This is the passage that really resonated with me (emphasis in the original):
Occasionally, I went as far as looking for a crisis to draw attention to. This usually didn’t require too much effort. Any system more than five years old will have at least a couple major things wrong with it. It didn’t mean lying, and it didn’t mean injecting problems where they didn’t exist. Instead, it was a matter of storytelling—taking something that was unreported and highlighting its potential risks. These problems were problems, and my analysis of their potential impact was always truthful, but some of them could have easily stayed buried for months or years without triggering a single incident.
Kill It With Fire, p88
This is a great explanation of how describing a problem is a form of power in an organization. Bellotti demonstrates how, by telling a story, she was able to make problems real for an organization, even to the point of creating a crisis. And a crisis receives attention and resources. Crises get resolved.
It’s also a great example of the importance of storytelling in technical organizations. Tell a good story, and you can make things happen. It’s a skill that’s worth getting better at.
There’s a wonderful book by the political philosopher Marshall Berman called All That is Solid Melts Into Air. The subtitle of the book is the experience of modernity, and, indeed, the book tries to capture the feeling of what it is like to live in the modern period, as illustrated through the writings of famous modernist authors, both fiction and non.
Berman demonstrates how the modern era, particularly in the late 19th and early 20th century, was a time period of great ferment. The world was seen as turbulent, dynamic. People believed that the world we lived in was not a fixed entity, but that it could be reshaped, remade entirely. The title of the book is a quote from Karl Marx, who was alluding to the idea that all of the structures we see in the world are ephemeral.
In contrast, in the post-modernist view of the world that came later, we can never cast off our history and start from scratch. Every complex system has a history, and that history continues to constrain the behavior of the system, even though it undergoes change.
We software engineers are modernists at heart. We see the legacy systems in our organizations and believe that, when we have the opportunity to work on replacement systems, we will remake our little corner of the world anew. Alas, on this point, the post-modernists were right. While we can change our systems, even replace subsystems wholesale, we can never fully escape the past. We ignore the history of the system at our peril.