When I was asked to present to a team, I wanted to use my drawings rather than do traditional slides. I actually hate using tools like PowerPoint and Google Slides to do presentations. Typically I use Deckset, but in this case, I wanted to do them all drawn.
I started off by drawing out my slides in advance. But then I thought, “instead of showing pre-drawn slides, why don’t I draw the slides as I talk? That way, people will know where to look because they’ll look at where I’m drawing.”
I still had to prepare the presentation in advance. I drew all of the slides beforehand. And then I printed them out and had them in front of me so that I could re-draw them during the talk. Since it was done over Zoom, people couldn’t actually see that I was working from the print-outs (although they might have heard the paper rustling).
One benefit of this technique was that it made it easier to answer questions, because I could draw out my answer. When I was writing the text at the top, somebody asked, “Is that something like a root cause chain?” I drew the boxes and arrows in response, to explain how this isn’t chain-like, but instead is more like a web.
The selected images above should give you a sense of what my slides looked like. I had fun doing the presentation, and I’d try this approach again. It was certainly more enjoyable than futzing with slide layout.
The trigger was a change to a logging statement, in order to log an exception. During the incident, the engineers noticed that this change lined up with the time that the alerts fired. But, other than the timing, there wasn’t any evidence to suggest that the log statement change was problematic. The change didn’t have any apparent relationship to the symptoms they were seeing with the job runner, which was in a different part of the codebase. And so they assumed that the logging statement change was innocuous.
As it happened, there was a coupling between that log statement and the job runner. Unfortunately for the engineers, this coupling was effectively invisible to them. The connection between the logging statement and the job running was Mailchimp’s log processing pipeline. Here’s an excerpt from the writeup (emphasis mine):
Our log processing pipeline does a bit of normalization to ensure that logs are formatted consistently; a quirk of this processing code meant that trying to log a PHP object that is Iterable would result in that object’s iterator methods being invoked (for example, to normalize the log format of an Array).
Normally, this is an innocuous behavior—but in our case, the harmless logging change that had shipped at the start of the incident was attempting to log PHP exception objects. Since they were occurring during job execution, these exceptions held a stacktrace that included the method the job runner uses to claim jobs for execution (“locking”)—meaning that each time one of these exceptions made it into the logs, the logging pipeline itself was invoking the job runner’s methods and locking jobs that would never be actually run!
Fortunately, there were engineers who had experience with this failure mode before:
Since the whole company had visibility into our progress on the incident, a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.
Having identified the cause, we quickly reverted the not-so-harmless logging change, and our systems very quickly returned to normal.
In the moment, the engineers could not conceive of how a change in behavior in the job runner could be affected by the modification of a log statement in an unrelated part of the code. It was literally unthinkable to them.
The clip below is from a lecture from 2008(?) that then-Google CEO Eric Schmidt gave to a Stanford class.
Here’s a transcript, emphasis mine.
When I was at Novell, I had learned that there were people who I call “glue people”. The glue people are incredibly nice people who sit at interstitial boundaries between groups, and they assist in activity. And they are very, very loyal, and people love them, and you don’t need them at all.
At Novell, I kept trying to get rid of these glue people, because they were getting in the way, because they slowed everything down. And every time I get rid of them in one group, they’d show up in another group, and they’d transfer, and get rehired and all that.
I was telling Larry [Page] and Sergey [Brin] this one day, and Larry said, “I don’t understand what your problem is. Why don’t we just review all of the hiring?” And I said, “What?” And Larry said, “Yeah, let’s just review all the hiring.” And I said, “Really?” He said, “Yes”.
So, guess what? From that moment on, we reviewed every offer packet, literally every one. And anybody who smelled or looked like a glue person, plus the people that Larry and Sergey thought had backgrounds that I liked that they didn’t, would all be fired.
I first watched this lecture years ago, but Schmidt’s expressed contempt for the nice and loyalbut useless glue people just got lodged in my brain, and I’ve never forgotten it. For some reason, this tweet about Google’s various messaging services sparked my memory about it, hence this post.
I’m enjoying Marianne Bellotti’s book Kill It With Fire, which is a kind of guidebook for software modernization projects (think: migrating legacy systems). In Chapter Five, she talks about the importance of momentum for success, and how a crisis can be a valuable tool for creating a sense of urgency. This is the passage that really resonated with me (emphasis in the original):
Occasionally, I went as far as looking for a crisis to draw attention to. This usually didn’t require too much effort. Any system more than five years old will have at least a couple major things wrong with it. It didn’t mean lying, and it didn’t mean injecting problems where they didn’t exist. Instead, it was a matter of storytelling—taking something that was unreported and highlighting its potential risks. These problems were problems, and my analysis of their potential impact was always truthful, but some of them could have easily stayed buried for months or years without triggering a single incident.
Kill It With Fire, p88
This is a great explanation of how describing a problem is a form of power in an organization. Bellotti demonstrates how, by telling a story, she was able to make problems real for an organization, even to the point of creating a crisis. And a crisis receives attention and resources. Crises get resolved.
It’s also a great example the importance of storytelling in technical organizations. Tell a good story, and you can make things happen. It’s a skill that’s worth getting better at.
There’s a wonderful book by the political philosopher Marshall Berman called All That is Solid Melts Into Air. The subtitle of the book is the experience of modernity, and, indeed, the book tries to capture the feeling of what is like to live in the modern period, as illustrated through the writings of famous modernist authors, both fiction and non.
Berman demonstrates how the modern era, particularly in the late 19th and early 20th century, was a time period of great ferment. The world was seen as turbulent, dynamic. People believed that the world we lived in was not a fixed entity, but that it could be reshaped, remade entirely. The title of a book is a quote from Karl Marx, who was alluding to the idea that all of the structures we see in the world are ephemeral.
In contrast, in the post-modernist view of the world that came later, we can never cast off our history and start from scratch. Every complex system has a history, and that history continues to constrain the behavior of the system, even though it undergoes change.
We software engineers are modernists at heart. We see the legacy systems in our organizations and believe that, when we have the opportunity to work on replacement systems, we will remake our little corner of the world anew. Alas, on this point, the post-modernists were right. While we can change our systems, even replace subsystems wholesale, we can never fully escape the past. We ignore the history of the system at our peril.
Succeeding at a project in an organization is like pushing a boulder up a hill that is too heavy for any single person to lift.
It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.
Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.
Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.
This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.
In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete. Because what’s keeping the complex system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people that are constantly watching the boulder to see if it starts to slip, and applying force to keep it balanced.
Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.
Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.
But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.
When the boulder falls, it means that the collection of processes weren’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keep the boulder up.
This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and it failing alone couldn’t have brought down the system.
What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.
Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy this year to better prioritize resources and help prevent small blazes from growing into uncontrollable conflagrations.
The [Tamarack] fire began as a July 4 lightning strike on a single tree in the Mokelumne Wilderness, a rugged area southeast of Sacramento. Forest officials decided to monitor it rather than attempt to put it out, a decision a spokeswoman said was based on scant resources and the remote location. But the blaze continued to grow, eventually consuming nearly 69,000 acres, destroying homes and causing mass evacuations. It is now 82% contained.
Instead of letting some naturally caused small blazes burn, the agency’s priorities will shift this year, U.S. Forest Service Chief Randy Moore indicated to the staff in a letter Monday. The focus, he said, will be on firefighter and public safety.
The U.S. Forest Service had to make a call about whether to put out a fire or to monitor it and let it burn out. In this case, they decided to monitor it, and the fire grew out of control.
Now, imagine an alternate universe where the Forest Service spent some of its scant resources on putting out this fire, and then another fire popped up somewhere else, and they didn’t have the resources to fight that one effectively, and it went out of control. The news coverage would, undoubtedly, be equally unkind.
Practitioners often must make risk trade-offs in the moment, when there is a high amount of uncertainty. What was the risk that the fire would grow out of control? How does it stack up against the risk of being short staffed if you send out firefighters to put out a small fire and a large one breaks out elsewhere?
Towards the middle of the piece, the article goes into some detail about the issue of limited resources.
[Agriculture Secretary Tom] Vilsack promised more federal aid and cooperation for California’s plight, acknowledging concerns about past practices while also stressing that, with dozens of fires burning across the West and months to go in a prolonged fire season, there are not enough resources to put them all out.
“Candidly I think it’s fair to say, over the generations, over the decades, we have tried to do this job on the cheap,” Vilsack said. “We’ve tried to get by, a little bit here, a little bit there, a little forest management over here, a little fire suppression over here. But the reality is this has caught up with us, which is why we have an extraordinary number of catastrophic fires and why we have to significantly beef up our capacity.”
Vilsack said that the bipartisan infrastructure bill working its way through Congress would provide some of those resources but that ultimately it would take “billions” of dollars and years of catch-up to create fire-resilient forests.
Imagine a colleague comes to you and says, “I’m doing the writeup for a recent incident. I have to specify causes, but I’m not sure which ones to put. Which of these do you think I should go with?”
Engineer entered incorrect configuration value. The engineer put in the wrong config value, which led to the critical foo system to return error responses.
Non-actionable alerts. The engineer who specified the config had just come off an on-call shift where they had to deal with multiple alerts that fired the night before. All of those alerts turned out to be non-actionable. The engineer was tired the day they put in the configuration change. Had the engineer not had to deal with these alerts, they would have been sharper the next day, and likely would have spotted the config problem.
Work prioritization. An accidentally incorrect configuration value was a known risk to the team, and they had been planning to build in some additional verification to guard against these sorts of configuration values. But this work was de-prioritized in favor of work that supported a high-priority feature to the business, which involved coordination across multiple teams. Had the work not been de-prioritized, there would have been guardrails in place that would have prevented the config change from taking down the system.
Powerdynamics. The manager of the foo team had asked leadership for additional headcount, to enable the team to do both the work that was high priority to the business and to work on addressing known risks. However, the request was denied, and the available headcount was allocated to other teams, based on the perceived priorities of the business. If the team manager had had more power in the org, they would have been able to acquire the additional resources and address the known risks.
There’s a sense in which all of these can count as causes. If any of them weren’t present, the incident wouldn’t have happened. But we don’t see them the same way. I can guarantee that you’re never going to see power dynamics listed as a cause in an incident writeup, public or internal.
The reason is not that “incorrect configuration value” is somehow objectively more causal than power dynamics. Rather, the sorts of things that are allowed to be labelled as causes depends on the cultural norms of an organization. This is what people mean when they say that causes are socially constructed.
And who gets to determine what’s allowed to be labelled as a cause and what isn’t is itself a property of power dynamics. Because the things that are allowed to be called causes are things that an organization is willing to label as a problem, which means that it’s something that can receive organizational attention and resources in order to be addressed.
Remember this the next time you identify a contributing factor in an incident and somebody responds with, “that’s not why the incident happened.” That isn’t an objective statement of fact. It’s a value judgment about what’s permitted to be identified as a cause.
Just listening to this experience was so powerful. It taught me to challenge the myth of “commercial pressure”. We tend to think that every organizational problem is the result of cost-cutting. Yet, the cost of a new drain pump was only $90… [that’s] nothing when you are running a ship. As it turned out, the purchase order had landed in the wrong department.
Whenever we read about a public incident, a common pattern in the reporting is that the organization under-invested in some area, and that can explain why the incident happened. “If only the execs hadn’t been so greedy”, we think, “if they had actually invested some additional resources, this wouldn’t have happened!”
What was interesting as well around this time is that when the chemical safety board looked at this in depth, they found a bunch of BP organizational change. There were a whole series of reorganizations of this facility over the past few years that had basically disrupted the understandings of who was responsible for safety and what was that responsibility. And instead of safety being some sort of function here, it became abstracted away into the organization somewhere. A lot of these conditions, this chained-closed outlet here, the failure of the sensors, problems with the operators and so on… all later on seemed to be somehow brought on by the rapid rate of organizational change. And that safety had somehow been lost in the process.
Production pressure is an ever-present risk. his is what David Woods calls faster/better/cheaper pressure, a nod to NASA policy. In fact, if you follow the link to the Richard Cook lecture, right after talking about the BP reorgs, he discusses the role of production pressure in the Texas City BP explosion. However, production pressure is never the whole story.
Remember the Equifax breach that happened a few years ago? Here’s a brief timeline I extracted from the report:
Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners should patch within 48 hours
Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app
The Equifax vulnerability management team sent out a notification about the Struts vulnerability a day after they received notice about it. But, as in the two cases above, something got lost in the system. What I wondered reading the report was: How did that notification get lost? Were the engineers who operate the ACIS app not on that mailing list? Did they receive the email, but something kept them from acting on it? Perhaps there was nobody responsible for security patches of the app at the time the notification went out? Maddeningly, the report doesn’t say. After reading that report, I still feel like I don’t understand what it was about the system that enabled the notification to get lost.
It’s so easy to explain an incident by describing how management could have prevented it from investing additional resources. This is what Nippin Anand calls the myth of commercial pressure. It’s all too satisfying for us to identify short-sighted management decisions as the reason that an incident happened.
I’m calling this tendency the greedy exec trap, because once we identify the cause of an incident as greedy executives, we stop asking questions. We already know why the incident happened, so what more do we need to look into? What else is there to learn, really?
Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.