Inconceivable

Back in July, Ray Ashman at Mailchimp posted a wonderful writeup of an internal incident (h/t to SRE Weekly). It took the Mailchimp engineers almost two days to make sense of the failure mode.

The trigger was a change to a logging statement, in order to log an exception. During the incident, the engineers noticed that this change lined up with the time that the alerts fired. But, other than the timing, there wasn’t any evidence to suggest that the log statement change was problematic. The change didn’t have any apparent relationship to the symptoms they were seeing with the job runner, which was in a different part of the codebase. And so they assumed that the logging statement change was innocuous.

As it happened, there was a coupling between that log statement and the job runner. Unfortunately for the engineers, this coupling was effectively invisible to them. The connection between the logging statement and the job running was Mailchimp’s log processing pipeline. Here’s an excerpt from the writeup (emphasis mine):

Our log processing pipeline does a bit of normalization to ensure that logs are formatted consistently; a quirk of this processing code meant that trying to log a PHP object that is Iterable would result in that object’s iterator methods being invoked (for example, to normalize the log format of an Array).

Normally, this is an innocuous behavior—but in our case, the harmless logging change that had shipped at the start of the incident was attempting to log PHP exception objects. Since they were occurring during job execution, these exceptions held a stacktrace that included the method the job runner uses to claim jobs for execution (“locking”)—meaning that each time one of these exceptions made it into the logs, the logging pipeline itself was invoking the job runner’s methods and locking jobs that would never be actually run! 

Fortunately, there were engineers who had experience with this failure mode before:

Since the whole company had visibility into our progress on the incident, a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.



Having identified the cause, we quickly reverted the not-so-harmless logging change, and our systems very quickly returned to normal.

In the moment, the engineers could not conceive of how a change in behavior in the job runner could be affected by the modification of a log statement in an unrelated part of the code. It was literally unthinkable to them.

Software Misadventures Podcast

I was recently a guest on the Software Misadventures Podcast.

Bruno Connelly – Building and leading the global SRE org at LinkedIn – #14 Software Misadventures

Bruno Connelly is a VP of Engineering at LinkedIn. He leads the Site Engineering org responsible for LinkedIn's production infrastructure. He joins the show to talk about his journey in tech – from teaching himself how to code at a young age, building, maintaining and reverse engineering software as a teenager, building ISPs in the early part of his career (there are some fun stories that involve sleeping in the data center) to leading the SRE org at LinkedIn over the last decade. He talks about the early days at LinkedIn that involved a lot of firefighting to keep the site up, how the team built technical stability and scaled the platform. We also dive into how he grew the SRE org globally and overcame challenges that came with the growth. Throughout the conversation, he shares various nuggets of wisdom – like how to stay calm under pressure and how to make people feel at ease – as he describes his leadership style, people who have influenced him and what he thinks is a positive way to collaborate with people.   Website link: https://softwaremisadventures.com/bruno   Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
  1. Bruno Connelly – Building and leading the global SRE org at LinkedIn – #14
  2. Lorin Hochstein – On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more – #13
  3. Spoons (Daniel Spoonhower) – On building Lightstep, being customer focused, developing systems at Google scale and much more – #12
  4. Emmanuel Ameisen – On production ML at Stripe scale, leading 100+ ML projects, iterating fast, and much more – #11
  5. Todd Underwood – On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more – #10

Contempt for the glue people

The clip below is from a lecture from 2008(?) that then-Google CEO Eric Schmidt gave to a Stanford class.

Here’s a transcript, emphasis mine.

When I was at Novell, I had learned that there were people who I call “glue people”. The glue people are incredibly nice people who sit at interstitial boundaries between groups, and they assist in activity. And they are very, very loyal, and people love them, and you don’t need them at all.

At Novell, I kept trying to get rid of these glue people, because they were getting in the way, because they slowed everything down. And every time I get rid of them in one group, they’d show up in another group, and they’d transfer, and get rehired and all that.

I was telling Larry [Page] and Sergey [Brin] this one day, and Larry said, “I don’t understand what your problem is. Why don’t we just review all of the hiring?” And I said, “What?” And Larry said, “Yeah, let’s just review all the hiring.” And I said, “Really?” He said, “Yes”.

So, guess what? From that moment on, we reviewed every offer packet, literally every one. And anybody who smelled or looked like a glue person, plus the people that Larry and Sergey thought had backgrounds that I liked that they didn’t, would all be fired.

I first watched this lecture years ago, but Schmidt’s expressed contempt for the nice and loyal but useless glue people just got lodged in my brain, and I’ve never forgotten it. For some reason, this tweet about Google’s various messaging services sparked my memory about it, hence this post.

Remembering the important bits when you need them

I’m working my way through the Cambridge Handbook of Expertise and Expert Performance, which is a collection of essays from academic researchers who study expertise.

Chapter 6 discusses the ability of experts to recall information that’s relevant to the task at hand. This is one of the differences between experts and novices: a novice might answer questions about a subject correctly on a test, but when faced with a real problem that requires that knowledge, they aren’t able to retrieve it.

The researchers K. Anders Ericsson and Walter Kintsch had an interesting theory about how experts do better at this than novices. The theory goes like this: when an expert encounters some new bit of information, they have the ability to encode that information into their long-term memory in association with a collection of cues of when that information would be relevant.

In other words, experts are able to predict the context when that information might be relevant in the future, and are able to use that contextual information as a kind of key that they can use to retrieve the information later on.

Now, think about reading an incident write-up. You might learn about a novel failure mode in some subsystem your company uses (say, a database), as well as the details that led up to it happening, including some of the weird, anomalous signals that were seen earlier on. If you have expertise in operations, you’ll encode information about the failure mode into your long term memory and associate it with the symptoms. So, the next time you see those symptoms in production, you’ll remember this failure mode.

This will only work if the incident write-up has enough detail to provide you with the cues that you need to encode in your memory. This is another reason to provide a rich description of the incident. Because the people reading it, if they’re good at operations, will encode the details of the failure mode into their memory. If it happens again, and they read the write up, they’ll remember.

The power of framing a problem

I’m enjoying Marianne Bellotti’s book Kill It With Fire, which is a kind of guidebook for software modernization projects (think: migrating legacy systems). In Chapter Five, she talks about the importance of momentum for success, and how a crisis can be a valuable tool for creating a sense of urgency. This is the passage that really resonated with me (emphasis in the original):

Occasionally, I went as far as looking for a crisis to draw attention to. This usually didn’t require too much effort. Any system more than five years old will have at least a couple major things wrong with it. It didn’t mean lying, and it didn’t mean injecting problems where they didn’t exist. Instead, it was a matter of storytelling—taking something that was unreported and highlighting its potential risks. These problems were problems, and my analysis of their potential impact was always truthful, but some of them could have easily stayed buried for months or years without triggering a single incident.

Kill It With Fire, p88

This is a great explanation of how describing a problem is a form of power in an organization. Bellotti demonstrates how, by telling a story, she was able to make problems real for an organization, even to the point of creating a crisis. And a crisis receives attention and resources. Crises get resolved.

It’s also a great example the importance of storytelling in technical organizations. Tell a good story, and you can make things happen. It’s a skill that’s worth getting better at.

The local nature of culture

I’m really enjoying Turn the Ship Around!, a book by David Marquet about his experiences as commander of a nuclear submarine, the USS Santa Fe, and how he worked to improve its operational performance.

One of the changes that Marquet introduced is something he calls “thinking out loud”, where he encourages crew members to speak aloud their thoughts about things like intentions, expectations, and concerns. He notes that this approach contradicted naval best practices:

As naval officers, we stress formal communications and even have a book, the Interior Communications Manual, that specifies exactly how equipment, watch stations, and evolutions are spoken, written, and abbreviated …

This adherence to formal communications unfortunately crowds out the less formal but highly important contextual information needed for peak team performance. Words like “I think…” or “I am assuming…” or “It is likely…” that are not specific and concise orders get written up by inspection teams as examples of informal communications, a big no-no. But that is just the communication we need to make leader-leader work.

Turn the Ship Around! p103

This change did improve the ship operations, and this improvement was recognized by the Navy. Despite that, Marquet still got pushback for violating norms.

[E]ven though Santa Fe was performing at the top of the fleet, officers steeped in the leader-follower mind-set would criticize what they viewed as the informal communications on Santa Fe. If you limit all discussion to crisp orders and eliminate all contextual discussion, you get a pretty quiet control room. That was viewed as good. We cultivated the opposite approach and encouraged a constant buzz of discussions among the watch officers and crew. By monitoring that level of buzz, more than the actual content, I got a good gauge of how well the ship was running and whether everyone was sharing information.

Turn the Ship Around! p103

Reading this reminded me how local culture can be. I shouldn’t be surprised, though. At Netflix, I’ve worked on three teams (and six managers!) and each team had very different local cultures, despite all of them being in the same organization, Platform Engineering.

I used to wonder, “how does a large company like Google write software?” But I no longer think that’s a meaningful question. It’s not Google as an organization that writes software, it’s individual teams that do. The company provides the context that the teams work in, and the teams are constrained by various aspects of the organization, including the history of the technology they work on. But, there’s enormous cultural variation from one team to the next. And, as Marquet illustrates, you can change your local culture, even cutting against organizational “best practices”.

So, instead of asking, “what is it like to work at company X”, the question you really want answered is, “what is it like to work on team Y at company X?”

What do you work on, anyway?

I often struggle to describe the project that I work on at my day job, even though it’s an open-source project that even has its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”. But that explanation does not yield much insight.

I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.

With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.

When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.

A common pattern we see is that service owners have a notion of an environment (e.g., test, staging, prod). For example, maybe they’ll deploy the code to test, and maybe run some automated tests against it, and if it looks good, they’ll promote to staging, and maybe they’ll do some manual tests, and if they’re happy, they’ll promote out to prod.

Example of deployment environments

Imagine test, staging, and prod all have version v23 of the code running in it. After version v24 is cut, it will first be deployed in test, then staging, then prod. That’s how each version will propagate through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, human makes a judgment).

You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments changes over time. And you can describe this pattern declaratively, rather than imperatively like you would with traditional pipelines.

And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.

If you think of Kubernetes as a system for specifying desired state, Managed Delivery is a system for specifying how desired state evolves over time

With Managed Delivery, you can say express concepts like:

  • for a code version to be promoted to the staging environment, it must
    • be successfully deployed to the test environment
    • pass a suite of end-to-end automated tests specified by the app owner

and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.

And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting change from one from environment to the next.

Modernists trapped in a post-modern universe

There’s a wonderful book by the political philosopher Marshall Berman called All That is Solid Melts Into Air. The subtitle of the book is the experience of modernity, and, indeed, the book tries to capture the feeling of what is like to live in the modern period, as illustrated through the writings of famous modernist authors, both fiction and non.

Berman demonstrates how the modern era, particularly in the late 19th and early 20th century, was a time period of great ferment. The world was seen as turbulent, dynamic. People believed that the world we lived in was not a fixed entity, but that it could be reshaped, remade entirely. The title of a book is a quote from Karl Marx, who was alluding to the idea that all of the structures we see in the world are ephemeral.

In contrast, in the post-modernist view of the world that came later, we can never cast off our history and start from scratch. Every complex system has a history, and that history continues to constrain the behavior of the system, even though it undergoes change.

We software engineers are modernists at heart. We see the legacy systems in our organizations and believe that, when we have the opportunity to work on replacement systems, we will remake our little corner of the world anew. Alas, on this point, the post-modernists were right. While we can change our systems, even replace subsystems wholesale, we can never fully escape the past. We ignore the history of the system at our peril.

Root cause of failure, root cause of success

Here are a couple of tweets from John Allspaw.

Succeeding at a project in an organization is like pushing a boulder up a hill that is too heavy for any single person to lift.

A team working together to successfully move a boulder to the top of the hill

It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.

Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.

The boulder made it to the top!

Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.

In comes the ball, down goes the boulder

This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.

In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete. Because what’s keeping the complex system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people that are constantly watching the boulder to see if it starts to slip, and applying force to keep it balanced.

A collection of people watching the boulder and pushing on it to keep it from falling

Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.

Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.

But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.

When the boulder falls, it means that the collection of processes weren’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keep the boulder up.

This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because a collection of control processes keep the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and it failing alone couldn’t have brought down the system.

What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.

Burned by ‘let it burn’

Here are some excerpts from a story from the L.A. Times, with the headline: Forest Service changes ‘let it burn’ policy following criticism from western politicians (emphasis mine)

Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy this year to better prioritize resources and help prevent small blazes from growing into uncontrollable conflagrations.

The [Tamarack] fire began as a July 4 lightning strike on a single tree in the Mokelumne Wilderness, a rugged area southeast of Sacramento. Forest officials decided to monitor it rather than attempt to put it out, a decision a spokeswoman said was based on scant resources and the remote location. But the blaze continued to grow, eventually consuming nearly 69,000 acres, destroying homes and causing mass evacuations. It is now 82% contained.

Instead of letting some naturally caused small blazes burn, the agency’s priorities will shift this year, U.S. Forest Service Chief Randy Moore indicated to the staff in a letter Monday. The focus, he said, will be on firefighter and public safety.

The U.S. Forest Service had to make a call about whether to put out a fire or to monitor it and let it burn out. In this case, they decided to monitor it, and the fire grew out of control.

Now, imagine an alternate universe where the Forest Service spent some of its scant resources on putting out this fire, and then another fire popped up somewhere else, and they didn’t have the resources to fight that one effectively, and it went out of control. The news coverage would, undoubtedly, be equally unkind.

Practitioners often must make risk trade-offs in the moment, when there is a high amount of uncertainty. What was the risk that the fire would grow out of control? How does it stack up against the risk of being short staffed if you send out firefighters to put out a small fire and a large one breaks out elsewhere?

Towards the middle of the piece, the article goes into some detail about the issue of limited resources.

[Agriculture Secretary Tom] Vilsack promised more federal aid and cooperation for California’s plight, acknowledging concerns about past practices while also stressing that, with dozens of fires burning across the West and months to go in a prolonged fire season, there are not enough resources to put them all out.

“Candidly I think it’s fair to say, over the generations, over the decades, we have tried to do this job on the cheap,” Vilsack said. “We’ve tried to get by, a little bit here, a little bit there, a little forest management over here, a little fire suppression over here. But the reality is this has caught up with us, which is why we have an extraordinary number of catastrophic fires and why we have to significantly beef up our capacity.”

Vilsack said that the bipartisan infrastructure bill working its way through Congress would provide some of those resources but that ultimately it would take “billions” of dollars and years of catch-up to create fire-resilient forests.

The U.S. Forest Service’s policy on allowing unplanned wildfires to burn differs from the California Department of Forestry and Fire Protection, and I’m not a domain expert, so I don’t have an informed opinion. But this isn’t just a story about policy, it’s a story about saturation. It’s also about what’s allowed (and not allowed) to count as a cause.