What do you work on, anyway?

I often struggle to describe the project that I work on at my day job, even though it’s an open-source project with its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”, but that explanation doesn’t yield much insight.

I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.

With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.

When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.
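To make the Kubernetes half of the analogy concrete, here’s a minimal sketch of a desired-state declaration and a version bump. It’s written as a Python dict rather than the usual YAML manifest, it’s deliberately pared down, and the service name, registry, and image tags are all made up.

```python
# A pared-down sketch of a Kubernetes Deployment's desired state, expressed
# as a Python dict instead of the usual YAML manifest. Only the fields
# relevant to the analogy are shown; the names and tags are made up.
desired_state = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "my-service"},
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "my-service",
                        # the container image name and version to run
                        "image": "registry.example.com/my-service:v23",
                    }
                ]
            }
        },
    },
}

# "Pushing new code" means changing the desired state: bump the image version
# and let the system reconcile the running containers toward it.
desired_state["spec"]["template"]["spec"]["containers"][0]["image"] = (
    "registry.example.com/my-service:v24"
)
```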

A common pattern we see is that service owners have a notion of an environment (e.g., test, staging, prod). For example, maybe they’ll deploy the code to test, and maybe run some automated tests against it, and if it looks good, they’ll promote to staging, and maybe they’ll do some manual tests, and if they’re happy, they’ll promote out to prod.

Example of deployment environments

Imagine test, staging, and prod all have version v23 of the code running in them. After version v24 is cut, it will first be deployed to test, then staging, then prod. That’s how each version will propagate through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, a human makes a judgment call).

You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments change over time. And you can describe this pattern declaratively, rather than imperatively as you would with traditional pipelines.

And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.

If you think of Kubernetes as a system for specifying desired state, Managed Delivery is a system for specifying how desired state evolves over time
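To put the calculus analogy in symbols (this is just my notation, not anything from the Kubernetes or Managed Delivery docs): write d(t) for the desired state of an environment at time t, and a(t) for its actual state.

```latex
% My notation, not the project's: d(t) is the desired state of an
% environment at time t, and a(t) is its actual state.

% A Kubernetes-style reconciler continuously drives the actual state
% toward the desired state:
\[ a(t) \longrightarrow d(t) \]

% Managed Delivery declares how the desired state itself should evolve,
% i.e., (loosely) the time-derivative of the desired-state function:
\[ \frac{\mathrm{d}}{\mathrm{d}t}\, d(t) \]
```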

With Managed Delivery, you can express concepts like:

  • for a code version to be promoted to the staging environment, it must
    • be successfully deployed to the test environment
    • pass a suite of end-to-end automated tests specified by the app owner

and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.
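Here’s a sketch of what such an environment-promotion spec could look like, again as a Python dict. To be clear, this is not the actual Managed Delivery config format; it’s just an illustration of the idea that each environment declares the constraints a version must satisfy before it can be promoted there.

```python
# A hypothetical environment-promotion spec, NOT the real Managed Delivery
# config format. Each environment lists the constraints that a new version
# must satisfy before it can be promoted into that environment.
delivery_config = {
    "application": "my-service",  # made-up application name
    "environments": [
        {
            "name": "test",
            "constraints": [],  # every new version lands here first
        },
        {
            "name": "staging",
            "constraints": [
                # must already be running successfully in test
                {"type": "depends-on", "environment": "test"},
                # must pass the app owner's end-to-end test suite
                {"type": "automated-tests", "suite": "end-to-end"},
            ],
        },
        {
            "name": "prod",
            "constraints": [
                {"type": "depends-on", "environment": "staging"},
                # a human decides when to promote to prod
                {"type": "manual-judgment"},
            ],
        },
    ],
}
```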

And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting change from one environment to the next.

Modernists trapped in a post-modern universe

There’s a wonderful book by the political philosopher Marshall Berman called All That is Solid Melts Into Air. The subtitle of the book is the experience of modernity, and, indeed, the book tries to capture the feeling of what it is like to live in the modern period, as illustrated through the writings of famous modernist authors, both fiction and non.

Berman demonstrates how the modern era, particularly the late 19th and early 20th centuries, was a period of great ferment. The world was seen as turbulent, dynamic. People believed that the world we lived in was not a fixed entity, but that it could be reshaped, remade entirely. The title of the book is a quote from Karl Marx, who was alluding to the idea that all of the structures we see in the world are ephemeral.

In contrast, in the post-modernist view of the world that came later, we can never cast off our history and start from scratch. Every complex system has a history, and that history continues to constrain the behavior of the system, even as it undergoes change.

We software engineers are modernists at heart. We see the legacy systems in our organizations and believe that, when we have the opportunity to work on replacement systems, we will remake our little corner of the world anew. Alas, on this point, the post-modernists were right. While we can change our systems, even replace subsystems wholesale, we can never fully escape the past. We ignore the history of the system at our peril.

Root cause of failure, root cause of success

This line of thinking was prompted by a couple of tweets from John Allspaw.

Succeeding at a project in an organization is like pushing a boulder up a hill, a boulder too heavy for any single person to lift.

A team working together to successfully move a boulder to the top of the hill

It doesn’t make sense to ask what the “root cause of success” is for an effort like this, because it’s a collaboration that requires the work of many different people to succeed. It’s not meaningful to single out a particular individual as the reason the boulder made it to the top.

Now, let’s imagine that the team got the boulder to the top of the hill, and balanced it precariously at the summit, maybe with some supports to keep it from tumbling down again.

The boulder made it to the top!

Next, imagine that there’s a nearby baseball field, and some kid whacks a fly ball that strikes one of the supports, and the rock tumbles down.

In comes the ball, down goes the boulder

This, I think, is how people tend to view failure in systems. A perturbation comes along, strikes the system, and the system falls over. We associate the root cause with this perturbation.

In a way, our systems are like a boulder precariously balanced at the top of a hill. But this view is incomplete, because what’s keeping the complex-system boulder balanced is not a collection of passive supports. Instead, there are a number of active processes, like a group of people constantly watching the boulder to see if it starts to slip and applying force to keep it balanced.

A collection of people watching the boulder and pushing on it to keep it from falling

Any successful complex system will have evolved these sorts of dynamic processes. These are what keep the system from falling over every time a kid hits a stray ball.

Note that it’s not the case that all of these processes have to be working for the boulder to stay up. The boulder won’t fall just because someone let their guard down for a moment, or even if one person happened to be absent one day; the boulder would never stay up if it required everyone to behave perfectly all of the time. Because it’s a group of people keeping it balanced, there is redundancy: one person can compensate for another person who falters.

But this keeping-the-boulder-balanced system isn’t perfect. Maybe something comes out of the sky and strikes the boulder with an enormous amount of force. Or maybe several people are sluggish today because they’re sick. Or maybe it rained and the surface of the hill is much slipperier, making it more difficult to navigate. Maybe it’s a combination of all of these.

When the boulder falls, it means that the collection of processes wasn’t able to compensate for the disturbance. But there’s no single problem, no root cause, that you can point to, because it’s the collection of these processes working together that normally keeps the boulder up.

This is why “root cause of failure” doesn’t make sense in the context of complex systems failure, because it is a collection of control processes that keeps the system up and running. A system failure is a failure of this overall set of processes. It’s just not meaningful to single out a problem with one of these processes after an incident, because that process is just one of many, and its failing alone couldn’t have brought down the system.

What makes things even trickier is that some of these processes are invisible, even to the people inside of the system. We don’t see the monitoring and adjustment that is going on around us. Which means we won’t notice if some of these control processes stop happening.

Burned by ‘let it burn’

Here are some excerpts from a story from the L.A. Times, with the headline: Forest Service changes ‘let it burn’ policy following criticism from western politicians (emphasis mine)

Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy this year to better prioritize resources and help prevent small blazes from growing into uncontrollable conflagrations.

The [Tamarack] fire began as a July 4 lightning strike on a single tree in the Mokelumne Wilderness, a rugged area southeast of Sacramento. Forest officials decided to monitor it rather than attempt to put it out, a decision a spokeswoman said was based on scant resources and the remote location. But the blaze continued to grow, eventually consuming nearly 69,000 acres, destroying homes and causing mass evacuations. It is now 82% contained.

Instead of letting some naturally caused small blazes burn, the agency’s priorities will shift this year, U.S. Forest Service Chief Randy Moore indicated to the staff in a letter Monday. The focus, he said, will be on firefighter and public safety.

The U.S. Forest Service had to make a call about whether to put out a fire or to monitor it and let it burn out. In this case, they decided to monitor it, and the fire grew out of control.

Now, imagine an alternate universe where the Forest Service spent some of its scant resources on putting out this fire, and then another fire popped up somewhere else, and they didn’t have the resources to fight that one effectively, and it went out of control. The news coverage would, undoubtedly, be equally unkind.

Practitioners often must make risk trade-offs in the moment, when there is a high amount of uncertainty. What was the risk that the fire would grow out of control? How does it stack up against the risk of being short-staffed if you send out firefighters to put out a small fire and a large one breaks out elsewhere?

Towards the middle of the piece, the article goes into some detail about the issue of limited resources.

[Agriculture Secretary Tom] Vilsack promised more federal aid and cooperation for California’s plight, acknowledging concerns about past practices while also stressing that, with dozens of fires burning across the West and months to go in a prolonged fire season, there are not enough resources to put them all out.

“Candidly I think it’s fair to say, over the generations, over the decades, we have tried to do this job on the cheap,” Vilsack said. “We’ve tried to get by, a little bit here, a little bit there, a little forest management over here, a little fire suppression over here. But the reality is this has caught up with us, which is why we have an extraordinary number of catastrophic fires and why we have to significantly beef up our capacity.”

Vilsack said that the bipartisan infrastructure bill working its way through Congress would provide some of those resources but that ultimately it would take “billions” of dollars and years of catch-up to create fire-resilient forests.

The U.S. Forest Service’s policy on allowing unplanned wildfires to burn differs from that of the California Department of Forestry and Fire Protection, and I’m not a domain expert, so I don’t have an informed opinion. But this isn’t just a story about policy, it’s a story about saturation. It’s also about what’s allowed (and not allowed) to count as a cause.

StaffEng podcast

I had fun being a guest on the StaffEng podcast.


What’s allowed to count as a cause?

Imagine a colleague comes to you and says, “I’m doing the writeup for a recent incident. I have to specify causes, but I’m not sure which ones to put. Which of these do you think I should go with?”

  1. Engineer entered incorrect configuration value. The engineer put in the wrong config value, which caused the critical foo system to return error responses.
  2. Non-actionable alerts. The engineer who specified the config had just come off an on-call shift where they had to deal with multiple alerts that fired the night before. All of those alerts turned out to be non-actionable. The engineer was tired the day they put in the configuration change. Had the engineer not had to deal with these alerts, they would have been sharper the next day, and likely would have spotted the config problem.
  3. Work prioritization. The team knew that an incorrect configuration value was a risk, and they had been planning to build in some additional verification to guard against this sort of mistake. But this work was de-prioritized in favor of work that supported a feature that was a high priority for the business, which involved coordination across multiple teams. Had the work not been de-prioritized, there would have been guardrails in place that would have prevented the config change from taking down the system.
  4. Power dynamics. The manager of the foo team had asked leadership for additional headcount, to enable the team to do both the work that was high priority to the business and to work on addressing known risks. However, the request was denied, and the available headcount was allocated to other teams, based on the perceived priorities of the business. If the team manager had had more power in the org, they would have been able to acquire the additional resources and address the known risks.

There’s a sense in which all of these can count as causes. If any of them weren’t present, the incident wouldn’t have happened. But we don’t see them the same way. I can guarantee that you’re never going to see power dynamics listed as a cause in an incident writeup, public or internal.

The reason is not that “incorrect configuration value” is somehow objectively more causal than power dynamics. Rather, the sorts of things that are allowed to be labelled as causes depend on the cultural norms of an organization. This is what people mean when they say that causes are socially constructed.

And who gets to determine what’s allowed to be labelled as a cause and what isn’t is itself a property of power dynamics. Because the things that are allowed to be called causes are things that an organization is willing to label as a problem, which means that it’s something that can receive organizational attention and resources in order to be addressed.

Remember this the next time you identify a contributing factor in an incident and somebody responds with, “that’s not why the incident happened.” That isn’t an objective statement of fact. It’s a value judgment about what’s permitted to be identified as a cause.

The greedy exec trap

Just listening to this experience was so powerful. It taught me to challenge the myth of “commercial pressure”. We tend to think that every organizational problem is the result of cost-cutting. Yet, the cost of a new drain pump was only $90… [that’s] nothing when you are running a ship. As it turned out, the purchase order had landed in the wrong department.

Nippin Anand, Deep listening — a personal journey

Whenever we read about a public incident, a common pattern in the reporting is that the organization under-invested in some area, and that can explain why the incident happened. “If only the execs hadn’t been so greedy”, we think, “if they had actually invested some additional resources, this wouldn’t have happened!”

What was interesting as well around this time is that when the chemical safety board looked at this in depth, they found a bunch of BP organizational change. There were a whole series of reorganizations of this facility over the past few years that had basically disrupted the understandings of who was responsible for safety and what was that responsibility. And instead of safety being some sort of function here, it became abstracted away into the organization somewhere. A lot of these conditions, this chained-closed outlet here, the failure of the sensors, problems with the operators and so on… all later on seemed to be somehow brought on by the rapid rate of organizational change. And that safety had somehow been lost in the process.

Richard Cook, Process tracing, Texas City BP explosion, 2005, Lectures on the study of cognitive work

Production pressure is an ever-present risk. This is what David Woods calls faster/better/cheaper pressure, a nod to NASA policy. In fact, if you follow the link to the Richard Cook lecture, right after talking about the BP reorgs, he discusses the role of production pressure in the Texas City BP explosion. However, production pressure is never the whole story.

Remember the Equifax breach that happened a few years ago? Here’s a brief timeline I extracted from the report:

  • Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
  • Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
  • Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners should patch within 48 hours
  • ???
  • Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app

The Equifax vulnerability management team sent out a notification about the Struts vulnerability a day after they received notice about it. But, as in the two cases above, something got lost in the system. What I wondered reading the report was: How did that notification get lost? Were the engineers who operate the ACIS app not on that mailing list? Did they receive the email, but something kept them from acting on it? Perhaps there was nobody responsible for security patches of the app at the time the notification went out? Maddeningly, the report doesn’t say. After reading that report, I still feel like I don’t understand what it was about the system that enabled the notification to get lost.

It’s so easy to explain an incident by describing how management could have prevented it by investing additional resources. This is what Nippin Anand calls the myth of commercial pressure. It’s all too satisfying for us to identify short-sighted management decisions as the reason that an incident happened.

I’m calling this tendency the greedy exec trap, because once we identify the cause of an incident as greedy executives, we stop asking questions. We already know why the incident happened, so what more do we need to look into? What else is there to learn, really?

Trouble during startup

I asked a question on Twitter today about trouble during startup, and I received a lot of great responses.

My question was motivated by this lecture by Dr. Richard Cook about an explosion at a BP petroleum processing plant in Texas in 2005 that killed fifteen plant workers. Here’s a brief excerpt, starting at 16:31 of the video:


Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.


This is yet another reminder of how similar we are to other fields that control processes.

Controlling a process we don’t understand

I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.

As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I actually think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work spans fields such as spaceflight, aviation, maritime operations, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.

We all have displays that tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene to bring the process back into a healthy state. We all feel production pressure, face ambiguity (is that blip really a problem?) and high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.

Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.

Designing like a joint cognitive system

There’s a famous paper by Gary Klein, Paul Feltovich, and David Woods, called Common Ground and Coordination in Joint Activity. Written in 2004, this paper discusses the challenges a group of people face when trying to achieve a common goal. The authors introduce the concept of common ground, which must be established and maintained by all of the participants in order for them to reach the goal together.

I’ve blogged previously about the concept of common ground, and the associated idea of the basic compact. (You can also watch John Allspaw discuss the paper at Papers We Love). Common ground is typically discussed in the context of high-tempo activities. The most popular example in our field is an ad hoc team of engineers responding to an incident.

The book Designing Engineers was originally published in 1994, ten years before the Common Ground paper, and so Louis Bucciarelli never uses the phrase. And yet, the book calls forward to the ideas of common ground, and applies them to the lower-tempo work of engineering design. Engineering design, Bucciarelli claims, is a social process. While some design work is solitary, much of it takes place in social interactions, from formal meetings to informal hallway conversations.

But Bucciarelli does more than discover the ideas of common ground; he extends them. Klein et al. talk about the importance of an agreed-upon set of rules, and the need to establish interpredictability: for participants to communicate to each other what they’re going to do next. Bucciarelli talks about how engineering design work involves actually developing the rules, making constraints concrete that were initially uncertain. Instead of interpredictability, Bucciarelli talks about how engineers argue for specific interpretations of requirements based on their own interests. Put simply, where Klein et al. talk about establishing, sustaining, and repairing common ground, Bucciarelli talks about constructing, interpreting, and negotiating the design.

Bucciarelli’s book is fascinating because he reveals how messy and uncertain engineering work is, and how concepts that we may think of as fixed and explicit are actually plastic and ambiguous.

For example, we think of building codes as being precise, but when applied to new situations, they are ambiguous, and the engineers must make a judgment about how to apply them. Bucciarelli tells an anecdote about the design of an array of solar cells to mount on a roof. The building codes put limits on how much weight a roof can support, but the code only discusses distributed loads, and one of the proposed designs is based on four legs, which would be a concentrated load. An engineer and an architect negotiate over the type of design for the mounting: the engineer favors a solution that’s easier for the engineering company, but more work for the architect. The architect favors a solution that is more work and expense for the engineering company. The two must negotiate to reach an agreement on the design, and the relevant building code must be interpreted in this context.

Bucciarelli also observes that the performance requirements given to engineers are much less precise than you would expect, and so the engineers must construct more precise requirements as part of the design work. He gives the example of a company designing a cargo x-ray system for detecting contraband. The requirement is that it should be able to detect “ten pounds of explosive”. As the engineers prepare to test their prototype, a discussion ensues: what is an explosive? Is it a device with wires? A bag of plastic? The engineers must define what an explosive means, and that definition becomes a performance requirement.

Even technical terms that sound well-defined are ambiguous, and may be interpreted differently by different members of the engineering design team. The author witnesses a discussion of “module voltage” for a solar power generator. But the term can refer to open circuit voltage, maximum power voltage, operating voltage, or nominal voltage. It is only through social interactions that this ambiguity is resolved.

What Bucciarelli also notices in his study of engineers is that they do not themselves recognize the messy, social nature of design: they don’t see the work that they do establishing common ground as the design work. I mentioned this in a previous blog post. And that’s really a shame. Because if we don’t recognize these social interactions as design work, we won’t invest in making them better. To borrow a phrase from cognitive systems engineering, we should treat design work as work that’s done by a joint cognitive system.