As I’ve posted about previously, at my day job, I work on a project called Managed Delivery. When I first joined the team, I was a little horrified to learn that the service that powers Managed Delivery deploys itself using Managed Delivery.
“How dangerous!”, I thought. What if we push out a change that breaks Managed Delivery? How will we recover? However, after having been on the team for over a year now, I have a newfound appreciation for this approach.
Yes, sometimes there’s something that breaks, and that makes it harder to roll back, because Managed Delivery provides the main functionality for easy rollback. However, it also means that the team gets quite a bit of practice at bypassing Managed Delivery when something goes wrong. They know how to disable Managed Delivery and use the traditional Spinnaker UI to deploy an older version. They know how to poke and prod at the database if the Managed Delivery UI doesn’t respond properly.
These strange loop failure modes are real: if Managed Delivery breaks, we may lose out on the functionality of Managed Delivery to help us recover. But it also means that we’re more ready for handling the situation if something with Managed Delivery goes awry. Yes, Managed Delivery depends on itself, and that’s odd. But we have experience with how to handle things when this strange loop dependency creates a problem. And that is a valuable thing.
I’m really enjoying Turn the Ship Around!, a book by David Marquet about his experiences as commander of a nuclear submarine, the USS Santa Fe, and how he worked to improve its operational performance.
One of the changes that Marquet introduced is something he calls “thinking out loud”, where he encourages crew members to speak aloud their thoughts about things like intentions, expectations, and concerns. He notes that this approach contradicted naval best practices:
As naval officers, we stress formal communications and even have a book, the Interior Communications Manual, that specifies exactly how equipment, watch stations, and evolutions are spoken, written, and abbreviated …
This adherence to formal communications unfortunately crowds out the less formal but highly important contextual information needed for peak team performance. Words like “I think…” or “I am assuming…” or “It is likely…” that are not specific and concise orders get written up by inspection teams as examples of informal communications, a big no-no. But that is just the communication we need to make leader-leader work.
Turn the Ship Around! p103
This change did improve the ship operations, and this improvement was recognized by the Navy. Despite that, Marquet still got pushback for violating norms.
[E]ven though Santa Fe was performing at the top of the fleet, officers steeped in the leader-follower mind-set would criticize what they viewed as the informal communications on Santa Fe. If you limit all discussion to crisp orders and eliminate all contextual discussion, you get a pretty quiet control room. That was viewed as good. We cultivated the opposite approach and encouraged a constant buzz of discussions among the watch officers and crew. By monitoring that level of buzz, more than the actual content, I got a good gauge of how well the ship was running and whether everyone was sharing information.
Turn the Ship Around! p103
Reading this reminded me how local culture can be. I shouldn’t be surprised, though. At Netflix, I’ve worked on three teams (and six managers!) and each team had very different local cultures, despite all of them being in the same organization, Platform Engineering.
I used to wonder, “how does a large company like Google write software?” But I no longer think that’s a meaningful question. It’s not Google as an organization that writes software, it’s individual teams that do. The company provides the context that the teams work in, and the teams are constrained by various aspects of the organization, including the history of the technology they work on. But, there’s enormous cultural variation from one team to the next. And, as Marquet illustrates, you can change your local culture, even cutting against organizational “best practices”.
So, instead of asking, “what is it like to work at company X”, the question you really want answered is, “what is it like to work on team Y at company X?”
I often struggle to describe the project that I work on at my day job, even though it’s an open-source project that even has its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”. But that explanation does not yield much insight.
I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.
With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.
When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.
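To make the desired-state idea concrete, here’s a toy reconciliation function in Python. All of the names and the data shapes are invented for illustration; this is not the Kubernetes API, just a sketch of the compare-and-converge pattern.

```python
# Toy illustration of declarative desired state: a reconciler compares the
# desired state against the current state and emits the actions needed to
# converge them. Names and structure are invented, not the Kubernetes API.

def reconcile(desired: dict, current: dict) -> list:
    """Return the actions needed to bring current state to desired state."""
    actions = []
    for resource, spec in desired.items():
        if resource not in current:
            actions.append(f"create {resource} at {spec['image']}")
        elif current[resource]["image"] != spec["image"]:
            actions.append(f"update {resource} to {spec['image']}")
    for resource in current:
        if resource not in desired:
            actions.append(f"delete {resource}")
    return actions

# A deploy is just a change to the desired state: bump the image version
# and let the reconciler figure out what to do.
desired = {"web": {"image": "myapp:v24"}}
current = {"web": {"image": "myapp:v23"}}
print(reconcile(desired, current))
```

The point of the sketch is that the operator never issues an imperative “deploy” command; they edit the desired state, and the system derives the actions.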
A common pattern we see is that service owners have a notion of an environment (e.g., test, staging, prod). For example, maybe they’ll deploy the code to test, and maybe run some automated tests against it, and if it looks good, they’ll promote to staging, and maybe they’ll do some manual tests, and if they’re happy, they’ll promote out to prod.
Imagine test, staging, and prod all have version v23 of the code running in them. After version v24 is cut, it will first be deployed in test, then staging, then prod. That’s how each version will propagate through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, human makes a judgment).
You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments changes over time. And you can describe this pattern declaratively, rather than imperatively like you would with traditional pipelines.
And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.
With Managed Delivery, you can express concepts like:
for a code version to be promoted to the staging environment, it must
be successfully deployed to the test environment
pass a suite of end-to-end automated tests specified by the app owner
and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.
And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting change from one environment to the next.
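To make the promotion idea concrete, here’s a toy model in Python of environments with promotion constraints. The constraint names and the data structure are invented for illustration; this is not Managed Delivery’s actual configuration format.

```python
# Toy model of declarative environment promotion: each environment declares
# the constraints a code version must satisfy before it can be promoted
# there. Names and schema are invented, not Managed Delivery's real format.

ENVIRONMENTS = [
    {"name": "test",    "constraints": []},
    {"name": "staging", "constraints": ["deployed-to:test", "e2e-tests-pass"]},
    {"name": "prod",    "constraints": ["deployed-to:staging", "manual-approval"]},
]

def next_promotion(version, satisfied, deployed_in):
    """Return the first environment this version can be promoted to, or None."""
    for env in ENVIRONMENTS:
        if env["name"] in deployed_in:
            continue  # already running there
        if all(c in satisfied for c in env["constraints"]):
            return env["name"]
    return None

# v24 is running in test and its end-to-end tests have passed, so it is
# eligible for staging (but not prod, whose constraints are unmet):
print(next_promotion("v24", {"deployed-to:test", "e2e-tests-pass"}, {"test"}))
# → staging
```

A system like this “shepherds” a version along by repeatedly asking: given what this version has satisfied so far, where can it go next?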
Let me make one other observation about this that I think is important, which is that this occurred during startup. That is, once these processes get going, they work in a way that’s different than starting them up. So starting up the process requires a different set of activities than running it continuously. Once you have it running continuously, you can be pouring stuff in one end and getting it out the other, and everything runs smoothly in-between. But startup doesn’t, it doesn’t have things in it, so you have to prime all the pumps by doing a different set of operations.
I was attending the Resilience Engineering Association – Naturalistic Decision Making Symposium last month, and one of the talks was by a medical doctor (an anesthesiologist) who was talking about analyzing incidents in anesthesiology. I immediately thought of Dr. Richard Cook, who is also an anesthesiologist, who has been very active in the field of resilience engineering, and I wondered, “what is it with anesthesiology and resilience engineering?” And then it hit me: it’s about process control.
As software engineers in the field we call “tech”, we often discuss whether we are really engineers in the same sense that a civil engineer is. But, upon reflection, I think that’s the wrong question to ask. Instead, we should consider the fields where practitioners are responsible for controlling a dynamic process that’s too complex for humans to fully understand. This type of work spans spaceflight, aviation, maritime operations, chemical engineering, power generation (nuclear power in particular), anesthesiology, and, yes, operating software services in the cloud.
We all have displays to look at to tell us the current state of things, alerts that tell us something is going wrong, and knobs that we can fiddle with when we need to intervene in order to bring the process back into a healthy state. We all feel production pressure, are faced with ambiguity (is that blip really a problem?), are faced with high-pressure situations, and have to make consequential decisions under very high degrees of uncertainty.
Whether we are engineers or not doesn’t matter. We’re all operators doing our best to bring complex systems under our control. We face similar challenges, and we should recognize that. That is why I’m so fascinated by fields like cognitive systems engineering and resilience engineering. Because it’s so damned relevant to the kind of work that we do in the world of building and operating cloud services.
Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”
I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise of thinking out loud about possible explanations for this phenomenon.
It’s an illusion
It might be that config changes are not actually more dangerous; it just seems like they are. Perhaps we only notice the writeups that mention a config change and forget the ones that don’t. Or perhaps it’s a base rate illusion, where config changes show up in incidents more often than code changes simply because config changes are more common than code changes.
I don’t believe this hypothesis: I think the config change effect is a real one.
Tooling supports code changes better than config changes

Consider this excerpt from a Salesforce incident writeup:

For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.
If an organization has the ability to stage their changes across different domains, I’d wager heavily that they supported staged code deployments before they supported staged configuration changes. That’s certainly true at Netflix, where Spinnaker had support for regional rollout of code changes well before it had support for regional rollout of config changes.

This one feels like a real contributor to me. I’ve found that deployment tooling tends to support code changes better than config changes: there’s just more engineering effort put into making code changes safer.
Config changes are hard to stage
In the case of the Salesforce incident, the configuration change could theoretically have been staged. However, it may be that configuration changes by their nature are harder to roll out in a staged fashion. Configuration is more likely to be inherently global than code.
I’m really not sure about this one. I have no sense as to how many config changes can be staged.
Config changes are hard to test
Have you ever written a unit test for a configuration value? I haven’t. It might be that config-change related problems only manifest when deployed into a production environment, so you couldn’t catch them at a smaller scope like a unit test.
I suspect this hypothesis plays a significant role as well.
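That said, some config problems can be caught before production. Here’s a sketch of what a unit test over configuration values might look like; the config keys and validation rules are invented for illustration. Note that this only catches shape and range errors, which is consistent with the hypothesis above: a value that is well-formed but only breaks at production scale would sail right through.

```python
# A sketch of validating configuration values in a unit test. The config
# schema (replicas, timeout_ms) and the rules are invented for illustration.

def validate_config(config: dict) -> list:
    """Return a list of problems with the config; an empty list means valid."""
    problems = []
    if not (1 <= config.get("replicas", 0) <= 100):
        problems.append("replicas must be between 1 and 100")
    if config.get("timeout_ms", 0) <= 0:
        problems.append("timeout_ms must be positive")
    return problems

def test_production_config_is_valid():
    # This catches a fat-fingered value, but not a valid-looking value
    # that is only wrong in the context of production traffic.
    prod_config = {"replicas": 3, "timeout_ms": 5000}
    assert validate_config(prod_config) == []
```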
Mature systems are more config-driven
Perhaps the sort of systems that are involved in large-scale outages at big tech companies are the more mature, reliable systems. These are the types of software that have evolved over time to enable operators to control more of their behavior by specifying policy in configuration.
This means that an operator is more likely to be able to achieve a desired behavior change via config versus code. And that sounds like a good thing. We all know that hard-coding things is bad, and changing code is dangerous. In the limit, we wouldn’t have to make any code changes at all to achieve the desired system behavior.
So, perhaps the fact that config changes are more commonly implicated in large-scale outages is a sign of the maturity of the systems?
I have no idea about this one. It seems like a clever hypothesis, but perhaps it’s too clever.
We seldom have time for introspection at work. If we’re lucky, we have the opportunity to do some kind of retrospective at the end of a project or sprint. But, generally speaking, we’re too busy working to spend time examining that work.
One exception to this is incidents: organizations are willing to spend effort on introspection after an incident happens. That’s because incidents are unsettling: people feel uneasy that the system failed in a way they didn’t expect.
And so, an organization is willing to spend precious engineering cycles in order to rid itself of the uneasy feeling that comes with a system failing unexpectedly. Let’s make sure this never happens again.
Incident analysis, in the learning from incidents in software (LFI) sense, is about using an incident as an opportunity to get a better understanding of how the overall system works. It’s a kind of case study, where the case is the incident. The incident acts as a jumping-off point for an analyst to study an aspect of the system. Just like any other case study, it involves collecting and synthesizing data from multiple sources (e.g., interviews, chat transcripts, metrics, code commits).
I call it a guerrilla case study because, from the organization’s perspective, the goal is really to get closure, to have a sense that all is right with the world. People want to get to a place where the failure mode is now well-understood and measures will be put in place to prevent it from happening again. As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.
Ideally, organizations would recognize the value of this sort of work, and would make it explicit that the goal of incident analysis is to learn as much as possible. They’d also invest in other types of studies that look into how the overall system works. Alas, that isn’t the world we live in, so we have to sneak this sort of work in where we can.
One of my hobbies is learning Yiddish. Growing up Jewish in Montreal, I attended a parochial elementary school that taught Yiddish (along with French and Hebrew), but dropped it after that. A couple of years ago, I discovered a Yiddish teacher in my local area and I started taking classes for fun.
Our teacher recently introduced us to a Yiddish expression, hintish-kloog, which translates literally as “dog smartness”. It refers to a dog’s ability to sniff out and locate food in all sorts of places.
This made me think of the kind of skill required to solve operational problems during the moment. It’s a very different kind of skill than, say, constructing abstractions during software development. Instead, it’s more about employing a set of heuristics to try to diagnose the issue, hunting through our dashboards to look for useful signals. “Did something change recently? Are errors up? Is the database healthy?”
My teacher noted that many of the more religious Jews tend to look down on owning a dog, and so hintish-kloog is meant in a pejorative sense: this isn’t the kind of intelligence that is prized by scholars. This made me think about the historical difference in prestige between development and operations work, where skilled operations work is seen as a lower form of work than skilled development work.
I’m glad that this perception of operations is changing over time, and that more software engineers are doing the work of operating their own software. Dog smartness is a survival skill, and we need more of it.
Author’s note: I initially had the Yiddish wording incorrect, this post has been updated with the correct wording.
Making the rounds is the story of how Citi accidentally transferred $900 million to various hedge funds. Citi then asked the funds to reverse the mistaken transfer, and while some of the funds did, others said, “no, it’s ours, and we’re keeping it”, and Citi took them to court, and lost. The wonderful finance writer Matt Levine has the whole story. At the center of this is horrible UX associated with internal software; you can see screenshots in Levine’s writeup. As an aside, several folks on the Hacker News thread recognized the UI widgets as having been built with Oracle Forms.
However, this post isn’t about a particular internal software package with lousy UX. (There is no shortage of such software packages in the world; ask literally anyone who deals with internal software.)
Instead, I’m going to explore two questions:
How come we don’t hear about these sorts of accidental financial transactions more often?
How come financial organizations like Citibank don’t invest in improving internal software UX for reducing risk?
I’ve never worked in the financial industry, so I have no personal experience with this domain. But I suspect that accidental financial transactions, while rare, do happen from time to time. What I suspect happens most of the time is that the institution that initiated the accidental transaction reaches out and explains what happened, and the other institution transfers the money back.
As Levine points out, there’s no finders keepers rule in the U.S. I suspect that there aren’t any organizations that have a risk scenario with the summary “we accidentally transfer an enormous sum of money to an organization that is legally entitled to keep it”, because that almost never happens. This wasn’t a case of fraud. This was a weird edge case in the law where the money transferred was an accidental repayment of a loan in full, when Citi just meant to make an interest payment, and there’s a specific law about this scenario (in fact, Citi didn’t really want to make a payment at all, but they had to because of a technical issue).
Can you find any other time in the past where an institution accidentally transferred funds and the recipient was legally permitted to keep the money? If so, I’d love to hear it.
And, if it really is the case that these sorts of mistakes aren’t seen as a risk, then why would an organization like Citi invest in improving the usability of their internal tools? Heck, if you read the article, you’ll see that it was actually contractors who operated the software. It’s not like Citi would be more profitable if they were able to improve the usability of this software. “Who cares if it takes a contractor 10 minutes versus 30 minutes?” I can imagine an exec saying.
Don’t get me wrong: my day job is building internal tools, so I personally believe these tools add value. And I imagine that financial institutions invest in the tooling of their algorithmic traders, because correctness and development speed go directly to their bottom lines. But the folks operating the software that initiates these sorts of transactions? That’s just grunt work, nobody’s going to invest in improving those experiences.
In short, these systems don’t fall over all of the time because the systems aren’t just made up of horrible software. They’re made up of horrible software, and human beings who can exercise judgment when something goes wrong and compensate. Most of the time, that’s good enough.
In this context, I was thinking about an operational surprise that happened on my team a few months ago, so that I could use it as raw material to construct an oral story about it. But, as I reflected on it, and read my own (lengthy) writeup, I realized that there was one thing I didn’t fully understand about what happened.
During the operational surprise, when we attempted to remediate the problem by deploying a potential fix into production, we hit a latent bug that had been merged into the main branch ten days earlier. As I was re-reading the writeup, there was something I didn’t understand. How did it come to be that we went ten days without promoting that code from the main branch of our repo to the production environment?
To help me make sense of what happened, I drew a diagram of the development events that led up to the surprise. Fortunately, I had documented those events thoroughly in the original writeup. Here’s the diagram I created. I used this diagram to get some insight into how bug T2, which was merged into our repo on day 0, did not manifest in production until day 10.
This diagram will take some explanation, so bear with me.
There are four bugs in this story, denoted T1,T2, A1, A2. The letters indicate the functionality associated with the PR that introduced them:
T1 and T2 were both introduced in a pull request (PR) that refactored some functionality related to how our service interacts with Titus.
A1 and A2 were both introduced in a PR that added functionality around artifact metadata.
Note that bug T1 masked T2, and bug A1 masked A2.
There are three vertical lines, which show how the bugs propagated to different environments.
main (repo) represents code in the main branch of our repository.
staging represents code that has been deployed to our staging environment.
prod represents code that has been deployed to our production environment.
Here’s how the colors work:
gray indicates that the bug is present in an environment, but hasn’t been detected
red indicates that the effect of a bug has been observed in an environment. Note that if we detect a bug in the prod environment, that also tells us that the bug is in staging and the repo.
green indicates the bug has been fixed
If a horizontal line is red, that means there’s a known bug in that environment. For example, when we detect bug T1 in prod on day 1, all three lines go red, since we know we have a bug.
A horizontal line that is purple means that we’ve pinned to a specific version. We unpinned prod on day 10 before we deployed.
The thing I want to call out in this diagram is the color of the staging line. Once the staging line turns red on day 2, it only turns black on day 5, which is the Saturday of a long weekend, and then turns red again on the Monday of the long weekend. (Yes, some people were doing development on the Saturday and testing in staging on the Monday, even though it was a long weekend. We don’t commonly work on weekends; that’s a different part of the story.)
During this ten day period, there was only a brief time when staging was in a state we thought was good, and that was over a weekend. Since we don’t deploy on weekends unless prod is in a bad state, it makes sense that we never deployed from staging to prod until day 10.
The larger point I want to make here is that getting this type of insight from an operational surprise is hard, in the sense that it takes a lot of effort. Even though I put in the initial effort to capture the development activity leading up to the surprise when I first did the writeup, I didn’t gain the above insight until months later, when I tried to understand this particular aspect of it. I had to ask a certain question (how did that bug stay latent for so long), and then I had to take the raw materials of the writeup that I did, and then do some diagramming to visualize the pattern of activity so I could understand it. In retrospect, it was worth it. I got a lot more insight here than: “root cause: latent bug”.
Now I just need to figure out how to tell this as a story without the benefit of a diagram.