Some quick thoughts on Citi’s $900M OOPSie

Making the rounds is the story of how Citi accidentally transferred $900 million dollars to various hedge funds. Citi then asked the funds to reverse the mistaken transfer, and while some of the funds did, others said, “no, it’s ours, and we’re keeping it”, and Citi took them to court, and lost. The wonderful finance writer Matt Levine has the whole story. At this center of this is horrible UX associated with internal software, you can see screenshots in Levine’s writeup. As an aside, several folks on the Hacker News thread recognized the UI widgets as having been built with Oracle Forms.

However, this post isn’t about a particular internal software package with lousy UX. (There is no shortage of such software packages in the world, ask literally anyone who deals with internal software).

Instead, I’m going to explore two questions:

How come we don’t hear about these sorts of accidental financial transactions more often?
How come financial organizations like Citibank don’t invest in improving internal software UX for reducing risk?

I’ve never worked in the financial industry, so I have no personal experience with this domain. But I suspect that accidental financial transactions, while rare, do happen from time to time. But what I suspect happens most of the time is that the institution that initiated the accidental transaction reaches out and explains what happens to the other institution, and they transfer the money back.

As Levine points out, there’s no finders keepers rule in the U.S. I suspect that there aren’t any organizations that have a risk scenario with the summary, “we accidentally transfer an enormous sum of money to an organization that is legally entitled to keep it.” because that almost never happens. This wasn’t a case of fraud. This was a weird edge case in the law where the money transferred was an accidental repayment of a loan in full, when Citi just meant to make an interest payment, and there’s a specific law about this scenario (in fact, Citi didn’t really want to make a payment at all, but they had to because of a technical issue).

Can you find any other time in the past where an institution accidentally transferred funds and the recipient was legally permitted to keep the money? If so, I’d love to hear it.

And, if it really is the case that these sorts of mistakes aren’t seen as a risk, then why would an organization like Citi invest in improving the usability of their internal tools? Heck, if you read the article, you’ll see that it was actually contractors that operate the software. It’s not like Citi would be more profitable if they were able to improve the usability of this software. “Who cares if it takes a contractor 10 minutes versus 30 minutes?” I can imagine an exec saying.

Don’t get me wrong: my day job is building internal tools, so I personally believe these tools add value. And I imagine that financial institutions invest in the tooling of their algorithmic traders, because correctness and development speed go directly to their bottom lines. But the folks operating the software that initiates these sorts of transactions? That’s just grunt work, nobody’s going to invest in improving those experiences.

In short, these systems don’t fall over all of the time because the systems aren’t just made up of horrible software. They’re made up of horrible software, and human beings who can exercise judgment when something goes wrong and compensate. Most of the time, that’s good enough.

Slack’s Jan 2021 outage: a tale of saturation

Laura Nolan of Slack recently published an excellent write-up of their Jan. 4, 2021 outage on Slack’s engineering blog.

One of the things that struck me about this writeup is the contributing factors that aren’t part of this outage. There’s nothing about a bug that somehow made its way into a production, or an accidentally incorrect configuration change, or how some corrupt data ended up in the database. On the other hand, it’s an outage story with multiple examples of saturation.

Saturation is a phrase often used by the safety science researcher David Woods: it refers to a system that is reaching the limit of what it can handle. If you’ve done software operations work, I bet you’ve encountered resource exhaustion, which is an example of saturation.

Saturation plays a big role in Woods’s model of the adaptive universe. In particular, in socio-technical systems, people will adapt in order to reduce the risk of saturation. In this post, I’m going to walk Laura’s write-up, highlighting all of the examples of saturation and how the system adopted to it. I’m going purely from the text of the original write-up, which means I’ll likely get some things wrong here.

Slack runs their infrastructure on AWS. In the beginning, they (like, I presume, all small companies) started with a single AWS account. And, initially, this worked out well.

In the beginning, Slack’s AWS footprint fit nicely into one account and VPC: that’s one happy cloud!

However, as Slack grew, it encountered it problems. Here’s a quote from Slack’s blog post Building the Next Evolution of Cloud Networks at Slack by Archie Gunasekara:

As our customer base grew and the tool evolved, we developed more services and built more infrastructure as needed. However, everything we built still lived in one big AWS account. This is when our troubles started. Having all our infrastructure in a single AWS account led to AWS rate-limiting issues, cost-separation issues, and general confusion for our internal engineering service teams.

That cloud isn’t looking so happy anymore

The above quote makes reference to three different categories of saturation. The first is a traditional sort of limit we software folks think of: they were running into AWS rate limits associated with an individual AWS account.

The other two limits are cognitive: the system made it harder for humans to deal with separating out costs and, it led to confusion for internal teams. I still see these as a form of saturation: as a system gets more difficult for humans to deal with, it effectively increases the cost of using the system, and it makes errors more likely.

And so, the Slack Cloud Engineering team adapted to meet this saturation risk by adopting AWS child accounts. From the linked blog post again:

Now the service teams could request their own AWS accounts and could even peer their VPCs with each other when services needed to talk to other services that lived in a different AWS account.

With continued growth, they eventually reached saturation again. Once again, this was the “it’s getting too hard” sort of saturation:

Having hundreds of AWS accounts became a nightmare to manage when it came to CIDR ranges and IP spaces, because the mis-management of CIDR ranges meant that we couldn’t peer VPCs with overlapping CIDR ranges. This led to a lot of administrative overhead.

VPC peering with multiple VPCs increases the cognitive load on the networking engineers

To deal with this risk of saturation, the cloud engineering team adapted again. They reached for new capabilities: AWS shared VPCs and AWS Transit Gateway Inter-Region Peering. By leveraging these technologies, they were able to design a network architecture that addressed their problems:

This solved our earlier issue of constantly hitting AWS rate limits due to having all our resources in one AWS account. This approach seemed really attractive to our Cloud Engineering team, as we could manage the IP space, build VPCs, and share them with our child account owners. Then, without having to worry about managing any of the overhead of setting up VPCs, route tables, or network access lists, teams were able to utilize these VPCs and build their resources on top of them.

Example of a transit gateway connecting multiple VPCs. Obviously, this does not accurately depict Slack’s actual network architecture.

Fast forward several months later. From Laura Nolan’s post:

On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight. Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough.

Traffic surge overwhelms the transit gateway

This is as clear an example of saturation as you can get: the incoming load increased faster than the transit gateways were able to cope. What’s really fascinating from this point on is the role that saturation plays in interactions with the rest of the system.

As too many of us know, clients experience a saturated network as an increase in latency. When network latency goes up, the threads in a service spend more of their time sitting there waiting for the bits to come over the network, which means that CPU utilization goes down.

Slack’s web tier autoscales on CPU utilization, so when the network started dropping packets, the instances in the web tier spent more of their time blocked, and CPU went down, which triggered the AWS autoscaler to downscale the web tier.

A server’s CPU utilization falls as it waits for packets that are being dropped by the overloaded transit gateway

However, the web tier has a scaling policy that rapidly upscales if thread utilization gets too high. (At Netflix, we use the term hammer rule to describe these type of emergency scale-up rule).

Too many threads are in use, waiting for those packets. Time for the scale-up hammer!

Once the new instances come online, an internal Slack service named provision-service is responsible for setting up these new instances so that they can serve traffic. And here, we see more saturation issues (emphasis mine).

Provision-service needs to talk to other internal Slack systems and to some AWS APIs. It was communicating with those dependencies over the same degraded network, and like most of Slack’s systems at the time, it was seeing longer connection and response times, and therefore was using more system resources than usual. The spike of load from the simultaneous provisioning of so many instances under suboptimal network conditions meant that provision-service hit two separate resource bottlenecks (the most significant one was the Linux open files limit, but we also exceeded an AWS quota limit).

While we were repairing provision-service, we were still under-capacity for our web tier because the scale-up was not working as expected. We had created a large number of instances, but most of them were not fully provisioned and were not serving. The large number of broken instances caused us to hit our pre-configured autoscaling-group size limits, which determine the maximum number of instances in our web tier.

The provision service is saturated, as is the autoscaling group, although many of the instances aren’t serving traffic because they aren’t fully provisioned.

Through a combination of robustness mechanisms (load balancer panic mode, retries, circuit breakers) and the actions of human operators, the system is restored to health.

As operators, we strive to keep our systems far from the point of saturation. As a consequence, we generally don’t have much experience with how the system behaves as it approaches saturation. And that makes these incidents much harder to deal with.

Making things worse, we can’t ever escape the risk of saturation. Often we won’t know that a limit exists until the system breaches it.

Error budgets and the legacy of Herbert Heinrich

Here’s a question that all of us software developers face: How can we best use our knowledge about the past behavior of our system to figure out where we should be investing our time?

One approach is to use a technique from the SRE world called error budgets. Here are a few quotes from the How to Use Error Budgets chapter of Alex Hidalgo’s book: Implementing Service Level Objectives:

Measuring error budgets over time can give you great insight into the risk factors that impact your service, both in terms of frequency and severity. By knowing what kinds of events and failures are bad enough to burn your error budget, even if just momentarily, you can better discover what factors cause you the most problems over time. p71 [emphasis mine]

The basic idea is straightforward. If you have error budget remaining, ship new features and push to production as often as you’d like; once you run out of it, stop pushing feature changes and focus on relaiability instead. p87

Error budgets give you ways to make decisions about your service, be it a single microservice or your company’s entire customer-facing product. They also give you indicators that tell you when you can ship features, what your focus should be, when you can experiment, and what your biggest risk factors are. p92

The goal is not to only react when your users are extremely unhappy with you—it’s to have better data to discuss where work regarding your service should be moving next. p354

That sounds reasonable, doesn’t it? Look at what’s causing your system to break, and if it’s breaking too often, use that as a signal to address those issues that are breaking it. If you’ve been doing really well reliability-wise, an error budget gives you margin to do some riskier experimentation in production like chaos engineering or production load testing.

I have two issues with this approach, a smaller one and a larger one. I’ll start with the smaller one.

First, I think that if you work on a team where the developers operate their own code (you-build-it, you-run-it), and where the developers have enough autonomy to say, “We need to focus more development effort on increasing robustness”, then you don’t need the error budget approach to help you decide when and where to spend your engineering effort. The engineers will know where the recurring problems are because they feel the operational pain, and they will be able to advocate for addressing those pain points. This is the kind of environment that I am fortunate enough to work in.

I understand that there are environments where the developers and the operators are separate populations, or the developers aren’t granted enough autonomy to be able to influence where engineering time is spent, and that in those environments, an error budget approach would help. But I don’t swim in those waters, so I won’t say any more about those contexts.

To explain my second concern, I need to digress a little bit to talk about Herbert Heinrich.

Herbert Heinrich worked for the Travelers Insurance Company in the first half of the twentieth century. In the 1920s, he did a study of workplace accidents, examining thousands of claims made by companies that held insurance policies with Travelers. In 1931, he published his findings in a book: Industrial Accident Prevention: A Scientific Approach.

Heinrich’s work showed a relationship between the rates of near misses (no injury), minor injuries, and major injuries. Specifically: for every major injury, there are 29 minor injuries, and 300 no-injury accidents. This finding of 1:29:300 became known as the accident triangle.

My reproduction of Heinrich’s accident pyramid. To see the original, check out The Heinrich/Bird safety pyramid: Pioneering research has become a safety myth at risk-engineering.org.

One implication of the accident triangle is that the rate of minor issues gives us insight into the rate of major issues. In particular, if we reduce the rate of minor issues, we reduce the risk of major ones. Or, as Heinrich put it: Moral—prevent the accidents and the injuries will take care of themselves.

Heinrich’s work has since been criticized, and subsequent research has contradicted Heinrich’s findings. I won’t repeat the criticisms here (see Foundations of Safety Science by Sidney Dekker for details), but I will cite counterexamples mentioned in Dekker’s book:

The Deepwater Horizon offshore drilling rig saw six years of injury-free and incident-free performance before the explosion in 2010. (It even won a SAFE award from the U.S. Minerals Management Service in 2008 for its perfect safety record!)

Arnold Barnett and Alexander Wang found a negative correlation between nonfatal accident/incident rates and passenger-mortality risk among air carriers. That is, carriers that had more non-fatal incidents had a lower risk of fatalities. (Passenger-mortality Risk Estimates Provide Perspectives About Airline Safety, Flight Safety Digest, April 2000).

Antti Saloniemi and Hanna Oksanen found a negative correlation between incident rate and fatalities in the construction industry in Finland (Accidents and fatal accidents—some paradoxes, Safety Science, Volume 29, Issue 1, June 1998).

Fred Sherratt and Andrew Dainty found that construction companies in the UK that had an explicit policy of zero accidents saw more major injuries and fatal accidents than companies that did not have a zero accident policy (UK construction safety: a zero paradox?, Policy and Practice in Health and Safety, Volume 15, Issue 2, 2017).

So, what does any of this have to do with error budgets? At a glance, error budgets don’t seem related to Heinrich’s work at all. Heinrich was focused on safety, where the goal is to reduce injuries as much as possible, in some cases explicitly having a zero goal. Error budgets are explicitly not about achieving zero downtime (100% reliability), they’re about achieving a target that’s below 100%.

Here are the claims I’m going to make:

Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
Error budgets don’t help reduce risk of large incidents.

Here’s Heinrich’s triangle redrawn:

An error-budget-based approach only provides information on the nature of minor incidents, because those are the ones that happen most often. Near misses don’t impact the reliability metrics, and major incidents blow them out of the water.

Heinrich’s work assumed a fixed ratio between minor accidents and major ones: reduce the rate of minor accidents and you’d reduce the rate of major ones. By focusing on reliability metrics as a primary signal for providing insight into system risk, you only get information about these minor incidents. But, if there’s no relationship between minor incidents and major ones, then maintaining a specific reliability level doesn’t address the issues around major incidents at all.

An error-budget-based approach to reliability implicitly assumes there is a connection between reliability metrics and the risk of a large incident. This is the thread that connects to Heinrich: the unstated idea that doing the robustness work to address the problems exposed by the smaller incidents will decrease the risk of the larger incidents.

In general, I’m skeptical about relying on predefined metrics, such as reliability, for getting insight into the risks of the system that could lead to big incidents. Instead, I prefer to focus on signals, which are not predefined metrics but rather some kind of information that has caught your attention that suggests that there’s some aspect of your system that you should dig into a little more. Maybe it’s a near-miss situation where there was no customer impact at all, or maybe it was an offhand remark made by someone in Slack. Signals by themselves don’t provide enough information to tell you where unseen risks are. Instead, they act as clues that can help you figure out where to dig to get more details. This is what the Learning from Incidents in Software movement is about.

I’m generally skeptical of metrics-based approaches, like error budgets, because they reify. The things that get measured are the things that get attention. I prefer to rely on qualitative approaches that leverage the experiment judgment of engineers. The challenge with qualitative approaches is that you need to expose the experts to the information they need (e.g., putting the software engineers on-call), and they need the space to dig into signals (e.g., allow time for incident analysis).

Making sense of what happened is hard

Scott Nasello recently introduced me to Dr. Hannah Harvey’s The Art of Storytelling. I’m about halfway through her course, and I absolutely love it, and I keep thinking about it in the context of learning from incidents. While I have long been an advocate of using narrative structure when writing up incidents, Harvey’s course focuses on oral storytelling, which is a very different sort of format.

In this context, I was thinking about an operational surprise that happened on my team a few months ago, so that I could use it as raw material to construct an oral story about it. But, as I reflected on it, and read my own (lengthy) writeup, I realized that there was one thing I didn’t fully understand about what happened.

During the operational surprise, when we attempted to remediate the problem by deploying a potential fix into production, we hit a latent bug that had been merged into the main branch ten days earlier. As i was re-reading the writeup, there was something I didn’t understand. How did it come to be that we went ten days without promoting that code from the main branch of our repo to the production environment?

To help me make sense of what happened, I drew a diagram of the development events that lead up to the surprise. Fortunately, I had documented those events thoroughly in the original writeup. Here’s the diagram I created. I used this diagram to get some insight into how bug T2, which was merged into our repo on day 0, did not manifest in production until day 10.

This diagram will take some explanation, so bear with me.

There are four bugs in this story, denoted T1,T2, A1, A2. The letters indicate the functionality associated with the PR that introduced them:

T1, T2 were both introduced in a pull request (PR) related to refactoring of some functionality related to how our service interacts with Titus.
A1, A2 were both introduced in a PR related to adding functionality around artifact metadata.

Note that bug T1 masked T2, and bug A1 masked A2.

There are three vertical lines, which show how the bugs propagated to different environments.

main (repo) represents code in the main branch of our repository.
staging represents code that has been deployed to our staging environment.
prod represents code that has been deployed to our production environment.

Here’s how the colors work:

gray indicates that the bug is present in an environment, but hasn’t been detected
red indicates that the effect of a bug has been observed in an environment. Note that if we detect a bug in the prod environment, that also tells us that the bug is in staging and the repo.
green indicates the bug has been fixed

If a horizontal line is red, that means there’s a known bug in that environment. For example, when we detect bug T1 in prod on day 1, all three lines go red, since we know we have a bug.

A horizontal line that is purple means that we’ve pinned to a specific version. We unpinned prod on day 10 before we deployed.

The thing I want to call out in this diagram is the color in the staging line. once the staging line turns red on day 2, it only turns black on day 5, which is the Saturday of a long weekend, and then turns red again on the Monday of the long weekend. (Yes, some people were doing development on the Saturday and testing in staging on Monday, even though it was a long weekend. We don’t commonly work on weekends, that’s a different part of the story).

During this ten day period, there was only a brief time when staging was in a state we thought was good, and that was over a weekend. Since we don’t deploy on weekends unless prod is in a bad state, it makes sense that we never deployed from staging to prod until day 10.

The larger point I want to make here is that getting this type of insight from an operational surprise is hard, in the sense that it takes a lot of effort. Even though I put in the initial effort to capture the development activity leading up to the surprise when I first did the writeup, I didn’t gain the above insight until months later, when I tried to understand this particular aspect of it. I had to ask a certain question (how did that bug stay latent for so long), and then I had to take the raw materials of the writeup that I did, and then do some diagramming to visualize the pattern of activity so I could understand it. In retrospect, it was worth it. I got a lot more insight here than: “root cause: latent bug”.

Now I just need to figure out how to tell this as a story without the benefit of a diagram.

Conscription devices, boundary objects, and GDocs

In the paper Flexible Sketches and Inflexible Data Bases: Visual Communication, Conscription Devices, and Boundary Objects in Design Engineering, Kathryn Henderson writes about the role that sketches and drawings play in the work of mechanical engineers.

A conscription device is something that can be used to help recruit other people to get involved in a task: mechanical engineers collaborate using diagrams. These diagrams play such a strong role that the engineers find that they can’t work effectively without them. From the paper:

If a visual representation is not brought to a meeting of those involved with the design, someone will sketch a facsimile on a white board (present in all engineering conference rooms) when communication begins to falter, or a team member will leave the meeting to fetch the crucial drawings so group members will be able to understand one another.

A boundary object is an artifact that can be consumed by different stakeholders, who use the artifact for different purposes. Henderson uses the example of the depiction of a welded joint in a drawing, which has different meanings for the designer (support structure) than it does for someone working in the shop (labor required to do the weld). A shop worker might see the drawing and suggest a change that would save welds (and hence labor):

Detail renderings are one of the tightly focused portions that make up the more flexible whole of a drawing set. For example, the depiction of a welded joint may stand for part of the support structure to the designer and stand for labor expended to those in the shop.If the designer consults with workers who suggest a formation that will save welds and then incorporates the advice, collective knowledge is captured in the design. One small part of the welders’ tacit knowledge comes to be represented visually in the drawing. Hence the flexibility of the sketch or drawing as a boundary object helps in enlisting the aid and knowledge of additional participants.

Because we software engineers don’t work in a visual medium, we don’t work from visual representations the way that mechanical engineers do. However, we still have a need to engage with other engineers to work with us, and we need to communicate with different stakeholders about the software that we build.

A few months ago, I wrote up a Google doc with a spec for some proposed new functionality for a system that I work on. It included scenario descriptions that illustrated how a user would interact with the system. I shared the doc out, and got a lot of feedback, some of it from potential users of the system who were looking for additional scenarios, and some from adjacent teams who were concerned about the potential misuse of the feature for unintended purposes.

This sort of Google doc does function like a conscription device and boundary object. Google makes it easy to add comments to a doc. Yes, comments don’t scale up well, but the ease of creating a comment makes Google docs effective as potential conscription devices. If you share the doc out, and comments are enabled, people will comment.

I also found that writing out scenarios, little narrative descriptions of people interacting with the system, made it easier for people to envision what using the system will be like, and so I consequently got feedback from different types of stakeholders.

My point here is not that scenarios written in Google docs are like mechanical engineering drawings: those are very different kinds of artifacts that play different roles. Rather, the point is that properties of an artifact can affect how people collaborate to get engineering work done. You probably don’t think of a Google doc as a software engineering tool. But it can be an extremely powerful one.

Coding as a tool of thought

With apologies to Ken Iverson.

Architects draw detailed plans before a brick is laid or a nail is hammered. Programmers and software engineers don’t. Can this be why houses seldom collapse and programs often crash?
Blueprints help architects ensure that what they are planning to build will work. “Working” means more than not collapsing; it means serving the required purpose. Architects and their clients use blueprints to understand what they are going to build before they start building it.
But few programmers write even a rough sketch of what their programs will do before they start coding.
Leslie Lamport, Why We Should Build Software Like We Build Houses

“My instinct is to go right to the board. I’m very graphic oriented. I can’t talk more than ten minutes without I [sic] start drawing pictures when we’re talking about the things I do. Even if I’m talking sports, I invariably start diagramming what’s going on. I feel comfortable with it or find it very effective.”
This designer, like the newly promoted engineer at Selco who fought to get her drafting board back, is pointing out his dependence on the visual process, of drawing both to communicate and to think out the initial design. He also states that the visual and manual thought process of drawing precedes the formulation of the written specifications for the project. Like the Selco designers and those at other sites, he emphasized the importance of drawing processes to work out ideas. [emphasis added]
Kathryn Henderson, On Line and On Paper: Visual Representations, Visual Culture, and Computer Graphics in Design Engineering

The quote above by Kathryn Henderson illustrates how mechanical engineers use the act of drawing to help them work on the design problem. By generating sketches and drawings, they develop a better understanding of the problem they trying to solve. They use drawing as a tool to help them think, to work through the problem.

As software engineers, we don’t work in a visual medium in the way that mechanical engineers do. And yet, we also use tools to help us think through the problem. It just so happens that the tool we use is code. I can’t speak for other developers, but I certainly use the process of writing code to develop a deeper understanding of the problem I’m trying to solve. As I solve parts of the problem with code, my mental model of the problem space and solution space develops throughout the process.

I think we use coding this way (I certainly do), because it feels to me like the fastest way to evolve this knowledge. If I had some sort of equivalent mechanism for sketching that was faster than coding for developing my understanding, I’d use it. But I know of no mechanism that’s actually faster than coding that will let me develop my understanding of the solution I’m working on. It just so happens that the quickest solution, code, is the same medium as the artifact that will ultimately end up in production. A mechanical engineer can never ship their sketches, but we can ship our code.

And this is a point that I think Leslie Lamport misses. I’m personally familiar with a number of different techniques for modeling in software, including TLA+, Alloy, statecharts, and decision tables. I’ve used them all, they are excellent tools for reasoning about the complexity of the system. But none of these tools really fulfill the role that sketching does for mechanical engineers (although Alloy’s fast feedback for incremental model building is a nice step in this direction).

TLA+ in particular is a wonderful tool. I’ve used it for things like understanding the linearizability paper, beating the CAP theorem, finding cycles in linked lists, solving the river crossing problem, and proving leftpad. But if you’re looking for an analogy in mechanical engineering, TLA+ is much closer to finite element analysis than it is to sketches.

Developers jump to coding not because they are sloppy, but because they have found it to be the most effective tool for sketching, for thinking about the problem and getting quick feedback as they construct their solution. And constructing a representation to develop a better understanding using the best tools they have available for the job to get quick feedback is what engineers do.

Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.

And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.

The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.

However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!

So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, to do this would require a huge code change because an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script to relink intermediate object files that avoids the need to change the code. And it works!

But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.

For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on wifi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!

There are further workarounds they can do to save space, like replace structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit: and the risk of catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.

The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment. They remove the old version of this app. They call this the “Yolo” release strategy.

They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this will decision will have a dollar of eight figures. With only a week to go before release, they drop iOS 8 and eat the cost to come get under the cellular download limit.

The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the project slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.

And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective C.

An Uber engineer in the Amsterdam office comes up with an innovative work around: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.

And yet, the risk of decompensation is ever-present: the strike team worries about their space saving wins will not be able to keep pace with the growth of the applications.

Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.

The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

Alright folks, gather round and let me tell you the story of (almost) the biggest engineering disaster I’ve ever had the misfortune of being involved in. It’s a tale of politics, architecture and the sunk cost fallacy [I’m drinking an Aberlour Cask Strength Single Malt Scotch] https://t.co/KWnZKIpycp
— McLaren Stanley (@StanTwinB) December 10, 2020

I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.

I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:

Software as a limited medium

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.
Fred Brooks, The Mythical Man-Month

We software engineers don’t work in a physical medium the way, say, civil, mechanical, electrical, or chemical engineers do. Yes, our software does run on physical machines, and we are not exempt from dealing with limits. But, as captured in that Fred Brooks quote above, there’s a sense in which we software folk feel that we are working in a medium that is limited only by our own minds, by the complexity of these ethereal artifacts we create. When a software system behaves in an unexpected way, we consider it a design flaw: the engineer was not sufficiently smart.

And, yet, contra Brooks, software is a limited medium. Let’s look at two areas where that’s the case.

Software is discrete in a way that the world isn’t

We persist our data in databases that have schemas, which force us to slice up our information in ways that we can represent. But the real world is not so amenable to this type of slicing: it’s a messy place. The mismatch between the messiness of the real world and the structured nature of software data representations results in a medium that is not well-suited to model the way humans treat concepts such as names or time.

Software as a medium, and data storage in particular, encourages over-simplification of the world, because we need to categorize our data, figure out which tables to store it in and what values those columns should have, and so many items in the world just aren’t easy to model well like that.

As an example, consider a common question in my domain, software deployment: is a cluster up? We have to make a decision about that, and yet the answer is often “it depends: why do you want to know”? But that’s not what software as a medium encourages. Instead, we pick a definition of “up”, implement it, and then hope that it meets most needs, knowing it won’t. We can come up with other definitions for other circumstances, but we can’t be comprehensive, and we can’t be flexible. We have to bake in those assumptions.

And so, just like all engineers, given our time and resource constraints, we have to make over-simplifications to get our work done. William Kent wrote a whole book on this topic called Data and Reality: A Timeless Perspective on Perceiving and Managing Information in Our Imprecise World (h/t Hillel Wayne).

Software systems are limited in how they integrate inputs

In the book Problem Frames, Michael Jackson describes several examples of software problems. One of them is a system for counting how many cars pass by on a street. The inputs are two sensors that emit a signal when the cars drive over them. Those two sensors provide a lot less input than a human would have sitting by the side of the road and counting the cars go by.

As humans, when we need to make decisions, we can flexibly integrate a lot of different information signals. If I’m talking to you, for example, I can listen to what you’re saying, and I can also read the expressions on your face. I can make judgments based on how you worded your Slack message, and based on how well I already know you. I can use all of that different information to build a mental model of your actual internal state. Software isn’t like that: we have to hard-code, in advance, the different inputs that the software system will use to make decisions. Software as a medium is inherently limited in modeling external systems that it interacts with.

A couple of months ago, I wrote a blog post titled programming means never getting to say “it depends”, where I used the example of an alerting system: when do you alert a human operator of a potential problem? As humans, we can develop mental models of the human operator: “does the operator already know about X? Wait, I see that they are engaged based on their Slack messages, so I don’t need to alert them, they’re already on it.”

Good luck building an alerting system that constructs a model of the internal state of a human operator! Software just isn’t amenable to incorporating all of the possible signals we might get from a system.

Recognizing the limits of software

The lesson here is that there are limits to how well software system can actually perform, given the limits of software. It’s not simply a matter of managing complexity or avoiding design flaws: yes, we can always build more complex schemas to handle more cases, and build our systems to incorporate large input sets, but this is the equivalent of adding epicycles. Incorrect categorizations and incorrect automated decisions are inevitable, no matter how complex our systems become. They are inherent to the nature of software systems. We’re always going to need to have humans-in-the-loop to make up for these sorts of shortcomings.

The goal is not to build better software systems, but how to build better joint cognitive systems that are made up of humans and software together.

Top-down code reviews

I’ve long been frustrated by the task of code reviews. Often, the pull request (PR) I’m reviewing involves a part of the codebase I’m not intimately familiar with. I read it, not quite understanding it, looking to see if I can offer some sort of useful feedback, and typically that feedback would be on the micro level (e.g., “you can simplify this function by calling this other library function instead”).

I recently started experimenting with a new review approach that I’m going to call top down code review. Here’s how it works: I start by understanding the code well enough so that I can write my own version of the pull request message, describing the PR in my own words. After I’ve done this, then I provide feedback.

I call this approach “top down” because the review that I end up generating starts with a “top down” description of the PR: the problem it’s trying to solve, and the solution approach, before diving into describing notable implementation details. Here are the reviews I’ve done in this style so far:

I’ve been finding this approach useful because it forces me to come to terms with how well I really understand the PR. If I can’t explain the PR in my own words, then I don’t really understand it. It also helps me figure out what questions to ask the original author to help clarify things for me.

I also get more of a sense of closure after doing the review. Even if I had no feedback to give, I understand the changes in a way that I didn’t before.

Why you should write up your own incident

You shouldn’t write up your own incident if you can avoid it. To write up an incident well, you need to be able to capture the perspectives of the different people who were involved. If the write-up author was also one of the responders, then the writeup will be biased towards their perspective, at the expense of capturing the perspectives of the other engineers who were engaged.

Unfortunately, most organizations haven’t committed the resources to support doing independent incident investigations. I happen to privileged enough to work at a company that has hired specialists who are skilled at doing independent incident investigations (J. Paul Reed and Jessica DeVita), Once upon a time (last year, to be precise), I was one of those independent incident investigators, before I transitioned back to being a software engineer.

However, even at my employer, we don’t have the resources to do an independent investigation for every single operational surprise that happens, and so the common case is still that a team has to investigate its own operational surprises.

Recently, I was one of the responders to one of these operational surprises. And, since I’m an advocate of teams putting in the effort to write up their operational surprises and share them with the org, I committed to doing that for my team.

During the operational surprise, we identified that certain database rows weren’t being updated, but we struggled to identify why they weren’t being updated. In the moment, We suspected the problem was somehow related to this function, which is responsible for updating those database rows. We believed (correctly, in hindsight) that the function was being called, because a log statement immediately preceding that function invocation appeared in the logs. But, somehow, the database updates weren’t taking effect.

In the moment, I was looking into whether there was something about the database itself that was preventing writes: perhaps some sort of database lock that was blocking updates? To investigate that, I manually translated the code from jOOQ library calls to raw SQL so I could run the queries directly against the database and see what happened.

In the end, it turned out that the problem was not related to the database itself, but to Kotlin code inside that function that was throwing an exception. It was erroring because the code made certain assumptions about the format of version strings, and those assumptions had become invalid over time. When this code hit a version string it couldn’t process, it threw an exception and triggered a transaction rollback.

After we remediated, when I looked back on the events of the day, I thought “Boy, I sure did waste a big chunk of time manually translating that code to SQL, when the problem wasn’t related to the database at all.“

Later on, when I put my incident investigator hat on and pored over the Slack messages, I discovered something. While I was working to understand the code to translate it, I discovered that one of the queries in that function was too broad. Under normal circumstances, the broadness of the query wasn’t impacting the correctness of the function (the query after it was narrower) or the performance, but during the operational surprise it was increasing the blast radius of the issue. Narrowing the scope of that query was an important part of remediating the incident.

The thing is, until I was investigating the incident, I didn’t realize that I had learned about the broad query issue because I was working to translate the code into SQL. That work I did had real value: it helped us resolve the issue.

Ever since I’ve been bitten by the learning from incidents bug, I’ve been a believer in the value of using an independent investigator. But this is the first time I really had this first-hand experience of I learned something new about my own work in resolving the incident, even though I was there, because of the post-incident investigation work. It was quite a visceral realization.

And so, while you really should take advantage of independent investigators if resources permit, if you’ve worked as an independent investigator and then transition to a role which includes incident response, I recommend trying to write up one of your own incidents, at least once. It really reinforces how much more can be learned from an incident by doing a good investigation.