It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things.
And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the re-architecture, they reach for a new technology to rewrite the iOS client in: the Swift language.
The new architecture experiment is deemed a success, and is rolled out to the entire company. A florescence ensues in the organization, as teams excitedly migrate to the new architecture and experience a boost to their development productivity.
However, as development against the new architecture ramps up, anomalies related to Swift begin to emerge. Because of implementation details in the Swift linker, Apple recommends limiting the number of shared libraries to six: Uber has ninety-two, and the number is growing. The linker is saturated, and as a result, app startup is extremely slow. It takes eight to twelve seconds (!) to start up the app. The rewrite was supposed to yield a faster iOS app, and it’s slower than the previous version!
So the engineers adapt. They discover they can work around the problem by putting all of the code in the main executable instead of linking it via libraries, eliminating the startup delay. Unfortunately, doing this would require a huge code change because of an implementation detail of Swift, but they find another workaround: an enterprising engineer writes a custom script to relink intermediate object files that avoids the need to change the code. And it works!
But they encounter another anomaly: the Swift-based iOS app binary is big… too big. It’s so big that they’re running into the Apple cellular download limit.
For users who want to download the Uber app to their iPhones over the cellular network, Apple places a hard limit of 100 MB on the size of the download: any bigger, and the phone won’t let you download it unless you’re on Wi-Fi. Once again, the Uber engineers are hitting a saturation point, only now the limit is space instead of time. To add insult to injury, their workaround to deal with the startup time problem exacerbated the size problem!
There are further workarounds they can do to save space, like replacing structs with classes. But it isn’t enough. The data scientists run an experiment to estimate the cost to the organization of the app breaching the cellular download limit, and the risk is catastrophic. It turns out that many people download the app for the first time on a cellular network. The estimated cost to the business is orders of magnitude more than the cost of the rewrite.
The engineers have to make some hard choices. Their original plan was to bundle the old and new versions of the app in the same app bundle, so that they could do a slow rollout to reduce the blast radius if there was a problem with the new version. They are facing a goal conflict, and so they make a sacrifice judgment. They remove the old version of the app. They call this the “Yolo” release strategy.
They face another goal conflict: they can take advantage of a new capability in iOS 9 that will reduce the binary size by 50%, but to do so they have to drop support for iOS 8. They estimate that this decision will have a dollar cost of eight figures. With only a week to go before release, they drop iOS 8 and eat the cost to get under the cellular download limit.
The engineers believe that dropping iOS 8 support should provide them with enough headroom to figure out a strategy for dealing with the 100 MB download limit, given the projected slowdown in the growth of the app. But their model of the growth rate is wrong: the app is growing too quickly. There’s a risk of decompensation, of not being able to work around the growth rate of the app.
And so the engineers adapt. They form a strike team to come up with approaches for bringing the app size under control. They employ workarounds such as deleting unused features, checking for expensive code patterns, and rewriting the Apple Watch app in Objective-C.
An Uber engineer in the Amsterdam office comes up with an innovative workaround: he uses an annealing algorithm to re-order the Swift compiler’s optimization passes to minimize the size of the resulting binary. And it works! It also terrifies the Swift compiler engineers, as they haven’t tested running the optimization passes in arbitrary orders.
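To give a feel for what an annealing search over pass orderings might look like, here is a minimal sketch in Python. It is not the Uber engineer's tool, and the thread does not describe his implementation: the pass names, the `toy_binary_size` cost function, and the `anneal_order` helper are all hypothetical stand-ins. In a real setting the cost function would presumably compile the app with the candidate ordering and measure the binary, which is far more expensive than the toy used here.

```python
import math
import random

# Hypothetical pass names, for illustration only (not real Swift compiler passes).
PASSES = ["inline", "const_fold", "dead_code_elim", "merge_functions", "outline"]

# Made-up "ideal" ordering that the toy cost function rewards.
_IDEAL = ["const_fold", "dead_code_elim", "inline", "outline", "merge_functions"]


def toy_binary_size(order):
    """Toy stand-in for 'compile and measure': 1000 KB plus 10 KB for every
    pair of passes that appears in the opposite order from _IDEAL."""
    rank = {name: i for i, name in enumerate(_IDEAL)}
    inversions = sum(
        1
        for a in range(len(order))
        for b in range(a + 1, len(order))
        if rank[order[a]] > rank[order[b]]
    )
    return 1000 + 10 * inversions


def anneal_order(passes, cost, steps=2000, start_temp=20.0, seed=0):
    """Search for a low-cost ordering of `passes` using simulated annealing."""
    rng = random.Random(seed)
    current = list(passes)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Temperature decays linearly from start_temp towards zero.
        temp = start_temp * (1 - step / steps)
        # Propose a neighbouring ordering by swapping two positions.
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or (temp > 0 and rng.random() < math.exp(-delta / temp)):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


best_order, best_size = anneal_order(PASSES, toy_binary_size)
print(best_order, best_size)
```

The search only ever asks "how big is the result for this ordering?", so it will happily try orderings nobody has run before, which is consistent with why the compiler engineers found the approach alarming.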
And yet, the risk of decompensation is ever-present: the strike team worries that their space-saving wins will not be able to keep pace with the growth of the application.
Fortunately, Apple moves the boundary: increasing the cellular download limit to 150 MB and introducing new size optimization features in the Swift compiler.
The above is my retelling of a Twitter thread by McLaren Stanley, a former Uber engineer. I highly recommend reading the original thread in full. My writing above is based solely on that thread: I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.
I wrote the post above using the frame of what the researcher David Woods calls the adaptive universe. I tried to cast events in terms of people undergoing pressure, encountering risks of saturation, and then adapting in the face of that pressure, and those adaptations leading to reverberations that introduce unexpected change in the system. Woods calls these adaptive cycles.
I’ve previously written briefly about the adaptive universe, but to learn more about this model, check out this material by Woods:
Thanks for saving this story from Twitter. I think stories like this are ten or more times more valuable for learning than a typical post in Uber’s engineering blog. I’ve written about this here: https://engineeringideas.substack.com/p/study-the-style-of-doing-science