Any change can break us, but we can’t treat every change the same

Here are some excerpts from an incident story told by John Allspaw about his time at Etsy (circa 2012), titled Learning Effectively From Incidents: The Messy Details.

In this story, the site goes down:

September 2012 afternoon, this is a tweet from the Etsy status account saying that there’s an issue on the site… People said, oh, the site’s down. People started noticing that the site is down.

Possibly the referenced issue?

This is a tough outage: the web servers are down so hard that they aren’t even reachable:

And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. And people said, well, we can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights out for hard reboots. And people even said, well, because we’re talking about hundreds of web servers. Could it be faster, we could even just power cycle these. This is a big deal here. So whatever it wasn’t in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

One of the contributors? A CSS change to remove support for old browsers!

And one of the tasks was with the performance team and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So, let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff. …
And in this case, we had this template-based template used as far as we knew everything, and this little header-ie.css, was the actual workaround. And so the idea was, let’s remove all the references to this CSS file in this base template and we’ll remove the CSS file.

How does a CSS change contribute to a major outage?

The request would come in for something that wasn’t there, 404 would happen all the time. The server would say, well, I don’t have that. So I’m going to give you a 404 page and so then I got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon all of the 404s are keeping all of the Apache servers, all of the Apache processes across hundreds of servers hung, nothing could be done.

I love this story because a CSS change feels innocuous. CSS just controls presentation, right? How could that impact availability? From the story (emphasis mine)

And this had been tested and reviewed by multiple people. It’s not all that big of a deal of a change, which is why it was a task that was sort of slated for the next person who comes through boot camp in the performance team.

The reason a CSS change can cascade into an outage is that in a complex system there are all of these couplings that we don’t even know are there until we get stung by them.

One lesson you might take away from this story is “you should treat every proposed change like it could bring down the entire system”. But I think that’s the wrong lesson. The reason I think so is because of another constraint we all face: finite resources. Perhaps in a world where we always had an unlimited amount of time to make any changes, we could take this approach. But we don’t live in that world. We only have a fixed number of hours in a week, which means we need to budget our time. And so we make judgment calls on how much time we’re going to spend on manually validating a change based on how risky we perceive that change to be. When I review someone else’s pull request, for example, the amount of effort I spend on it is going to vary based on the nature of the change. For example, I’m going to look more closely at changes to database schemas than I am to changes in log messages.

But that means that we’re ultimately going to miss some of these CSS-change-breaks-the-site kinds of changes. It’s fundamentally inevitable that this is going to happen: it’s simply in the nature of complex systems. You can try to add process to force people to scrutinize every change with the same level of effort, but unless you remove schedule pressure, that’s not going to have the desired effect. People are going to make efficiency-thoroughness tradeoffs because they are held accountable for hitting their OKRs, and they can’t achieve those OKR if they put in the same amount of effort to evaluate every single production change.

Given that we can’t avoid such failures, the best we can do is to be ready to respond to them.

Leave a comment