You can’t judge risk in hindsight

A while back, the good folks at Google SRE posted an article titled Lessons Learned from Twenty Years of Site Reliability Engineering. There’s some great stuff in here, but I wanted to pick on the first lesson: The riskiness of a mitigation should scale with the severity of the outage. Here are some excerpts from the article (emphasis mine)

Let’s start back in 2016, when YouTube was offering your favorite videos such as “Carpool Karaoke with Adele” and the ever-catchy “Pen-Pineapple-Apple-Pen.” YouTube experienced a fifteen-minute global outage, due to a bug in YouTube’s distributed memory caching system, disrupting YouTube’s ability to serve videos.

We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it’s meant to resolve. During the aforementioned YouTube outage, a risky load-shedding process didn’t fix the outage… it instead created a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

The question I had reading this was: how did the authors make the judgment that the load-shedding mitigation was risky? In particular, how was the risk of the mitigation perceived in the moment? Note: this question is still relevant, even if the authors/contributors were the actual responders!

When a bad outcome happens, it’s easy to say with hindsight that the action was risky. But we can really only judge the riskiness based on what was understood by the operators at the time they had to make the call. As the good Dr. Cook noted in the endlessly quotable How Complex Systems Fail, all practitioner actions are gambles:

After accidents, the overt failure often appears to have been inevitable and the
practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. But all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment. That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.

I have no firsthand knowledge of this particular incident. But, just as nobody ever wakes up and says “I’m going to do a bad job today”, nobody wakes up and says “I’m going to take unnecessary risks today.” Doing operations work means making risk trade-offs under uncertainty. We generally don’t know in advance how risky a particular mitigation will be. I think the real lessons is to recognize the inherent challenge that operators face in these scenarios.

One thought on “You can’t judge risk in hindsight

Leave a comment