Slack’s Jan 2021 outage: a tale of saturation

Laura Nolan of Slack recently published an excellent write-up of their Jan. 4, 2021 outage on Slack’s engineering blog. One of the things that struck me about this write-up is the contributing factors that aren’t part of this outage. There’s nothing about a bug that somehow made its way into production, or an accidentally …

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure. All of us are so often working near saturation: we have more work to do than time to do it. As …

Burned by ‘let it burn’

Here are some excerpts from an L.A. Times story headlined “Forest Service changes ‘let it burn’ policy following criticism from western politicians” (emphasis mine):

Facing criticism over its practice of monitoring some fires rather than quickly snuffing them out, the U.S. Forest Service has told its firefighters to halt the policy …

Uber’s adventures in the adaptive universe

It’s 2016, and Uber engineers are facing a problem. Their software system has become brittle: many in the organization feel that it’s too hard to make changes to it without breaking things. And so, they adapt: they build a new architecture, one that’s designed to enable teams to move more quickly. As part of the …