Subverting the process

Recently, Salesforce released a public incident writeup for a service outage that happened in mid-May. There’s a lot of good stuff in here (DNS! A config change!), but I want to focus on one aspect of the writeup, a contributing factor described in the writeup as Subversion of the Emergency Break Fix (EBF) process.

Here are some excerpts from that section of the writeup (emphasis in the original):

An [Emergency Break Fix] is an unplanned and urgent change that is required to prevent or remediate a Severity-0, a Severity-1, or a Severity-2 incident… Non-urgent changes, i.e. those which do not require immediate attention, should not be deployed as EBFs.

In this situation, there was no active or imminent Severity-0, Severity-1 or Severity-2 incident, so the EBF process should not have been used, and standard Salesforce stagger processes should not have been ignored. 

By following an emergency process, this change avoided the extensive review scrutiny that would have occurred had it been made as a standard change under the Salesforce Change Traffic Control (CTC) process. … In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.

What was the engineer thinking? “ a reader wonders. I certainly did. People make decisions for reasons that make sense to them. I have no idea what the engineer’s reasoning was here, because there’s not even a hint of that reasoning alluded to here.

Is this process commonly circumvented by engineers for some reason? (i.e., was this situation actually more common than the writeup lets on?) Alternately, was the engineer facing atypical time pressure? If so, what was the nature of the time pressure?

One of the functions of public writeups is to give customers confidence in the organization’s ability to deal with future incidents. This section had the opposite effect, it filled me with dread. It communicates to me that the organization is not interested in understanding how actual work is done.

Woe be it to the next engineer caught in the double bind where there will be consequences if they don’t work quickly enough and there will be consequences if they don’t conform to a process that slows them down so much that they can’t get their work done quickly enough.

3 thoughts on “Subverting the process

  1. After reading the first two root causes, I think the EBF root cause is superfluous.

    I’m unclear on whether they’re calling the global change an EBF by definition or by procedure – it could be that they’re saying that all non-staggered changes are inherently EBF and subject to emergency procedures. Or perhaps the engineer had to activate some procedure to make an “emergency” change – this is unclear. But, I don’t think it’s terribly important: what IS clear is that staggering the release of this change required more manual work.

    Things that require more manual work are generally more error prone. If they had (for whatever reason) mistaken confidence in their change (or mistaken assessment of the damage the change could cause), they may have thought the risks of screwing up the manual staggered rollout were higher than the risks of doing the change once globally with global potential failure.

    I guess people like postmortem AIs but what I got out of their doc is they need to make the safer procedure the simpler procedure. Asking someone to undertake a more complex procedure in the name of “safety” isn’t a great strategy.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s