Recently, Salesforce released a public incident writeup for a service outage that happened in mid-May. There’s a lot of good stuff in here (DNS! A config change!), but I want to focus on one aspect of the writeup, a contributing factor described in the writeup as Subversion of the Emergency Break Fix (EBF) process.
Here are some excerpts from that section of the writeup (emphasis in the original):
An [Emergency Break Fix] is an unplanned and urgent change that is required to prevent or remediate a Severity-0, a Severity-1, or a Severity-2 incident… Non-urgent changes, i.e. those which do not require immediate attention, should not be deployed as EBFs.
…In this situation, there was no active or imminent Severity-0, Severity-1 or Severity-2 incident, so the EBF process should not have been used, and standard Salesforce stagger processes should not have been ignored.
By following an emergency process, this change avoided the extensive review scrutiny that would have occurred had it been made as a standard change under the Salesforce Change Traffic Control (CTC) process. … In this case, the engineer subverted the known policy and the appropriate disciplinary action has been taken to ensure this does not happen in the future.
“What was the engineer thinking? “ a reader wonders. I certainly did. People make decisions for reasons that make sense to them. I have no idea what the engineer’s reasoning was here, because there’s not even a hint of that reasoning alluded to here.
Is this process commonly circumvented by engineers for some reason? (i.e., was this situation actually more common than the writeup lets on?) Alternately, was the engineer facing atypical time pressure? If so, what was the nature of the time pressure?
One of the functions of public writeups is to give customers confidence in the organization’s ability to deal with future incidents. This section had the opposite effect, it filled me with dread. It communicates to me that the organization is not interested in understanding how actual work is done.
Woe be it to the next engineer caught in the double bind where there will be consequences if they don’t work quickly enough and there will be consequences if they don’t conform to a process that slows them down so much that they can’t get their work done quickly enough.