Back in July, Ray Ashman at Mailchimp posted a wonderful writeup of an internal incident (h/t to SRE Weekly). It took the Mailchimp engineers almost two days to make sense of the failure mode.
The trigger was a change to a logging statement, made so that it would log an exception. During the incident, the engineers noticed that this change lined up with the time the alerts fired. But, other than the timing, there was no evidence to suggest that the logging change was problematic. The change had no apparent relationship to the symptoms they were seeing in the job runner, which lived in a different part of the codebase. And so they assumed that the logging change was innocuous.
As it happened, there was a coupling between that log statement and the job runner. Unfortunately for the engineers, this coupling was effectively invisible to them. The connection between the logging statement and the job runner was Mailchimp’s log processing pipeline. Here’s an excerpt from the writeup (emphasis mine):
Our log processing pipeline does a bit of normalization to ensure that logs are formatted consistently; a quirk of this processing code meant that trying to log a PHP object that is Iterable would result in that object’s iterator methods being invoked (for example, to normalize the log format of an Array).
Normally, this is an innocuous behavior—but in our case, the harmless logging change that had shipped at the start of the incident was attempting to log PHP exception objects. Since they were occurring during job execution, these exceptions held a stacktrace that included the method the job runner uses to claim jobs for execution (“locking”)—meaning that each time one of these exceptions made it into the logs, the logging pipeline itself was invoking the job runner’s methods and locking jobs that would never be actually run!
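To make the mechanism concrete, here is a minimal sketch of this kind of coupling. It is not Mailchimp’s code: it’s Python rather than PHP, and the names normalize_for_logging, JobRunner, and LoggedException are invented for illustration. It only shows how a log normalizer that eagerly walks anything iterable will happily invoke methods that have side effects:

```python
from typing import Any


class JobRunner:
    """Stand-in for a job runner whose 'claim' method has a real side effect."""

    def __init__(self) -> None:
        self.locked_jobs: list[str] = []

    def claim_next_job(self):
        # Claiming ("locking") a job is a side effect: the job is now reserved,
        # and if nothing follows up on it, it will never actually run.
        self.locked_jobs.append("job-123")
        yield "job-123"


class LoggedException:
    """Stand-in for an exception-like object that happens to be iterable."""

    def __init__(self, runner: JobRunner) -> None:
        self._runner = runner

    def __iter__(self):
        # Iterating this object reaches back into the job runner, loosely
        # analogous to the stacktrace entries that invoked locking methods.
        return self._runner.claim_next_job()


def normalize_for_logging(value: Any) -> Any:
    """Naive normalizer: flattens anything iterable so log entries look uniform."""
    if hasattr(value, "__iter__") and not isinstance(value, (str, bytes)):
        # Walking the iterable is where the hidden side effect fires.
        return [normalize_for_logging(item) for item in value]
    return value


runner = JobRunner()
normalize_for_logging(LoggedException(runner))  # "just formatting a log line"
print(runner.locked_jobs)  # ['job-123']
```

The caller believes it is merely formatting a log line, but the act of formatting claims a job. That is the shape of the invisible coupling the Mailchimp engineers were up against.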
Fortunately, there were engineers who had encountered this failure mode before:
Since the whole company had visibility into our progress on the incident, a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.
…
Having identified the cause, we quickly reverted the not-so-harmless logging change, and our systems very quickly returned to normal.
In the moment, the engineers could not conceive of how the job runner’s behavior could be affected by the modification of a log statement in an unrelated part of the code. It was literally unthinkable to them.