Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”
I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise in thinking out loud about possible explanations for this phenomenon.
It’s an illusion
It might be that config changes are not actually more dangerous; it just seems like they are. Perhaps we only notice the writeups that mention a config change and forget the ones that don’t. Or perhaps it’s a base rate illusion: config changes show up in incidents more often than code changes simply because config changes are more common than code changes.
I don’t believe this hypothesis: I think the config change effect is a real one.
Config changes as second-class
In the recent Salesforce incident, the writeup noted that:
For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.
If an organization has the ability to stage their changes across different domains, I’d wager heavily that they supported staged code deployments before they supported staged configuration changes. That’s certainly true at Netflix, where Spinnaker had support for regional rollout of code changes well before it had support for regional rollout of config changes.
This one feels like a real contributor to me. I’ve found that deployment tooling tends to support code changes better than config changes: there’s just more engineering effort put into making code changes safer.
Config changes are hard to stage
In the case of the Salesforce incident, the configuration change could theoretically have been staged. However, it may be that configuration changes by their nature are harder to roll out in a staged fashion. Configuration is more likely to be inherently global than code.
I’m really not sure about this one. I have no sense as to how many config changes can be staged.
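For concreteness, here’s a rough sketch of what a region-by-region config rollout could look like. This isn’t how any particular tool works; the region names, bake time, and helper functions are all invented for illustration.

```python
# Hypothetical sketch: rolling a config change out one region at a time,
# with a bake period and health check between steps. All names are invented.
import time

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # example regions

def apply_config(region: str, config: dict) -> None:
    """Push the new config to a single region (stub for illustration)."""
    print(f"applying config to {region}: {config}")

def healthy(region: str) -> bool:
    """Check region health after the change (stub for illustration)."""
    return True

def rollback(region: str) -> None:
    print(f"rolling back {region}")

def staged_rollout(config: dict, bake_time_s: int = 300) -> None:
    for region in REGIONS:
        apply_config(region, config)
        time.sleep(bake_time_s)  # let the change bake before moving on
        if not healthy(region):
            rollback(region)
            raise RuntimeError(f"config change failed in {region}, halting rollout")

if __name__ == "__main__":
    staged_rollout({"dns_ttl_seconds": 60}, bake_time_s=1)
```

The catch is that this pattern only works when the config is scoped per region in the first place; a truly global setting has nothing to stagger over.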
Config changes are hard to test
Have you ever written a unit test for a configuration value? I haven’t. It might be that config-change-related problems only manifest when the change is deployed into a production environment, so you can’t catch them at a smaller scope like a unit test.
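For illustration, here’s roughly what such a test could look like; the config file, field name, and loader are all invented. About the most it can do is check that the value is well-formed, which says nothing about whether that value is safe once it’s live:

```python
# Hypothetical sketch: a "unit test" for a configuration value (pytest style).
# The config format and field names are made up for illustration.
import json

def load_config(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def test_dns_ttl_is_positive(tmp_path):
    # pytest's tmp_path fixture provides a scratch directory
    cfg_file = tmp_path / "dns.json"
    cfg_file.write_text('{"dns_ttl_seconds": 60}')

    cfg = load_config(str(cfg_file))

    # We can check that the value parses and is well-formed...
    assert cfg["dns_ttl_seconds"] > 0
    # ...but nothing here tells us whether 60 is a safe value once every
    # resolver in production starts honoring it.
```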
I suspect this hypothesis plays a significant role as well.
Mature systems are more config-driven
Perhaps the sort of systems that are involved in large-scale outages at big tech companies are the more mature, reliable systems. These are the types of software that have evolved over time to enable operators to control more of their behavior by specifying policy in configuration.
This means that an operator is more likely to be able to achieve a desired behavior change via config versus code. And that sounds like a good thing. We all know that hard-coding things is bad, and changing code is dangerous. In the limit, we wouldn’t have to make any code changes at all to achieve the desired system behavior.
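As a toy illustration of that evolution (all names and values here are invented), the same behavior change goes from requiring a code deploy to requiring only a config push:

```python
# Hypothetical sketch of a system evolving from hard-coded behavior to
# config-driven policy. All names and values are invented for illustration.
import time

def flaky_operation() -> str:
    return "ok"  # stand-in for a real network call

def retry(fn, attempts: int, backoff_s: float):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s)

# Version 1: the retry policy is hard-coded; changing it means a code deploy.
def fetch_v1() -> str:
    return retry(flaky_operation, attempts=3, backoff_s=1.0)

# Version 2: the same policy lives in config; changing it is "just" a config
# push, which is why more behavior changes end up flowing through config.
def fetch_v2(config: dict) -> str:
    return retry(
        flaky_operation,
        attempts=config["retry_attempts"],
        backoff_s=config["retry_backoff_seconds"],
    )

if __name__ == "__main__":
    print(fetch_v1())
    print(fetch_v2({"retry_attempts": 5, "retry_backoff_seconds": 0.5}))
```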
So, perhaps the fact that config changes are more commonly implicated in large-scale outages is a sign of the maturity of the systems?
I have no idea about this one. It seems like a clever hypothesis, but perhaps it’s too clever.
Are feature flag changes considered config changes? If so, any issue due to a code change behind a feature flag is likely to show up as a config change causing the issue. This is really a special case of your last reason.
Thanks for writing!