Recently, Vijay Chidambaram (a CS professor at UT Austin) asked me, “Why do so many outages involve configuration changes?”
I didn’t have a good explanation for him, and I still don’t. I’m using this post as an exercise in thinking out loud about possible explanations for this phenomenon.
It’s an illusion
It might be that config changes are not actually more dangerous; it just seems like they are. Perhaps we only notice the writeups that mention a config change and forget the ones that don’t. Or perhaps it’s a base rate illusion: config changes show up in incidents more often than code changes simply because config changes are more common than code changes.
I don’t believe this hypothesis: I think the config change effect is a real one.
Config changes as second-class
In the recent Salesforce incident, the writeup noted that:
For many of Salesforce’s systems, the deployment pipelines have built-in stagger and canary requirements that are automated. For Salesforce’s DNS systems, the automation and enforcement of staggering through technology is still being built. For this configuration change and script, the stagger process was still manual.
If an organization has the ability to stage their changes across different domains, I’d wager heavily that they supported staged code deployments before they supported staged configuration changes. That’s certainly true at Netflix, where Spinnaker had support for regional rollout of code changes well before it had support for regional rollout of config changes.
This one feels like a real contributor to me. I’ve found that deployment tooling tends to support code changes better than config changes: there’s just more engineering effort put into making code changes safer.
Config changes are hard to stage
In the case of the Salesforce incident, the configuration change could theoretically have been staged. However, it may be that configuration changes by their nature are harder to roll out in a staged fashion. Configuration is more likely to be inherently global than code.
I’m really not sure about this one. I have no sense as to how many config changes can be staged.
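For concreteness, here’s a rough sketch of what a region-by-region config rollout could look like. This isn’t how any particular tool works; the region names, bake time, and helper functions are all invented for illustration.

```python
# Hypothetical sketch: rolling a config change out one region at a time,
# with a bake period and health check between steps. All names are invented.
import time

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # example regions

def apply_config(region: str, config: dict) -> None:
    """Push the new config to a single region (stub for illustration)."""
    print(f"applying config to {region}: {config}")

def healthy(region: str) -> bool:
    """Check region health after the change (stub for illustration)."""
    return True

def rollback(region: str) -> None:
    print(f"rolling back {region}")

def staged_rollout(config: dict, bake_time_s: int = 300) -> None:
    for region in REGIONS:
        apply_config(region, config)
        time.sleep(bake_time_s)  # let the change bake before moving on
        if not healthy(region):
            rollback(region)
            raise RuntimeError(f"config change failed in {region}, halting rollout")

if __name__ == "__main__":
    staged_rollout({"dns_ttl_seconds": 60}, bake_time_s=1)
```

The catch is that this pattern only works when the config is scoped per region in the first place; a truly global setting has nothing to stagger over.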
Config changes are hard to test
Have you ever written a unit test for a configuration value? I haven’t. It might be that config-change-related problems only manifest when the change is deployed into a production environment, so you can’t catch them at a smaller scope like a unit test.
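For illustration, here’s roughly what such a test could look like; the config file, field name, and loader are all invented. About the most it can do is check that the value is well-formed, which says nothing about whether that value is safe once it’s live:

```python
# Hypothetical sketch: a "unit test" for a configuration value (pytest style).
# The config format and field names are made up for illustration.
import json

def load_config(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def test_dns_ttl_is_positive(tmp_path):
    # pytest's tmp_path fixture provides a scratch directory
    cfg_file = tmp_path / "dns.json"
    cfg_file.write_text('{"dns_ttl_seconds": 60}')

    cfg = load_config(str(cfg_file))

    # We can check that the value parses and is well-formed...
    assert cfg["dns_ttl_seconds"] > 0
    # ...but nothing here tells us whether 60 is a safe value once every
    # resolver in production starts honoring it.
```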
I suspect this hypothesis plays a significant role as well.
Mature systems are more config-driven
Perhaps the sort of systems that are involved in large-scale outages at big tech companies are the more mature, reliable systems. These are the types of software that have evolved over time to enable operators to control more of their behavior by specifying policy in configuration.
This means that an operator is more likely to be able to achieve a desired behavior change via config versus code. And that sounds like a good thing. We all know that hard-coding things is bad, and changing code is dangerous. In the limit, we wouldn’t have to make any code changes at all to achieve the desired system behavior.
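As a toy illustration of that evolution (all names and values here are invented), the same behavior change goes from requiring a code deploy to requiring only a config push:

```python
# Hypothetical sketch of a system evolving from hard-coded behavior to
# config-driven policy. All names and values are invented for illustration.
import time

def flaky_operation() -> str:
    return "ok"  # stand-in for a real network call

def retry(fn, attempts: int, backoff_s: float):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s)

# Version 1: the retry policy is hard-coded; changing it means a code deploy.
def fetch_v1() -> str:
    return retry(flaky_operation, attempts=3, backoff_s=1.0)

# Version 2: the same policy lives in config; changing it is "just" a config
# push, which is why more behavior changes end up flowing through config.
def fetch_v2(config: dict) -> str:
    return retry(
        flaky_operation,
        attempts=config["retry_attempts"],
        backoff_s=config["retry_backoff_seconds"],
    )

if __name__ == "__main__":
    print(fetch_v1())
    print(fetch_v2({"retry_attempts": 5, "retry_backoff_seconds": 0.5}))
```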
So, perhaps the fact that config changes are more commonly implicated in large-scale outages is a sign of the maturity of the systems?
I have no idea about this one. It seems like a clever hypothesis, but perhaps it’s too clever.
Are feature flag changes considered config changes? If so, any issue due to a code change behind a feature flag is likely to show up as a config change causing the issue. This is really a special case of your last reason.
Thanks for writing!