Nobody likes a mess. Especially in the world of software engineering, we always strive to build well-structured systems. No one ever sets out to build a big ball of mud.
Alas, we are constantly reminded that the systems we work in are messier than we’d like. This messiness often comes to light in the wake of an incident, when we dig in to understand what happened. Invariably, we find that the people are a particularly messy part of the overall system, and that the actions they take contribute to incidents. In the wake of the incident, we identify follow-up work that we hope will bring more order, less mess, into our world. What we miss, though, is the role that the messy nature of our systems plays in keeping things working.
When I use the term system here, I mean it in the broader sense of a socio-technical system that includes both the technological elements (software, hardware) and the humans involved, the operators in particular.
Yes, there are neat, well-designed structures in place that help keep our system healthy: elements that include automated integration tests, canary deployments, and staffed on-call rotations. But complementing those structures are informal layers of defense provided by the people in our system. These are the teammates who are not on-call but jump in to help, or folks who just happen to lurk in Slack channels and provide key context at the right moment, to either help diagnose an incident or prevent one from happening in the first place.
This informal, messy system of defense is like a dynamic, overlapping patchwork. And sometimes this system fails: for example, a person who would normally chime in with relevant information happens to be out of the office that day. Or someone takes an action that, under typical circumstances, would be beneficial but, under the specific circumstances of the incident, actually makes things worse.
We would never set out to design a socio-technical system the way our systems actually are. Yet, these organic, messy systems actually work better than the neat, orderly systems that engineers dream of, because of how the messy system leverages human expertise.
It’s tempting to bemoan messiness, and to always try to reduce it. And, yes, messiness can be an indicator of problems in the system: for example, people relying on workarounds instead of using the system as it was intended is a kind of messiness that points to a shortcoming in our system.
But the human messiness we see under the ugly light of failure is the messiness that actually helps keep the system up and running when that light isn’t shining. If we want to get better at keeping our systems up and running, we need to understand what the mess looks like when things are actually working. We need to learn to embrace the mess. Because there’s beauty in that mess, the beauty of a system that keeps on running day after day.