The inherent weirdness of system behavior

All implementations of mutable state in a geographically distributed system are either slow (require coordination when updating data) or weird (provide weak consistency only).

Sebastian Burckhardt, Principles of Eventual Consistency

The Generalized Uncertainty Principle (G.U.P.): Systems display antics. Alternatively: Complex systems exhibit unexpected behavior.

John Gall, The Systems Bible

When systems or organizations don’t work the way you think they should, it is generally not because the people in them are stupid or evil. It is because they are operating according to structures and incentives that aren’t obvious from the outside.

Jennifer Pahlka, Recoding America

It is also counterproductive by encouraging researchers and consultants and organizations to treat errors as a thing associated with people as a component — the reification fallacy (a kind of over-simplification), treating a set of interacting dynamic processes as if they were a single process.

David Woods, Sidney Dekker, Richard Cook, Leila Johannesen, Nadine Sarter, Behind Human Error

We humans solve problems by engineering systems. In a sense, a system is the opposite of a classical atom. Where an atom was conceived of as an indivisible entity, a system is made up of a set of interacting components. These components are organized in such a way that the overall system accomplishes a useful set of functions as conceived of by the designers.

Unfortunately, it’s impossible to build a perfect complex system. It’s also the case that we humans are very bad at reasoning about the behavior of unfamiliar complex systems when they deviate from our expectations.

The notion of consistency in distributed systems is a great example of this. Because distributed systems are, well, systems, they can exhibit behaviors that wouldn’t happen with atomic systems. The most intuitive notion of consistency, called linearizability, basically means “this concurrent data structure behaves the way you would expect a sequential data structure to behave”. And linearizability doesn’t even encompass everything: it’s only meaningful if there is a notion of a global clock (which isn’t the case in a distributed system), and it only covers single objects, which means it doesn’t cover transactions across multiple objects. However, ensuring linearizability is difficult enough that we typically need to relax our consistency requirements when we build distributed systems, which means we need to choose a weaker model.
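To make the weirdness concrete, here is a minimal sketch (a toy, not any real database client) of why asynchronous replication breaks linearizability: a write acknowledged by one replica is not yet visible on another, so a later read returns a stale value, something a single sequential object can never do. All class and method names here are invented for illustration.

```python
class Replica:
    """A single in-memory key-value replica."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class AsyncReplicatedStore:
    """Two replicas; writes hit the primary and replicate lazily."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self._log = []  # replication log, applied only on sync()

    def write(self, key, value):
        self.primary.write(key, value)   # acknowledged immediately...
        self._log.append((key, value))   # ...but replication is deferred

    def sync(self):
        # Drain the replication log into the secondary.
        for key, value in self._log:
            self.secondary.write(key, value)
        self._log.clear()


store = AsyncReplicatedStore()
store.write("x", 1)                     # client A's write completes
stale = store.secondary.read("x")       # client B reads the secondary
print(stale)                            # None: the completed write is invisible
store.sync()
print(store.secondary.read("x"))        # 1: eventually, the replicas converge
```

Under linearizability, any read that starts after a completed write must observe it; the `None` above is exactly the kind of behavior a weaker model permits.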

What I love about consistency models is that they aren’t treated as correctness models. Instead, they’re weirdness models: different levels of consistency will violate our intuitions relative to linearizability, and we need to choose a level of weirdness that we can actually implement and that is good enough for our application.

These sorts of consistency problems, where systems exhibit behaviors that violate our intuitions, are not specific to distributed software systems. In some cases, the weirdness of the system behavior leads to a negative outcome, the sort of thing that we call an incident. Often the negative outcome is attributed to the behavior of an individual agent within the system, where it gets labeled as “human error”. But as Woods et al. point out in the quote above, this attribution is based on an incorrect assumption about how systems actually behave.

The problem isn’t the people within the system. The weirdness arises from the interactions.
