Simplicity is prerequisite for reliability. — Edsger W. Dijkstra
Think about a system whose reliability has significantly improved over some period of time. The first example that comes to my mind is commercial aviation, but I’d encourage you to think of a software system you’re familiar with, either as a user (e.g., Google, AWS) or as a maintainer, that’s gotten more reliable over time.

Now, for the system you thought of whose reliability improved over time, consider what its complexity trend looks like over that same period. I’d wager you’d see a similar trend: complexity went up as well.

Now, in general, increases in complexity don’t necessarily lead to increases in reliability. In some cases, engineers make a deliberate decision to trade off reliability for new capabilities. The telephone system today is much less reliable than it was when I was younger. I grew up in the 80s and 90s, when the phone system was so reliable that it was shocking to pick up the phone and not hear a dial tone. We were more likely to experience a power failure than a telephony outage, and the phones still worked when the power was out! I don’t think we even knew the term “dropped call”. Connectivity issues with cell phones are much more common than they ever were with landlines. But this was a deliberate tradeoff: we gave up some reliability in order to have ubiquitous access to a phone.
Other times, the increase in complexity isn’t the product of an explicit tradeoff but rather an entropy-like effect of a system getting more difficult to deal with over time as it accretes changes. This scenario, the one most people have in mind when they think about increasing complexity in their systems, is synonymous with the idea of tech debt. With tech debt, the increase in complexity makes the system less reliable, because it raises the risk that any given change will break something. I started this blog post with a quote from Dijkstra about simplicity. Here’s another one, along the same lines, from C.A.R. Hoare’s Turing Award Lecture in 1980:
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
What Dijkstra and Hoare are saying is: the easier a software system is to reason about, the more likely it is to be correct. And this is true: when you’re writing a program, the simpler the program is, the more likely you are to get it right. However, as we scale up from individual programs to systems, this principle breaks down. Let’s see how that happens.
Dijkstra claims simplicity is a prerequisite for reliability. According to Dijkstra, if we encounter a system that’s reliable, it must be a simple system, because simplicity is required to achieve reliability.
reliability ⇒ simplicity
The claim I’m making in this post is the exact opposite: systems that improve in reliability do so by adding features that improve reliability but come at the cost of increased complexity.
reliability ⇒ complexity
Look at classic works on improving the reliability of real-world systems, like Michael Nygard’s Release It!, Joe Armstrong’s Making reliable distributed systems in the presence of software errors, and Jim Gray’s Why Do Computers Stop and What Can Be Done About It? Then think about the work we do to make our software systems more reliable: retries, timeouts, sharding, failovers, rate limiting, back pressure, load shedding, autoscaling, circuit breakers, transactions, and the auxiliary systems we maintain to support this reliability work, like an observability stack. All of this stuff adds complexity.
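To make that concrete, here’s a minimal sketch in Python of just one of those mechanisms, a retry wrapper with a timeout and exponential backoff. The function and parameter names are hypothetical, not taken from any of the works above:

```python
import random
import time


def call_with_retries(fn, *, attempts=3, timeout_s=2.0, base_delay_s=0.1):
    """Call fn(timeout_s), retrying with exponential backoff and jitter.

    Without reliability concerns, this whole function would just be fn(timeout_s).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(timeout_s)  # the callee is expected to honor the timeout
        except Exception as exc:  # real code would catch only retryable errors
            last_error = exc
            # Exponential backoff with jitter, so many clients retrying at
            # once don't all hammer the struggling service in lockstep.
            time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_error
```

Even this toy version has to decide how many attempts to make, how long to back off, and which errors are safe to retry; the one-line call it wraps had none of those concerns.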
Imagine if I took a working codebase and proposed deleting all of the lines of code that are involved in error handling. I’m very confident that this deletion would make the codebase simpler. There’s a reason that programming books tend to avoid error handling in their examples: it really does increase complexity! But if you were maintaining a reliable software system, I don’t think you’d be happy with me if I submitted a pull request that deleted all of the error handling code.
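As a rough illustration (a hypothetical snippet, not from any particular codebase), here is what that kind of deletion looks like on a single small function:

```python
import logging
import shutil

log = logging.getLogger(__name__)


def copy_config_simple(src, dst):
    # What's left after the error handling is deleted: short, clear, fragile.
    shutil.copyfile(src, dst)


def copy_config_reliable(src, dst):
    # What a reliable system actually needs: longer, busier, harder to read.
    try:
        shutil.copyfile(src, dst)
    except FileNotFoundError:
        log.error("source config %s is missing", src)
        raise
    except PermissionError:
        log.error("cannot write %s: permission denied", dst)
        raise
    except OSError as exc:
        log.error("copy from %s to %s failed: %s", src, dst, exc)
        raise
```

The second version is several times longer than the first, and every extra line exists purely in service of reliability.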
Let’s look at the natural world, where biology provides us with endless examples of reliable systems. Evolution has designed survival machines that just keep on going; they can heal themselves in simply marvelous ways. We humans haven’t yet figured out how to design systems that can recover from the variety of problems that a living organism can. Simple, though, they are not. They are astonishingly, mind-bogglingly complex. Organisms are the paradigmatic example of complex adaptive systems. However complex you think biology is, it’s actually even more complex than that. Mother nature doesn’t care that humans struggle to understand her design work.
Now, I’m not arguing that this reliability-that-adds-complexity is a good thing. In fact, I’m the first person who will point out that this complexity in service of reliability creates novel risks by enabling new failure modes. What I’m arguing instead is that achieving reliability by pursuing simplicity is a mirage. Yes, we should pay down tech debt and simplify our systems by reducing accidental complexity: there are gains in reliability to be had through this simplifying work. But I’m also arguing that successful systems are always going to get more complex over time, and some of that complexity is due to work that improves reliability. Successful reliable systems are going to inevitably get more complex. Our job isn’t to reduce that complexity, it’s to get better at dealing with it.
This I 100% agree with.
And this is where I wonder if we’re still comparing apples to apples.
When scaling up, aren’t we comparing systems that actually do different things, one doing strictly more than the other?
I would think that the original EWD quote still makes sense when comparing systems providing the same guarantees.
Anyway, this pithy quote has zero context (it’s on its own in an EWD full of aphorisms), so we can ascribe multiple meanings to it, and therefore take it with a grain of salt.
Some more Dijkstra quotes I think are just as relevant:
“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it.”
“How do we convince people that in programming simplicity and clarity —in short: what mathematicians call “elegance”— are not a dispensable luxury, but a crucial matter that decides between success and failure?”
With all that said, I do agree with your conclusions: improving reliability usually requires solving more (different) problems, and thus increases overall complexity. That would not in and of itself be accidental complexity though, if we are indeed careful to keep the latter under control.
And I think it’s often worth asking the extra (and hard) question: if we add this feature, will we reach the point of diminishing or even negative returns?
If we’re doing quotes, I much prefer Gall’s Law:
“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.”
Systemantics is slept on as a classic work.
It’s a book I keep recommending (I love the brief chapter on “Communication theory”), but I suspect it takes a rather uncommon kind of mind to appreciate it: its truth bombs can often sound like pure satire, but most often they are really a reflection of the chaos of the world and the flaws of human nature.
Indeed. It took me years before I realised it’s more fatalistic than satirical.
I wouldn’t be surprised if even in biology there are systems that are simple relative to other biological systems, and if the more-complex biological systems can be less reliable than the simpler ones.