Operator fault tolerance

Because “cloud” has become such a buzzword, it’s tempting to dismiss cloud computing as nothing new. But one genuine change is the rise in software designed to work in an environment where hardware failures are expected. The classic example of this trend is the Netflix Chaos Monkey, which tests a software system by initiating random failures. The IT community calls this sort of system “highly available”, whereas the academic community prefers the term “fault tolerant”.

If you plan to deploy a system like an OpenStack cloud, you need to be aware of the failure modes of the system components (disk failures, power failures, networking issues), and ensure that your system can stay functional when these failures occur. However, when you actually deploy OpenStack on real hardware, you quickly discover that the component that is most likely to generate a fault is you, the operator. Because every installation is different, and because OpenStack has so many options, the probability of forgetting an option or specifying the incorrect value in a config file on the initial deployment is approximately one.

And while developers now design software to minimize the impact due to a hardware failure, there is no such notion of minimizing the impact due to an operator failure. This would require asking questions at development time such as: “What will happen if somebody puts ‘eth1’ instead of ‘eth0’ for public_interface in nova.conf? How would they determine what has gone wrong?”

Designing for operator fault tolerance would be a significant shift in thinking, but I would wager that the additional development effort would translate into enormous reductions in operations effort.

2 thoughts on “Operator fault tolerance

  1. anyone who’s done a form-intensive web application knows about the level of effort and requisite depth of validation, but also knows how many inane email and support requests it avoids.

  2. suppose that in the case of deploying IaaS and PaaS platforms that test harnesses or OpenStack option-level (aka unit) tests executed in a post-configuration context may be more easily achieved than doing real-time or pre-configuration validation…

Leave a comment