Operator fault tolerance

Because “cloud” has become such a buzzword, it’s tempting to dismiss cloud computing as nothing new. But one genuine change is the rise of software designed to work in an environment where hardware failures are expected. The classic example of this trend is the Netflix Chaos Monkey, which tests a software system by randomly terminating instances. The IT community calls this sort of system “highly available”, whereas the academic community prefers the term “fault tolerant”.

If you plan to deploy a system like an OpenStack cloud, you need to be aware of the failure modes of the system components (disk failures, power failures, networking issues), and ensure that your system stays functional when these failures occur. However, when you actually deploy OpenStack on real hardware, you quickly discover that the component most likely to generate a fault is you, the operator. Because every installation is different, and because OpenStack has so many options, the probability of forgetting an option or specifying an incorrect value in a config file on the initial deployment is approximately one.

And while developers now design software to minimize the impact of hardware failures, there is no equivalent notion of minimizing the impact of operator failures. That would require asking questions at development time such as: “What will happen if somebody puts ‘eth1’ instead of ‘eth0’ for public_interface in nova.conf? How would they figure out what has gone wrong?”
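
One way to make that concrete: the software could validate its own configuration at startup. Below is a minimal sketch (in Python, and not actual nova code — the function name and error message are my own invention) of the kind of check an operator-fault-tolerant design might include: verify that the interface named by public_interface actually exists on the host, and fail fast with a message that points at the offending option.

    # Hypothetical sketch -- not part of nova. Shows the idea of failing fast
    # with an actionable message when public_interface is mistyped in nova.conf.
    import socket
    import sys

    def check_public_interface(name):
        """Exit with a clear error if the configured interface doesn't exist."""
        interfaces = [ifname for _, ifname in socket.if_nameindex()]
        if name not in interfaces:
            sys.exit("Config error: public_interface=%s in nova.conf does not "
                     "match any interface on this host (found: %s)"
                     % (name, ", ".join(interfaces)))

    if __name__ == "__main__":
        check_public_interface("eth1")  # the mistyped value from the example

The check itself is trivial; the point is that catching the typo at startup, with a message that names the option, is far cheaper than having the operator debug mysterious networking behavior later.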

Designing for operator fault tolerance would be a significant shift in thinking, but I would wager that the additional development effort would translate into enormous reductions in operations effort.

Payroll systems, not yet a solved problem

I’m simultaneously unsurprised and shocked by this story about how SAP failed to deliver a payroll system that could properly handle 1,300 employees, after the state of California spent $50 million on system development. We’ve been building payroll systems for decades now, and I believe that SAP is the largest software company on the planet that builds these kinds of systems.

This is a useful counterweight to Bertrand Meyer’s recent blog post about how most of the software we interact with on a daily basis works well. He’s right, but we must also avoid falling prey to survivorship bias.

ESEM 2013 Industry Track CFP

The Call for Papers for the Industry Track of the International Symposium on Empirical Software Engineering and Measurement (ESEM 2013) is out. I’m serving as chair of the industry track this year.

If you’re reading this and you work in the software development world (and especially if you’re in the Baltimore/DC area), I encourage you to submit a paper that you think would be of interest to software engineering researchers or other developers.

I have a strong suspicion that the software engineering research community doesn’t have a good sense of the kinds of problems that software developers really face. What I’d really like to do with the industry track is bring professional developers and software engineering researchers together to talk about these sorts of problems.

Also, if you’re reading this and you live in the software world, I encourage you to check out what ESEM is about, even if you’re not interested in publishing a paper. This is a conference that’s focused on empirical study and measurement. If you ask me, every software engineering conference should be focused on empirical study. Because, you know, science.