March 2020 – Surfing Complexity

In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper titled “How Did Software Get So Reliable Without Proof?” In it, Hoare grapples with the observation that software seems to be more reliable than computer science researchers had thought possible without the use of mathematical proofs for verification (emphasis added):

Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …
So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever been subjected to the rigours of formal proof?

Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.

Looking back on this paper, what strikes me is the absence of any acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added):

The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.

The problem with “How did software get so reliable without proof?” is that it’s the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you won’t even be able to ask the right questions.