In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper with the title How Did Software Get So Reliable Without Proof? In this paper, Hoare grapples with the observation that software seems to be more reliable than computer science researchers expected was possible without the use of mathematical proofs for verification (emphasis added):
Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …
So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever has been subjected to the rigours of formal proof?
Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.
Looking back on this paper, what strikes me is the absence of acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added)
The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.
Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.
The problem with the question “How did software get so reliable without proof?” is that it’s asking the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.
By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you are doomed to not even be able to ask the right questions.
7 thoughts on “How did software get so reliable without proof?”
I would attribute it to rapid iteration. Compare it to the design of a physical mechanism like a toaster, a refrigerator, or any of the various components of a car: These things require extensive up-front design, because once you start designing dies and tooling and processes, it can be too late to change aspects of the design without significant cost. Software, on the other hand, can iterate much more quickly without cost. You design something, you build it, you test it, you go back to the design phase and all in a matter of minutes. Can you imagine the process to design website UI if you had to send your wireframes down to fabrication and wait a week to get back a prototype?
A lot of bugs and problems are found by the developers themselves during local testing, more still by QA efforts and even more by users during beta. Any bugs or errors or even inconveniences are found, reported, and fixed next cycle for nothing more than the cost of download bandwidth and time to run the update. It is precisely this rapid prototyping and quick iteration that is the reason why we make things out of software that we used to make out of hardware, and it’s precisely this why the quality of software improves so much in a short time.
I have to disagree.
Hoare posed this question in 1996 and refers to software written up until 20 years before that.
I am pretty sure “rapid iteration” was non existent back then, let alone downloads internet or anything similar.
What they had tons off though was over-engineering and super engaged engineers that chose the profession out of love and not for status/money as it is now the case.
BTW rapid iteration/agility is not intended to increase the reliability of a system. Its intended to maximize the value for money on software projects by for controlling and relocating money to features/modules/whatever is most important at each time.
Ask the DevOps guys how much reliable are our software. If we remove the DevOps from the equation, its broken. Good point.
No the question is REALLY how did software get so reliable without proofs?
We have not had proofs of tcp ip, etc etc, and the CRAZY amount of stuff which is run over it is bonkers.
It is not socio-technical systems, software-intensive systems, or joint cognitive systems
We don’t have people checking each packet to see if it showed up.
You have started swimming in the systems which are reliable to the degree which you stopped recognising that they are there.
I think you’re underplaying the amount of human action involved in keeping networks (including all aspects of the internet) functioning effectively.
It’s interesting to ask what is the role of human action to system reliability. A good thought experiment would be to consider how long various systems would continue operate without human intervention. Some things would stop working. Many things would keep working for years. From a programming/hardware perspective, it’s interesting that the latter exists at all.
Good observation on what the real question is, systems vs. software. Another aspect is that many of our systems rely on some combination of fault tolerance, redundancy, and correctness proofs. For example, some cryptography has correctness/security proofs, but the system its deployed in may be too complex overall for proofs, so it’s built to be fault-tolerant and/or redundant instead. Then you have a system that relies on all three methodologies, main it more reliable than Hoare anticipated.
For example, TCP/IP is designed to survive massive disruptions to its network, automatically routing around damage. Modern large-scale distributed systems like Google’s infrastructure anticipate hardware failures and are designed to accommodate them without the entire system going down. Airplanes, rockets and satellites are built with multiple redundant systems, etc. That, plus kaizen via rapid iteration as mentioned above probably explains part of it.