A conjecture on why reliable systems fail

(Some of my co-workers call this Lorin’s Law)

Even highly reliable systems go down occasionally. After reading the details of several incidents, I’ve noticed a pattern, which has led me to the following conjecture:

Once a system reaches a certain level of reliability, most major incidents will involve:

A manual intervention that was intended to mitigate a minor incident, or
Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Here are three examples from Amazon’s post-mortem write-ups of major AWS outages:

The S3 outage on February 28, 2017 involved a manual intervention to debug an issue that was causing the S3 billing system to progress more slowly than expected.

The DynamoDB outage on September 20, 2015 (which also affected SQS, auto scaling, and CloudWatch) involved healthy storage servers taking themselves out of service by executing a distributed protocol that was (presumably) designed that way for fault tolerance.

The EBS outage on October 22, 2012 (which also affected EC2, RDS, and ELBs) involved a memory leak bug in an agent that monitors the health of EBS servers.
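
To make the second kind of failure concrete, here is a toy simulation in Python, loosely inspired by the DynamoDB example. It is a sketch under invented assumptions, not a model of the actual DynamoDB protocol: the function name, the parameters (capacity, retry_factor, spike_size) and every number are hypothetical. Each round, every in-service server asks a membership service to renew its lease. A server whose request goes unanswered takes itself out of service as a safety measure and then retries more aggressively to rejoin.

```python
def simulate(num_servers=100, capacity=105, retry_factor=3,
             spike_round=2, spike_size=60, rounds=12):
    """Return the number of in-service servers after each round.

    All parameters are hypothetical and chosen only for illustration.
    """
    in_service = num_servers
    history = []
    for r in range(rounds):
        out = num_servers - in_service
        # A one-off burst of extra requests hits the membership service.
        spike = spike_size if r == spike_round else 0
        # In-service servers send one renewal each; withdrawn servers
        # send retry_factor rejoin requests each.
        demand = in_service + retry_factor * out + spike

        def answered(n):
            # If the service is overloaded, only a proportional share
            # of any group of n servers gets an answer in time.
            if demand <= capacity:
                return n
            return n * capacity // demand

        # Servers that get no answer withdraw themselves, even though
        # they are perfectly healthy. Simplification: a withdrawn server
        # rejoins at the same rate as a renewal succeeds.
        in_service = answered(in_service) + answered(out)
        history.append(in_service)
    return history


if __name__ == "__main__":
    print("no spike:  ", simulate(spike_size=0))
    print("with spike:", simulate())
```

With these made-up numbers, the run without a spike stays at full strength, because the service has slightly more capacity than the fleet needs. The run with a single-round spike never returns to full strength: the retries from withdrawn servers keep demand above capacity long after the spike is gone. The mechanism that was meant to protect the system is what keeps it degraded.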

6 thoughts on “A conjecture on why reliable systems fail”

foo says:

October 19, 2017 at 7:23 pm

In the fun book “The systems bible” (which I recommend) it is observed that (among other antics), “Fail-safe systems fail by failing unsafely”.

It is quite frequent that safety in a system is obtained by pulling in some kind of subsystem designed for that goal. Therefore, when the subsystem fails… I suppose you can design a system such that safety is derived from how the system is structured (rather than added as an extra), but I’m not sure if you can do that on purpose!

tristanls says:

April 20, 2021 at 4:48 am

I’m curious about the status of the conjecture now, four years later.

Lorin Hochstein says:

April 20, 2021 at 9:18 am

I think it has held up pretty well.

Pingback: The ambiguity of real work – Surfing Complexity
Pingback: Your lying virtual eyes – Surfing Complexity
Pingback: Quick takes on the recent OpenAI public incident write-up – Surfing Complexity

Published by Lorin Hochstein
