The system is in trouble. Maybe a network link has gotten saturated, or a bad DNS configuration got pushed out. Maybe the mix of incoming requests suddenly changed and now there are a lot more heavy requests than light ones, and autoscaling isn’t helping. Perhaps a data feed got corrupted and there’s no easy way to bring the affected nodes back into a good state.
Whatever the specific details are, the system has encountered a situation that it wasn’t designed to handle. This is when the alerts go off and the human operators get involved. The operators work to reconfigure the system to get through the trouble. Perhaps they manually scale up a cluster that doesn’t scale automatically, or they recycle nodes, make a configuration change, or redirect traffic to relieve pressure on some part of the system.
If we think about the system in terms of the computer-y parts, the hardware and the software, then it’s clear that the system couldn’t handle this new failure mode. If it could, the humans wouldn’t have to get involved.
We can broaden our view of the system to also include the humans; this broader system is sometimes known as the socio-technical system. In some cases, the socio-technical system is actually designed to handle cases that the software system alone can’t: these are the scenarios that we document in our runbooks. But, all too often, we encounter a completely novel failure mode. For the poor on-call, there’s no entry in the runbook that describes the steps to solve this problem.
In cases where the failure is completely novel, the human operators have to improvise: they have to figure out on the fly what to do, and then make the relevant operational changes to the system.
If the operators are effective, then even though the socio-technical system wasn’t designed to function properly in the face of this new kind of trouble, the people within the system make changes that result in the overall system functioning properly again.
It is this capability of a system, its ability to change itself when faced with a novel situation in order to deal effectively with that novelty, that David Woods calls graceful extensibility.
Here’s how Woods defines graceful extensibility in his paper, *The Theory of Graceful Extensibility: Basic rules that govern adaptive systems*:
> Graceful extensibility is the opposite of brittleness, where brittleness is a sudden collapse or failure when events push the system up to and beyond its boundaries for handling changing disturbances and variations. As the opposite of brittleness, graceful extensibility is the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries.
This idea is a real conceptual leap for those of us in the software world, because we’re used to thinking about the system only as the software and the hardware. The idea of a system like that adapting to a novel failure mode is alien to us, because we can’t write software that does that. If we could, we wouldn’t need to staff on-call rotations.
We humans can adapt: we can change the system, both the technical bits (e.g., changing configuration) and the human bits (e.g., changing communication patterns during an incident, either who we talk to or the communication channel involved).
However, because we don’t think of ourselves as being part of the system, when we encounter a novel failure mode and the human operators step in and figure out how to recover, our response is typically, “the system could not handle this failure mode (and so humans had to step in)”.
In one sense, that assessment is true: the system wasn’t designed to handle this failure mode. But in another sense, when we expand our view of the system to include the people, an alternate response is, “the system encountered a novel failure mode and we figured out how to make operational changes to make the system healthy again.”
We hit the boundary of what our system could handle, and we adapted, and we gracefully extended that boundary to include this novel situation. Our system may not be able to deal with some new kind of trouble on its own. But, if the system has graceful extensibility, then it can change itself when that new trouble arrives so that it can handle it.
Graceful extensibility is *exactly* what Nassim Taleb calls Antifragility: https://en.wikipedia.org/wiki/Antifragility. I don’t believe David Woods is unaware of it, but it’s strange that he didn’t mention it in the paper.
Given that Antifragile was published in 2012, and David Woods has been doing work in this area for decades, what I wonder is: why does Taleb never mention Woods (or any of the research into resilience engineering) at all?
I don’t want to defend either of them. Taleb has also been publishing work in this area for two decades (though it is true that he barely mentions anyone else in his work). Woods’ paper introducing “graceful extensibility” specifically came out in 2018, well after *Antifragile*. I think it would be more productive to stick with an already existing term than to multiply terminology, and I don’t think that would take anything away from Woods’ work.