The danger of “insufficient virtue”

Nate Dimeo hosts a great storytelling podcast called The Memory Palace, where each episode is a short historical vignette. Episode 316: Ten Fingers, Ten Toes is about how people have tried to answer the question: “why are the bodies of some babies drastically different from the bodies of all others?”

The stories in this podcast usually aren’t personal, but this episode is an exception. Dimeo recounts how his great-aunt, Anna, was born without fingers on her left hand. Anna’s mother (Dimeo’s great-grandmother) blamed herself: when pregnant, she had been startled by a salesman knocking on the back door, and had bitten her knuckles. She had attributed the birth defect to her knuckle-biting.

We humans seem to be wired to attribute negative outcomes to insufficiently virtuous behavior. This is particularly apparent in the writing style of many management books. Here are some quotes from a book I’m currently reading.

For years, for example, American manufacturers thought they had to choose between low cost and high quality… They didn’t realize that they could have both goals, if they were willing to wait for one while they focused on the other.

Whenever a company fails, people always point to specific events to explain the “causes” of the failure: product problems, inept managers, loss of key people, unexpectedly aggressive competition, or business downturns. Yet, the deeper systemic causes for unsustained growth go unrecognized.

Why wasn’t that balancing process noticed? First, WonderTech’s financially oriented top management did not pay much attention to their delivery service. They mainly tracked sales, profits, return on investment, and market share. So long as these were healthy, delivery times were the least of their concerns.

Such litanies of “negative visions” are sadly commonplace, even among very successful people. They are the byproduct of a lifetime of fitting in, of coping, of problem solving. As a teenager in one of our programs once said, “We shouldn’t call them ‘grown ups’ we should call them ‘given ups.’”

Peter Senge, The Fifth Discipline

In this book (The Fifth Discipline), Senge associates the principles he is advocating for (e.g., systems thinking, personal mastery, shared vision) with virtue, and the absence of these principles with vice. The book is filled with morality tales of the poor fates of companies due to insufficiently virtuous executives, to the point where I feel like I’m reading Goofus and Gallant comics.

This type of moralized thinking, in which poor outcomes are attributed to insufficiently virtuous behavior, is a cancer on our ability to understand incidents. It’s seductive to blame an incident on someone being greedy (an executive) or sloppy (an operator) or incompetent (a software engineer). Just think back to your reactions to incidents like the Equifax data breach or the California wildfires.

The temptation to attribute responsibility when bad things happen is overwhelming. You can always find greed, sloppiness, and incompetence if that’s what you’re looking for. We need to fight that urge. When trying to understand how an incident happened, we need to assume that all of the people involved were acting reasonably given the information they had at the time. That assumption is the difference between explaining incidents away and learning from them.

(Oh, and you’ll probably want to check out the Field Guide to Understanding ‘Human Error’ by Sidney Dekker).

Notes on David Woods’s Resilience Engineering short course

David Woods has a great series of free online lectures on resilience engineering. After watching those lectures, I found that a lot of the material clicked for me in a way that it never really did from reading his papers.

Woods writes about systems at a very general level: the principles he describes could apply to cells, organs, organisms, individuals, teams, departments, companies, ecosystems, socio-technical systems, pretty much anything you could describe using the word “system”. This generality means that he often uses abstract concepts, which apply to all such systems. For example, Woods talks about units of adaptive behavior, competence envelopes, and florescence. Abstractions that apply in a wide variety of contexts are very powerful, but reading about them is often tough going (cf. category theory).

In the short course lectures, Woods really brings these concepts to life. He’s an animated speaker (especially when you watch him at 2X speed). It’s about twenty hours of lectures, and he packs a lot of concepts into those twenty hours.

I made an effort to take notes as I watched the lectures. I’ve posted my notes to GitHub. But, really, you should watch the videos yourself. It’s the best way to get an overview of what resilience engineering is all about.

Our brittle serverless future

I’m really enjoying David Woods’s Resilience Engineering short course videos. In Lecture 9, Woods mentions an important ingredient in a resilient system: the ability to monitor how hard you are working to stay in control of the system.

I was thinking of this observation in the context of serverless computing. In serverless, software engineers offload the responsibility of resource management to a third-party organization, which handles it transparently for them. No more thinking in terms of servers, instance types, CPU utilization, and memory usage!

The challenge is this: from the perspective of a customer of a serverless provider, you don’t have visibility into how hard the provider is working to stay in control. If the underlying infrastructure is nearing some limit (e.g., amount of incoming traffic it can handle), or if it’s operating in degraded mode because of an internal failure, these challenges are invisible to you as a customer.

Woods calls this phenomenon the veil of fluency. From the customer’s perspective, everything is fine. Your SLOs are all still being met! However, from the provider’s perspective, the system may be very close to the boundary, the point where it falls over.
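To make this concrete, here is a hypothetical sketch (in Python, with made-up metric names and SLO thresholds, not any particular provider’s API) of the kind of health check a customer can run against a serverless backend. The point is what the check cannot see: its only inputs are externally observable signals, so it has no way to express how hard the provider is working to stay in control.

```python
# Hypothetical sketch: a customer-side SLO check for a serverless backend.
# The only inputs are signals the customer can observe from their side of the
# boundary (latency, error rate). There is no field for the provider's internal
# load, capacity headroom, or degraded-mode status, because those signals never
# cross the provider/customer boundary.

from dataclasses import dataclass


@dataclass
class ObservedMetrics:
    p99_latency_ms: float  # measured from the customer's side
    error_rate: float      # fraction of failed requests, customer-visible


# Illustrative SLO thresholds only.
P99_LATENCY_SLO_MS = 500.0
ERROR_RATE_SLO = 0.01


def slo_check(metrics: ObservedMetrics) -> str:
    """Report health based solely on externally observable signals."""
    if metrics.p99_latency_ms <= P99_LATENCY_SLO_MS and metrics.error_rate <= ERROR_RATE_SLO:
        # Everything looks fine from here, even if the provider is operating
        # close to its limits or in a degraded internal mode.
        return "healthy"
    return "slo violated"


# Example: the check keeps reporting "healthy" right up until the provider
# crosses its own (invisible) boundary and the system falls over.
print(slo_check(ObservedMetrics(p99_latency_ms=220.0, error_rate=0.001)))
```

As long as the observed numbers stay under the thresholds, a check like this reports “healthy”; that’s the veil of fluency from the customer’s side of the boundary.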

Woods also talks about the importance of reciprocity in resilient organizations: how different units of adaptive behavior synchronize effectively when a crunch happens and one of them comes under pressure. In a serverless environment, you lose reciprocity because there’s a hard boundary between the serverless provider and its customers. If your system is deployed in a serverless environment, and a major incident happens where the serverless system is a contributing factor, nobody from your serverless provider is going to be in the Slack channel or on the conference bridge.

I think Simon Wardley is correct in his prediction that serverless is the future of software deployment. The tools are still immature today, but they’ll get there. And systems built on serverless will likely be more robust, because the providers will have more expertise in resource management and fault tolerance than their customers do.

But every system eventually reaches its limit. One day a large-scale serverless-based software system is going to go past the limit of what it can handle. And when it breaks, I think it’s going to break quickly, without warning, from the customer’s perspective. And you won’t be able to coordinate with the engineers at your serverless provider to bring the system back into a good state, because all you’ll have is a set of APIs.