Incident analysis as guerrilla case study research

Today I tweeted this:

To which Sasha Rosenbaum asked:

This post is my response.

We seldom have time for introspection at work. If we’re lucky, we have the opportunity to do some kind of retrospective at the end of a project or sprint. But, generally speaking, we’re too busy working to spend time examining that work.

One exception to this is incidents: organizations are willing to spend effort on introspection after an incident happens. That’s because incidents are unsettling: people feel uneasy that the system failed in a way they didn’t expect.

And so, an organization is willing to spend precious engineering cycles in order to rid itself of the uneasy feeling that comes with a system failing unexpectedly. Let’s make sure this never happens again.

Incident analysis, in the learning from incidents in software (LFI) sense, is about using an incident as an opportunity to get a better understanding of how the overall system works. It’s a kind of case study, where the case is the incident. The incident acts as a jumping-off point for an analyst to study an aspect of the system. Just like any other case study, it involves collecting and synthesizing data from multiple sources (e.g., interviews, chat transcripts, metrics, code commits).

I call it a guerrilla case study because, from the organization’s perspective, the goal is really to get closure, to have a sense that all is right with the world. People want to get to a place where the failure mode is now well-understood and measures will be put in place to prevent it from happening again. As LFI analysts, we’re exploiting this desire for closure to justify spending time examining how work is really done inside of the system.

Ideally, organizations would recognize the value of this sort of work, and would make it explicit that the goal of incident analysis is to learn as much as possible. They’d also invest in other types of studies that look into how the overall system works. Alas, that isn’t the world we live in, so we have to sneak this sort of work in where we can.

The British General’s problem

Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.
“Forward, the Light Brigade!
Charge for the guns!” he said:
Into the valley of Death
Rode the six hundred.

“Forward, the Light Brigade!”
Was there a man dismay’d?
Not tho’ the soldier knew
Some one had blunder’d:
Theirs not to make reply,
Theirs not to reason why,
Theirs but to do and die:
Into the valley of Death
Rode the six hundred.

Cannon to right of them,
Cannon to left of them,
Cannon in front of them
Volley’d and thunder’d;
Storm’d at with shot and shell,
Boldly they rode and well,
Into the jaws of Death,
Into the mouth of Hell
Rode the six hundred.

Flash’d all their sabres bare,
Flash’d as they turn’d in air
Sabring the gunners there,
Charging an army, while
All the world wonder’d:
Plunged in the battery-smoke
Right thro’ the line they broke;
Cossack and Russian
Reel’d from the sabre-stroke
Shatter’d and sunder’d.
Then they rode back, but not
Not the six hundred.

Cannon to right of them,
Cannon to left of them,
Cannon behind them
Volley’d and thunder’d;
Storm’d at with shot and shell,
While horse and hero fell,
They that had fought so well
Came thro’ the jaws of Death,
Back from the mouth of Hell,
All that was left of them,
Left of six hundred.

When can their glory fade?
O the wild charge they made!
All the world wonder’d.
Honor the charge they made!
Honor the Light Brigade,
Noble six hundred!

The Charge of the Light Brigade, Alfred Lord Tennyson

One of the challenges of building distributed systems is that communications channels are unreliable. This challenge is captured in what is known as the Two Generals’ Problem. It goes something like this:

Two generals from the same army are camped at opposite sides of a valley. The enemy troops are stationed in the valley. The generals need to coordinate their attack in order to succeed. Neither general will risk their troops unless they know for certain that the other general will attack at the same time.

They only mechanism the generals have to communicate is by sending messages via carrier pigeon that fly over the valley. The problem is that enemy archers in the valley are sometimes able to shoot down these carrier pigeons, and so a general sending a message has no guarantee that it will be received by the other general.

Two generals at opposing sides of a valley
One general sends a message to another across the valley via carrier pigeon

The first general sends the message: Let’s attack Monday at dawn, and waits for an acknowledgment. If he doesn’t get an acknowledgment, he can’t be certain the message was received, and so he can’t attack. He receives an acknowledgment in return. But now he thinks, “What if the other general doesn’t know I’ve received the acknowledgment? He knows I won’t carry out the attack unless I know for certain that he agrees to the plan, and he doesn’t know whether I’ve received an acknowledgment or not. I’d better send a message acknowledging the message.”

This becomes an infinite regress problem: it turns out that no number of acknowledgments and acknowledgment-of-acknowledgments that can ensure that the two generals have common knowledge (“I know that he knows that I know that he knows…”) when communicating over an unreliable channel.

One implicit assumption of the Two Generals’ Problem is that when a message is received, it is understood perfectly by the recipient. In the world of real systems, understanding the intent of a message is often a challenge. I’m going to call this the British General’s Problem, in honor of General Raglan.

Raglan was a British general during the Crimean War. The messages he sent to Lord Lucan at the front were misinterpreted, and these misinterpretations contributed to a disastrous attack by British Light Calvary against fortified Russian troops, famously memorialized in Alfred Lord Tennyson’s poem at the top of this post.

Stick figure holding a doc and confused by it
Lord Lucan is confused by General Raglan’s message

Tim Harford tells the story of the misunderstanding, and the different personalities involved in the affair, in the wonderful episode of the Cautionary Tales podcast The Curse of Knowledge meets the Valley of Death.

This story is a classic example of a common ground breakdown, a phenomenon described in the famous paper Common Ground and Coordination in Joint Activity by Gary Klein, Paul Feltovich, Jeffrey Bradshaw and David Woods. In the paper, the authors describe how common ground is necessary for people to coordinate effectively to achieve a shared goal. A common ground breakdown happens when a misunderstanding occurs among the people coordinating.

If you do operations work, and you’ve been involved in remediating an incident while communicating over Slack, you’ve experience the challenge of maintaining common ground. Because common ground breakdowns are common, coordination requires ongoing effort to repair this common ground. This is the topic of Laura Maguire’s PhD thesis: Controlling the Costs of Coordination in Large-scale Distributed Software Systems (which, I must admit, I haven’t yet read).

The British General’s problem is a reminder of the challenges we inevitably face in socio-technical systems. The real problem is not unreliable channels: it’s building and maintaining the shared understanding in order to get the coordination work done.