Experts aren’t good at building shared understanding

If only HP knew what HP knows, we would be three times more productive.

Lew Platt, former CEO of Hewlett-Packard

One pattern that you see over and over again in operational surprises is that a person involved in the surprise was missing some critical bit of information. For example, there may be an implicit contract that is violated when someone makes a code change. Or there might be a batch job that runs every Tuesday at 4 PM and puts additional load on the database.

Almost always, this kind of information is present in the head of someone else within the organization. It just wasn’t in the head of the person who really needed it at that moment.

I think the problem of missing information is well enough understood that you see variants of it crop up in different places.

It turns out that experts are very good at accumulating these critical bits of information and recalling them at the appropriate time. Experts are also very good at communicating efficiently with others who share a lot of that critical information in their heads.

However, what experts are not very good at is transmitting this information to others who don’t yet have it. Experts aren’t explicitly aware of the value of all of this information, and so they tend not to volunteer it without being asked. When a newcomer watches an expert in action, a common refrain is, “How did you know to do that?”

The fact that experts aren’t good at sharing the useful information they know is one of the challenges that incident investigators face. One of the skills of an investigator is knowing how to elicit these bits of knowledge through interviews.

I think that advancing shared understanding in an organization has the potential to be enormously valuable. One of the things that I hope to accomplish with sharing out writeups of operational surprises is to use them as a vehicle for doing so.

Even if there isn’t a single actionable outcome from a writeup, you never know when a critical bit of knowledge planted in the heads of its readers will come in handy.

Tuning to the future

In short, the resilience of a system corresponds to its adaptive capacity tuned to the future. [emphasis added]

Branlat, Matthieu & Woods, David. (2010). How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective. AAAI Fall Symposium – Technical Report

In simple terms, an incident is a bad thing that has happened that was unexpected. This is just the sort of thing that makes people feel uneasy. Instinctively, we want to be able to say “We now understand what has happened, and we are taking the appropriate steps to make sure that this never happens again.”

But here’s the thing. Taking steps to prevent the last incident from recurring doesn’t do anything to help you deal with the next incident, because your steps will have ensured that the next one is going to be completely different. There is, however, one thing that your next incident will have in common with the last one: both of them are surprises.

We can’t predict the future, but we can get better at anticipating surprise, and dealing with surprise when it happens. Getting better at dealing with surprise is what resilience engineering is all about.

The first step is accepting that surprise is inevitable. That’s hard to do. We want to believe that we are in control of our systems, that we’ve plugged all of the holes. Sure, we may have had a problem before, but we fixed that. If we can just take the time to build it right, it’ll work properly.

Accepting that future operational surprises are inevitable isn’t natural for engineers. It’s not the way we think. We design systems to solve problems, and one of those problems is staying up. We aren’t fatalists.

However, once we do accept that operational surprise is inevitable, we can shift our thinking from the computer-based system to the broader socio-technical system that includes both the people and the computers. The solution space looks very different, because we aren’t used to designing systems where people are part of the system, especially when we engineers are part of the system we’re building!

But if we want the ability to handle whatever the future throws at us, then we need to get better at dealing with surprise. Computers are lousy at this: they can’t adapt to situations they weren’t designed to handle. But people can.

In this frame, accepting that operational surprises are inevitable isn’t fatalism. Building adaptive capacity to deal with future surprises is how we tune to the future.