December 2021 – Surfing Complexity

All ambiguity is resolved by actions of practitioners at the sharp end of the system.
Dr. Richard I. Cook, How Complex Systems Fail

There’s a wonderful book by the late urban planning professor Donald Schön called The Reflective Practitioner: How Professionals Think in Action. In the first chapter, he discusses the “rigor or relevance” dilemma that faces educators in professional degree programs. In the case of a university program aimed at preparing students for a career in software development, this is the “should we teach topological sort or React?” question.

Schön argues that the dilemma itself is a fundamental misunderstanding of the nature of professional work. What it misses is the ambiguity and uncertainty inherent in the work of professional life. The “rigor vs relevance” debate is an argument over the best way to get from the problem to the solution: do you teach the students first principles, or do you teach them how to use the current set of tools? Schön observes that a more significant challenge for professionals is defining the problems to solve in the first place, since an ill-defined problem admits no technical solution at all.

In the varied topography of professional practice, there is a high, hard ground where practitioners can make effective use of research-based theory and technique, and there is a swampy lowland where situations are confusing “messes” incapable of technical solution. The difficulty is that the problems of the high ground, however great their technical interest, are often relatively unimportant to clients or to the larger society, while in the swamp are the problems of greatest human concern.

His use of the term “messes” evokes Russell Ackoff’s use of the term in his paper The Future of Operational Research is Past:

Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes. Problems are abstractions extracted from messes by analysis; they are to messes as atoms are to tables and chairs. We experience messes, tables, and chairs; not problems and atoms

To take another example from the software domain. Imagine that you’re doing quarterly planning, and there’s a collection of reliability work that you’d like to do, and you’re trying to figure out how to prioritize it. You could apply a rigorous approach, where you quantify some values in order to do the prioritization work, and so you try to estimate information like:

the probability of hitting a problem if the work isn’t done
the cost to the organization if the problem is encountered
the amount of effort involved in doing the reliability work

But you’re soon going to discover the enormous uncertainty involved in trying to put a number on any of those things. And, in fact, doing any reliability work can actually introduce new failure modes.

Over and over, I’ve seen the theme of ambiguity and uncertainty appear in ethnographic research that looks at professional work in action. In Designing Engineers, the aerospace engineering professor Louis Bucciarelli did an ethnographic study of engineers in a design firm, and discovered that the engineers all had partial understanding of the problem and solution space, and that their understandings also overlapped only partially. As a consequence, a lot of the engineering work that was done actually involved engineers resolving their incomplete understanding through various forms of communication, often informal. Remarkably, the engineers were not themselves aware of this process of negotiating understandings of the problems and solutions.

The famous Common Ground and Coordination in Joint Activity paper by Gary Klein, Paul Feltovich, and David Woods, makes explicit the role that ambiguity plays in human coordination and communication.

You’ll sometimes hear researchers who study work talk about the process of sensemaking. For example, there’s a paper by Sana Albolino, Richard Cook, and Micahel O’Connor called Sensemaking, safety, and cooperative work in the intensive care unit that describes this type of work in an intensive care unit. I think of sensemaking as an activity that professionals perform to try to resolve ambiguity and uncertainty.

(Ambiguity isn’t always bad. In the book On Line and On Paper, the sociologist Kathryn Henderson describes how engineers use engineering drawings as boundary objects. These are artifacts are that are understood differently by the different stakeholders: two engineers looking at the same drawing will have different mental models of the artifact based on their own domain expertise(!). However, there is also overlap in their mental models, and it is this combination of overlap and the fact that individuals can use the same artifact for different purposes that makes it useful. Here the ambiguity has actual value! In fact, her research shows that computer models, which eliminate the ambiguity, were less useful for this sort of work).

As practitioners, we have no choice: we always have to deal with ambiguity. As noted by Richard Cook in the quote that opens this blog post, we are the ones, at the sharp end, that are forced to resolve it.

Until now, if you wanted to improve your organization’s ability to learn from incidents, there wasn’t a lot of how-to style material you could draw from. Sure, there were research papers you could read (oh, so many research papers!). But academic papers aren’t a great source of advice for someone who is starting on an effort to improve how they do incident analysis.

There simply weren’t any publications about how to get started with doing incident investigations which were targeted at the infotech industry. Your best bet was the Etsy Debrief Facilitation Guide. It was practical, but it focused on only a single aspect of the incident investigation process: the group incident retrospective meeting. And there’s so much more to incident investigation than that meeting.

The folks at Jeli have stepped up to the challenge. They just released Howie: The Post-Incident Guide.

We made a new thing! Jeli's all about evolving the industry, but building a new post-incident process is…super hard. So we did it for you. Introducing📄✨ Howie: The Post- Incident Guide. 📄✨ Download it, print it, share it, frame it. Use it to evolve. https://t.co/O4TF68oiAh
— Jeli.io (@jeli_io) December 8, 2021

Readers of this blog will know that this is a topic near and dear to my heart. The name “Howie” is short for “How we got here“, which is what we call our incident writeups at Netflix. (This isn’t a coincidence: we came up with this name at Netflix when Nora Jones of Jeli and I were on the CORE team).

Writing a guide like this is challenging, because so much of incident investigation is contextual: what you look at it, what questions you ask, will depend on what you’ve learned so far. But there are also commonalities across all investigations; the central activities (constructing timelines, doing one-on-one interviews, building narratives) happen each time. The Howie guide gently walks the newcomer through these. It’s accessible.

When somebody says, “OK, I believe there’s value in learning more from incidents, and we want to go beyond doing a traditional root-cause-analysis. But what should I actually do?”, we now have a canonical answer: go read Howie.

Month: December 2021

The ambiguity of real work

The Howie Guide: How to get started with incident investigations