On the writing styles of some resilience engineering researchers

This post is a brief meditation on the writing styles of four luminaries in the field of resilience engineering: Drs. Erik Hollnagel, David Woods, Sidney Dekker, and Richard Cook.

This post was inspired by a conversation I had with my colleague J. Paul Reed. You can find links to papers by these authors at resiliencepapers.club.

Erik Hollnagel – the framework builder

Hollnagel often writes about frameworks or models. A framework is the sort of thing that you would illustrate with a box and arrow diagram, or a table with two or three columns. Here are some examples of Hollnagellian frameworks:

  • Safety-I vs. Safety-II
  • Functional Resonance Analysis Method (FRAM)
  • Resilience Analysis Grid (RAG)
  • Contextual Control Model (COCOM)
  • Cognitive Reliability and Error Analysis Method (CREAM)

Of the four researchers, Hollnagel’s writings read the most like traditional academic writing. Even his book Joint Cognitive Systems: Foundations of Cognitive Systems Engineering feels like something out of an academic journal. Of the four authors, he is the one whose writing I struggle the most to gain insight from. Ironically, one of my favorite concepts I learned from him, the ETTO principle, is presented more as a pattern in the style of Woods, as described below.

David Woods – the pattern oracle

I believe that a primary goal of academic research is to identify patterns in the world that had not been recognized before. By this measure, David Woods is the most productive researcher I have encountered, in any field! Again and again, Woods identifies patterns inherent in the nature of how humans work and interact with technology, by looking across an extremely broad range of human activity, from power plant controllers to astronauts to medical doctors. Gaining insight from his work is like discovering there’s a white arrow in the FedEx logo: you never noticed it was there before it was pointed out, and now that you know, it’s impossible not to see it.

These patterns are necessarily high-level, and Woods invents new vocabulary out of whole cloth to capture these new concepts. His writing contains terms like anomaly response, joint cognitive systems, graceful extensibility, units of adaptive behavior, net adaptive value, crunches, competence envelopes, dynamic fault management, adaptive stalls, and veils of fluency.

In Woods’s writing, he often introduces or references many new concepts, and writes about how they interact with each other. This style of writing tends to be very abstract. I’ve found that if I can map the concepts back into my own experiences in the software world, then I’m able to internalize them and they become powerful tools in my conceptual toolbox. But if I can’t make the connection, then I struggle to find a handhold for scaling his writing. It wasn’t until I watched his video lectures, where he discussed many concrete examples, that I was able to really understand many of his concepts.

Sidney Dekker – the public intellectual

Of the four researchers, Dekker produces the most work written for a lay audience. My entrance into the world of resilience engineering was through his book Drift into Failure. Dekker’s writings tend toward the philosophical, but they don’t read like academic philosophy papers. Rather, it’s more of the “big idea” kind of writing, similar in spirit (although not in tone) to the kinds of books that Nassim Taleb writes. In that sense, Dekker’s writing can go even broader than Woods’s, as Dekker muses on the perception of reality. He is the only one I can imagine writing books with titles such as Just Culture, The Safety Anarchist, or The End of Heaven.

Dekker often writes about how different worldviews shape our understanding of safety. For example, one of his more well-known papers contrasts “new” and “old” views on the nature of human error. In Drift Into Failure, he writes about the Newtonian-Cartesian worldview and contrasts it with a systems perspective. But he doesn’t present these worldviews as frameworks in the way that Hollnagel would. They are less structured, more qualitatively elaborated.

I’m a fan of the “big idea” style of non-fiction writing, and I was enormously influenced by Drift into Failure, which I found extremely readable. However, I’m particularly receptive to this style of writing, and most of my colleagues tend to prefer his Field Guide to Understanding ‘Human Error’, which is more practical.

Richard Cook – the raconteur

Cook’s most famous paper is likely How Complex Systems Fail, but that style of writing isn’t what comes to mind when I think of Cook (that paper is more of a Woods-ian identification of patterns).

Cook is the anti-Hollnagel: where Hollnagel constructs general frameworks, Cook elaborates the details of specific cases. He’s a storyteller, who is able to use stories to teach the reader about larger truths.

Many of Cook’s papers examine work in the domain of medicine. Because Cook has a medical background (he was a practicing anesthesiologist before he was a researcher), he has deep knowledge of that domain and is able to use it to great effect in his analysis of the interactions between humans, technology, and work. A great example of this is his paper on the allocation of ICU beds, Being Bumpable. His Re-Deploy talk entitled The Resilience of Bone and Resilience Engineering is another example of leveraging the details of a specific case to illustrate broader concepts.

Of the four authors, I think that Cook is the one who is most effective at using specific cases to explain complex concepts. He functions almost as an interpreter, grounding Woods-ian concepts in concrete practice. It’s a style of writing that I aspire to. After all, there’s no more effective way to communicate than to tell a good story.

Chasing down the blipperdoodles

To a first approximation, there are two classes of automated alerts:

  1. A human needs to look at this as soon as possible (page the on-call!)
  2. A human should eventually investigate, but it isn’t urgent (email-only alert)

This post is about the second category: events like an error spike at 2am that can wait until business hours to look into.
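If you were to sketch that distinction in code (the names and notification channels here are hypothetical, not any particular alerting system’s API), it might look something like this:

    from dataclasses import dataclass
    from enum import Enum, auto

    class Urgency(Enum):
        PAGE = auto()         # a human needs to look at this as soon as possible
        INVESTIGATE = auto()  # a human should eventually look, but it isn't urgent

    @dataclass
    class Alert:
        name: str
        urgency: Urgency

    def route(alert: Alert) -> str:
        """Decide how to notify a human about this alert."""
        if alert.urgency is Urgency.PAGE:
            return "page the on-call"
        return "send an email for business-hours follow-up"

    # A 2am error spike that can wait until morning falls in the second category.
    print(route(Alert("2am-error-spike", Urgency.INVESTIGATE)))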

When I was on the CORE [1] team, one of the responsibilities of team members was to investigate these non-urgent alert emails. The team colorfully referred to them as blipperdoodles [2], presumably because they look like blips on the dashboard.

I didn’t enjoy this part of the work. Blipperdoodles can be a pain to track down, are often not actionable (e.g., networking transient), and, in the tougher cases, are downright impossible to make sense of. This means that the work feels largely unsatisfying. As a software engineer, I’ve felt a powerful instinct to dismiss transient errors, often with a joke about cosmic rays.

But I’ve really come around on the value of chasing down blipperdoodles. Looking back, they gave me an opportunity to practice doing diagnostic work, in a low-stakes environment. There’s little pressure on you when you’re doing this work, and if something more urgent comes up, the norms of the team allow you to abandon your investigation. After all, it’s just a blipperdoodle.

Blipperdoodles also tend to be a good mix of simple and difficult. Some of them are common enough that experienced engineers can diagnose them by the shape of the graphs. Others are so hard that an engineer has to admit defeat once they reach their self-imposed timebox for the investigation. Most are in between.

Chasing blipperdoodles is a form of operational training. And while it may be frustrating to spend your time tracking down anomalies, you’ll appreciate the skills you’ve developed when the heat is on, which is what happens when everything is on fire.

[1] CORE stands for Critical Operations & Reliability Engineering. They’re the centralized incident management team at Netflix.

[2] I believe Brian Trump coined the term.

The inevitable double bind

Here are three recent COVID-19 news stories:

The first two stories are about large organizations (the FDA, large banks) moving too slowly in order to comply with regulations. The third story is about the risks of the FDA moving too quickly.

Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.

In hindsight, it’s easy to identify who wasn’t quick enough and who wasn’t careful enough. But if you want to understand how agents make these decisions, you need to understand the multiple pressures that agents experience, because they are trading these off. You also need to understand what information they had available at the time, as well as their previous experiences. I thought this observation of the behavior of the banks was particularly insightful.

But it does tell a more general story about the big banks, that they have invested so much in at least the formalities of compliance that they have become worse than small banks at making loans to new customers.

Matt Levine

Reactions to previous incidents have unintended consequences for the future. The conclusion to draw here isn’t that “the banks are now overregulated”. Rather, it’s that double binds are unavoidable: we can’t eliminate them by adding or removing regulations. There’s no perfect knob setting where they don’t happen anymore.

Once we accept that double binds are inevitable, we can shift our focus away from just adjusting the knob and toward work that will prepare agents to make more effective decisions when they inevitably encounter the next double bind.

Embracing the beautiful mess

Nobody likes a mess. Especially in the world of software engineering, we always strive to build well-structured systems. No one ever sets out to build a big ball of mud.

Alas, we are constantly reminded that the systems we work in are messier than we’d like. This messiness often comes to light in the wake of an incident, when we dig in to understand what happened. Invariably, we find that people are a particularly messy part of the overall system, and that the actions they take contribute to incidents. In the wake of the incident, we identify follow-up work that we hope will bring more order, and less mess, into our world. What we miss, though, is the role that the messy nature of our systems plays in keeping things working.

When I use the term system here, I mean it in the broader sense of a socio-technical system that includes both the technological elements (software, hardware) and the humans involved, the operators in particular.

Yes, there are neat, well-designed structures in place that help keep our system healthy: elements that include automated integration tests, canary deployments, and staffed on-call rotations. But complementing those structures are informal layers of defense provided by the people in our system. These are the teammates who are not on-call but jump in to help, or folks who just happen to lurk in Slack channels and provide key context at the right moment, to either help diagnose an incident or prevent one from happening in the first place.

This informal, messy system of defense is like a dynamic, overlapping patchwork. And sometimes this system fails: for example, a person who would normally chime in with relevant information happens to be out of the office that day. Or, someone takes an action which, under typical circumstances, would be beneficial, but which, under the specific circumstances of the incident, actually makes things worse.

We would never set out to design a socio-technical system the way our systems actually are. Yet, these organic, messy systems actually work better than the neat, orderly systems that engineers dream of, because of how the messy system leverages human expertise.

It’s tempting to bemoan messiness, and to always try to reduce it. And, yes, messiness can be an indicator of problems: people relying on workarounds instead of using the system as intended, for example, is a kind of messiness that points to a shortcoming in the system.

But the human messiness we see under the ugly light of failure is the messiness that actually helps keep the system up and running when that light isn’t shining. If we want to get better at keeping our systems up and running, we need to understand what the mess looks like when things are actually working. We need to learn to embrace the mess. Because there’s beauty in that mess, the beauty of a system that keeps on running day after day.

How did software get so reliable without proof?

In 1996, the Turing-award-winning computer scientist C.A.R. Hoare wrote a paper with the title How Did Software Get So Reliable Without Proof? In this paper, Hoare grapples with the observation that software seems to be more reliable than computer science researchers expected was possible without the use of mathematical proofs for verification (emphasis added):

Twenty years ago it was reasonable to predict that the size and ambition of software products would be severely limited by the unreliability of their component programs … Dire warnings have been issued of the dangers of safety-critical software controlling health equipment, aircraft, weapons systems and industrial processes, including nuclear power stations … Fortunately, the problem of program correctness has turned out to be far less serious than predicted …

So the questions arise: why have twenty years of pessimistic predictions been falsified? Was it due to successful application of the results of the research which was motivated by the predictions? How could that be, when clearly little software has ever been subjected to the rigours of formal proof?

Hoare offers five explanations for how software became more reliable: management, testing, debugging, programming methodology, and (my personal favorite) over-engineering.

Looking back on this paper, what strikes me is the absence of acknowledgment of the role that human operators play in the types of systems that Hoare writes about (health equipment, aircraft, weapons systems, industrial processes, nuclear power). In fact, the only time the word “operator” appears in the text is when it precedes the word “error” (emphasis added):

The ultimate and very necessary defence of a real time system against arbitrary hardware error or operator error is the organisation of a rapid procedure for restarting the entire system.

Ironically, the above line is the closest Hoare gets to recognizing the role that humans can play in keeping the system running.

The problem with the question “How did software get so reliable without proof?” is that it’s asking the wrong question. It’s not that software got so reliable without proof: it’s that systems that include software got so reliable without proof.

By focusing only on the software, Hoare missed the overall system. And whether you call them socio-technical systems, software-intensive systems, or joint cognitive systems, if you can’t see the larger system, you are doomed to not even be able to ask the right questions.

Rebrand: Surfing Complexity

You can’t stop the waves, but you can learn to surf.

Jon Kabat-Zinn

When I started this blog, my primary interests were around software engineering and software engineering research, and that’s what I mostly wrote about. Over time, I became more interested in complex systems that include software, sometimes referred to as socio-technical systems. That attracted me initially to chaos engineering, and, more recently, to learning from incidents and resilience engineering.

To reflect the more recent focus on complex systems, I decided to rebrand this blog Surfing Complexity. The term has two inspirations: the quote from Jon Kabat-Zinn at the top of this post, and the book title Surfing Uncertainty by Andy Clark. I also gave the blog a new domain name: surfingcomplexity.blog.

In my experience, software engineers recognize the challenge of complexity, but their primary strategy for addressing complexity is to try to reduce it (and, when they don’t have the resources to do so, to complain about it). By contrast, the resilience engineering community recognizes that complexity is inevitable in the adaptive universe, and seeks to understand what we can do to navigate complexity more effectively.

While I think that we should strive to reduce complexity where possible, I also believe that most strategies for increasing the robustness or safety of a system will ultimately lead to an increase in complexity. As an example, consider an anti-lock braking system in a modern car. It’s a safety feature, but it clearly increases the complexity of the automobile.

I really like Kabat-Zinn’s surfing metaphor, because it captures the idea that complexity is inevitable: getting rid of it isn’t an option. However, we can get better at dealing with it.

Rehabilitating “you can’t manage what you can’t measure”

There’s a management aphorism that goes “you can’t manage what you can’t measure”. It is … controversial. W. Edwards Deming, for example, famously derided it. But I think there are two ways to interpret this quote, and they have very different takeaways.

One way to read this is to treat the word measure as a synonym for quantify. When John Allspaw rails against aggregate metrics like mean time to resolve (MTTR), he is siding with Deming in criticizing the idea of relying solely on aggregate, quantitative metrics for gaining insight into your system.
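To see why aggregates can mislead, here’s a toy illustration (the numbers are made up): two quarters with identical MTTR that describe very different operational realities.

    from statistics import mean

    # Hypothetical incident durations, in minutes, for two quarters.
    q1 = [30, 35, 25, 40, 30, 20]  # consistently moderate incidents
    q2 = [5, 5, 10, 5, 150, 5]     # mostly trivial blips plus one long outage

    print(f"Q1 MTTR: {mean(q1):.0f} minutes")  # 30
    print(f"Q2 MTTR: {mean(q2):.0f} minutes")  # 30
    # The aggregate is identical, but the stories behind the numbers are not.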

But there’s another way to interpret this aphorism, and it depends on an alternate interpretation of the word measure. I think that observing any kind of signal is a type of measurement. For example, if you’re having a conversation with someone, and you notice something in their tone of voice or their facial expression, then you’re engaged in the process of measurement. It’s not quantitative, but it represents information you’ve collected that you didn’t have before.

By generalizing the concept of measurement, I would recast this aphorism as: what you aren’t aware of, you can’t take into account.

This may sound like a banal observation, but the subtext here is “… and there’s a lot you aren’t taking into account.” A lot of what happens in your organization, in your system, is largely invisible. And that invisible work is what keeps things up and running.

The concept that there’s invisible work happening that’s creating your availability is at the heart of the learning from incidents in software movement. And it isn’t obvious, even though we all experience it directly.

This invisible work is valuable in the sense that it’s contributing to keeping your system healthy. But the fact that it’s invisible is dangerous because it can’t be taken into account when decisions are made that change the system. For example, I’ve seen technological changes that have made it more difficult for the incident management team to diagnose what’s happening in the system. The teams who introduced those changes were not aware of how the folks on the incident management team were doing diagnostic work.

In particular, one of the dangers of an action-item-oriented approach to incident reviews is that you may end up introducing a change to the system that disrupts this invisible work.

Take the time to learn about the work that’s happening that nobody else sees. Because if you don’t see it, you may end up breaking it.

An old lesson about a fish

Back when I was in college [1], I was required to take several English courses. I still remember an English professor handing out an excerpt from the book ABC of Reading by Ezra Pound [2]:

No man is equipped for modern thinking until he has understood the anecdote of Agassiz and the fish:

A post-graduate student equipped with honours and diplomas went to Agassiz to receive the final and finishing touches. The great man offered him a small fish and told him to describe it.

Post-Graduate Student: ‘That’s only a sunfish.’

Agassiz: ‘I know that. Write a description of it.’

After a few minutes the student returned with the description of the Ichthus Heliodiplodokus, or whatever term is used to conceal the common sunfish from vulgar knowledge, family of Heliichtherinkus, etc., as found in textbooks of the subject.

Agassiz again told the student to describe the fish.

The student produced a four-page essay. Agassiz then told him to look at the fish. At the end of three weeks the fish was in an advanced state of decomposition, but the student knew something about it.

I remember my eighteen-year-old self hating this anecdote. It sounded like Agassiz just wasted the graduate student’s time, leaving him with nothing but a rotting fish for his troubles. As an eventual engineering major, I had no interest in the work of analyzing texts that was required in English courses. I thought such analysis was a waste of time.

It would take about two decades for the lesson of this anecdote to sink into my brain. The lesson I eventually took away from it is that there is real value in devoting significant effort to close study of an object. If you want to really understand something, a casual examination just won’t do.

To me, this is the primary message of the learning from incidents in software movement. Doing an incident investigation, like studying the fish, will take time. Completing an investigation may take weeks, even months. Keep in mind, though, that you aren’t really studying an incident at all: you’re studying your system through the lens of an incident. And, even though the organization will have long since moved on, once you’re done, you’ll know something about your system.

[1] Technically it was CEGEP, but nobody outside of Quebec knows what that is.

[2] Pound is likely retelling an anecdote originally told by either Nathaniel Shaler or Samuel Hubbard Scudder, both of whom were students of Agassiz.

Have you seen this before?

Whenever I interview someone after an incident, a question I try to always ask is “have you ever seen a failure mode like this before?”

If the engineer says “yes”, then I will ask follow-up questions about what happened the last time they encountered something similar, and how long ago that happened. Experienced engineers’ perceptions are shaped by…well…their experiences, and learning about how they encountered a similar issue previously helps me understand how they reacted this time (e.g., why they looked in a log file for a particular error message, or why they reached out to a specific individual over Slack).

If the engineer says “no”, that tells me that the engineer was facing a novel failure mode. This is also a useful bit of context, because I want to learn how expert engineers deal with situations they haven’t previously encountered. How do they try to make sense of these signals they don’t recognize? Where do they look to gather more information? Who do they reach out to?

This is the sort of information that people are happy to share with you, but you have to ask the right questions to get it: they’re unlikely to volunteer it spontaneously, because they don’t realize how relevant it is to understanding the incident.

There is no escape from the adaptive universe

If I had to pick just one idea from the field of resilience engineering that has influenced me the most, it would be David Woods’s notion of the adaptive universe. In his 2018 paper titled The theory of graceful extensibility: basic rules that govern adaptive systems, Woods describes the two assumptions [1] of the adaptive universe:

  1. Resources are always finite.
  2. Change is ongoing.

That’s it! Just two simple assertions, but so much flows from them.

At first glance, the assumptions sound banal. Nobody believes in infinite resources! Nobody believes that things will stop changing! Yet, when we design our systems, it’s remarkable how often we fail to take these assumptions into account.

The future is always going to involve changes to our system that we could not foresee at design time, and those changes are always going to be made in a context where we are limited in resources (e.g., time, headcount) and hence will have to make tradeoffs. Instead, we tell ourselves a story about how next time, we’re going to build it right. But, we aren’t, because the next time we’ll also be resource constrained, and so we’ll have to make some decisions for reasons of expediency. And the next time, the system will also change in ways we could never have predicted, invalidating our design assumptions.

Because we are forever trapped in the adaptive universe.

[1] If you watch Woods’s online resilience engineering short course, which precedes this paper, he mentions a third property: surprise is fundamental. But I think this property is a consequence of the first two assumptions rather than requiring an additional assumption, and I suspect that’s why he doesn’t mention it as an assumption in his 2018 paper.