The strange beauty of strange loop failure modes

As I’ve posted about previously, at my day job, I work on a project called Managed Delivery. When I first joined the team, I was a little horrified to learn that the service that powers Managed Delivery deploy itself using Managed Delivery.

“How dangerous!”, I thought. What if we push out a change that breaks Managed Delivery? How will we recover? However, after having been on the team for over a year now, I have a newfound appreciation for this approach.

Yes, sometimes there’s something that breaks, and that makes it harder to roll back, because Managed Delivery provides the main functionality for easy rollback. However, it also means that the team gets quite a bit of practice at bypassing Managed Delivery when something goes wrong. They know how to disable Managed Delivery and use the traditional Spinnaker UI to deploy an older version. They know how to poke and prod at the database if the Managed Delivery UI doesn’t respond properly.

These strange loop failure modes are real: if Managed Delivery breaks, we may lose out on the functionality of Managed Delivery to help us recover. But it also means that we’re more ready for handling the situation if something with Managed Delivery goes awry. Yes, Managed Delivery depends on itself, and that’s odd. But we have experience with how to handle things when this strange loop dependency creates a problem. And that is a valuable thing.

Live-drawing my slides during a talk

The other day, I gave an internal talk, and I tried an experiment. Using my iPad and the GoodNotes app, I drew all of my slides while I was talking (except the first slide, which I drew in advance).

“What font is that?” someone asked. It’s my handwriting

I’ve always been in awe of people who can draw, I’ve never been good at it.

“Where’s the bug”, it says. Not my best handwriting

Over the years, I’ve tried doodling more. I was influenced by Dan Roam’s books, Julia Evans’s zines, sketchnotes, and most recently, Christina Wodtke’s Pencil Me In.

The words have stink lines, so you know they’re bad

If you’ve read my blog before, you’ve seen some of my previous doodles (e.g., Root cause of failure, root cause of success or Taming complexity: from contract to compact).

We need to complete the action items so it *never happens again*

When I was asked to present to a team, I wanted to use my drawings rather than do traditional slides. I actually hate using tools like PowerPoint and Google Slides to do presentations. Typically I use Deckset, but in this case, I wanted to do them all drawn.

I started off by drawing out my slides in advance. But then I thought, “instead of showing pre-drawn slides, why don’t I draw the slides as I talk? That way, people will know where to look because they’ll look at where I’m drawing.”

I still had to prepare the presentation in advance. I drew all of the slides beforehand. And then I printed them out and had them in front of me so that I could re-draw them during the talk. Since it was done over Zoom, people couldn’t actually see that I was working from the print-outs (although they might have heard the paper rustling).

Contributing factors aren’t like root cause

One benefit of this technique was that it made it easier to answer questions, because I could draw out my answer. When I was writing the text at the top, somebody asked, “Is that something like a root cause chain?” I drew the boxes and arrows in response, to explain how this isn’t chain-like, but instead is more like a web.

The selected images above should give you a sense of what my slides looked like. I had fun doing the presentation, and I’d try this approach again. It was certainly more enjoyable than futzing with slide layout.

Useful knowledge and improvisation

Eric Dobbs recently retold a story on twitter (a copy is on his wiki) about one of his former New Relic colleagues, Nicholas Valler.

At the time, Nicholas was new to the company. He had just discovered a security vulnerability, and then (unrelated to that security vulnerability), an incident happened and, well, I encourage you to read the whole story first, and then come back to this post.

Wow was that ever a great suggestion, @alicegoldfuss! What a story! 0/8
— Eric Dobbs (@dobbse) September 12, 2021

In the end, the engineers were able to leverage the security vulnerability to help resolve the incident. As is my wont, I made a snarky comment.

Root cause of success: unpatched security vulnerability https://t.co/RpWVOosvCg
— Lorin Hochstein (@norootcause) September 12, 2021

But I did want to make a more serious comment about what this story illustrates. In a narrow sense, this security vulnerability helped the New Relic engineers remediate before there was severe impact. But in a broader sense, the following aspects helped them remediate:

they had useful knowledge of some aspect of the the system (port 22 was open to the world)
they could leverage that knowledge to improvise a solution (they could use this security hole to log in and make changes to the kafka configuration)

The irony here is that it was a new employee that had the useful knowledge. Typically, it’s the tenured engineers who have this sort of knowledge, as they’ve accumulated it with experience. In this case, the engineer discovered this knowledge right before it was needed. That’s what make this such a great story!

I do think that how Nicholas found it, by “poking around”, is a behavior that comes with general experience, even though he didn’t have much experience at the company.

"I found a security hole and was in the process of figuring out how to report to security when the the network control plane was accidentally sheared, in particular leaving all our systems inaccessible by ssh." 3/8
— Eric Dobbs (@dobbse) September 12, 2021

But being in possession of useful knowledge isn’t enough. You also need to be able to recognize when the knowledge is useful and bring it to bear.

"I mentioned it to Dana and Alice very timidly: “I think I may know a way into our systems…”. I was pretty nervous because I figured security would be peeved." 7/8
— Eric Dobbs (@dobbse) September 12, 2021

These two attributes: having useful knowledge about the system and the ability to apply that knowledge to improvise a solution, are critical for being able to deal effectively with incidents. Applying these are resilience in action.

It’s not a focus of this particular story, but, in general, this sort of knowledge is distributed across individuals. This means that it’s the ad-hoc team that forms during an incident that needs to possess these attributes.