Live-drawing my slides during a talk

The other day, I gave an internal talk, and I tried an experiment. Using my iPad and the GoodNotes app, I drew all of my slides while I was talking (except the first slide, which I drew in advance).

“What font is that?” someone asked. It’s my handwriting

I’ve always been in awe of people who can draw; I’ve never been good at it myself.

“Where’s the bug”, it says. Not my best handwriting

Over the years, I’ve tried doodling more. I was influenced by Dan Roam’s books, Julia Evans’s zines, sketchnotes, and most recently, Christina Wodtke’s Pencil Me In.

The words have stink lines, so you know they’re bad

If you’ve read my blog before, you’ve seen some of my previous doodles (e.g., Root cause of failure, root cause of success or Taming complexity: from contract to compact).

We need to complete the action items so it never happens again

When I was asked to present to a team, I wanted to use my drawings rather than traditional slides. I actually hate using tools like PowerPoint and Google Slides to do presentations. Typically I use Deckset, but in this case, I wanted the slides to be entirely hand-drawn.

A different perspective on incidents

I started off by drawing out my slides in advance. But then I thought, “instead of showing pre-drawn slides, why don’t I draw the slides as I talk? That way, people will know where to look because they’ll look at where I’m drawing.”

I still had to prepare the presentation in advance. I drew all of the slides beforehand. And then I printed them out and had them in front of me so that I could re-draw them during the talk. Since it was done over Zoom, people couldn’t actually see that I was working from the print-outs (although they might have heard the paper rustling).

Contributing factors aren’t like root cause

One benefit of this technique was that it made it easier to answer questions, because I could draw out my answer. When I was writing the text at the top, somebody asked, “Is that something like a root cause chain?” I drew the boxes and arrows in response, to explain how this isn’t chain-like, but instead is more like a web.

The selected images above should give you a sense of what my slides looked like. I had fun doing the presentation, and I’d try this approach again. It was certainly more enjoyable than futzing with slide layout.

Useful knowledge and improvisation

Eric Dobbs recently retold a story on Twitter (a copy is on his wiki) about one of his former New Relic colleagues, Nicholas Valler.

At the time, Nicholas was new to the company. He had just discovered a security vulnerability, and then (unrelated to that security vulnerability), an incident happened and, well, I encourage you to read the whole story first, and then come back to this post.

In the end, the engineers were able to leverage the security vulnerability to help resolve the incident. As is my wont, I made a snarky comment.

But I did want to make a more serious comment about what this story illustrates. In a narrow sense, this security vulnerability helped the New Relic engineers remediate before there was severe impact. But in a broader sense, the following aspects helped them remediate:

  • they had useful knowledge of some aspect of the system (port 22 was open to the world)
  • they could leverage that knowledge to improvise a solution (they could use this security hole to log in and make changes to the Kafka configuration)

The irony here is that it was a new employee who had the useful knowledge. Typically, it’s the tenured engineers who have this sort of knowledge, as they’ve accumulated it with experience. In this case, the engineer discovered this knowledge right before it was needed. That’s what makes this such a great story!

I do think that how Nicholas found it, by “poking around”, is a behavior that comes with general experience, even though he didn’t have much experience at the company.

But being in possession of useful knowledge isn’t enough. You also need to be able to recognize when the knowledge is useful and bring it to bear.

These two attributes, having useful knowledge about the system and being able to apply that knowledge to improvise a solution, are critical for dealing effectively with incidents. Applying them is resilience in action.

It’s not a focus of this particular story, but, in general, this sort of knowledge is distributed across individuals. This means that it’s the ad-hoc team that forms during an incident that needs to possess these attributes.

Inconceivable

Back in July, Ray Ashman at Mailchimp posted a wonderful writeup of an internal incident (h/t to SRE Weekly). It took the Mailchimp engineers almost two days to make sense of the failure mode.

The trigger was a change to a logging statement, in order to log an exception. During the incident, the engineers noticed that this change lined up with the time that the alerts fired. But, other than the timing, there wasn’t any evidence to suggest that the log statement change was problematic. The change didn’t have any apparent relationship to the symptoms they were seeing with the job runner, which was in a different part of the codebase. And so they assumed that the logging statement change was innocuous.

As it happened, there was a coupling between that log statement and the job runner. Unfortunately for the engineers, this coupling was effectively invisible to them. The connection between the logging statement and the job runner was Mailchimp’s log processing pipeline. Here’s an excerpt from the writeup (emphasis mine):

Our log processing pipeline does a bit of normalization to ensure that logs are formatted consistently; a quirk of this processing code meant that trying to log a PHP object that is Iterable would result in that object’s iterator methods being invoked (for example, to normalize the log format of an Array).

Normally, this is an innocuous behavior—but in our case, the harmless logging change that had shipped at the start of the incident was attempting to log PHP exception objects. Since they were occurring during job execution, these exceptions held a stacktrace that included the method the job runner uses to claim jobs for execution (“locking”)—meaning that each time one of these exceptions made it into the logs, the logging pipeline itself was invoking the job runner’s methods and locking jobs that would never be actually run! 
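To make that mechanism a little more concrete, here’s a minimal sketch of the general hazard, written in Python rather than PHP and with made-up class names (this is not Mailchimp’s actual code): a log normalizer that flattens iterable objects will invoke whatever side effects lurk in their iterator methods.

```python
class Queue:
    def __init__(self):
        self.locked = set()

    def lock(self, job_id):
        self.locked.add(job_id)


class Job:
    """Hypothetical job wrapper whose iterator has a side effect."""
    def __init__(self, job_id, queue):
        self.job_id = job_id
        self.queue = queue

    def __iter__(self):
        # Iterating "claims" (locks) the job -- a side effect the logging
        # pipeline certainly doesn't intend to trigger.
        self.queue.lock(self.job_id)
        yield ("job_id", self.job_id)


def normalize_for_logging(value):
    """Log-pipeline 'normalization': flatten iterables into lists."""
    try:
        return list(value)  # iterating the object runs its iterator methods
    except TypeError:
        return value


queue = Queue()
job = Job(42, queue)

# The pipeline only meant to format the object for output...
normalize_for_logging(job)

# ...but as a side effect, job 42 is now locked and will never run.
print(queue.locked)  # {42}
```

The real coupling at Mailchimp ran through exception stacktraces and the job runner’s locking method, but the shape of the problem is the same: formatting a value for a log can execute code.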

Fortunately, there were engineers who had experience with this failure mode before:

Since the whole company had visibility into our progress on the incident, a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.



Having identified the cause, we quickly reverted the not-so-harmless logging change, and our systems very quickly returned to normal.

In the moment, the engineers could not conceive of how modifying a log statement in an unrelated part of the code could change the behavior of the job runner. It was literally unthinkable to them.

Software Misadventures Podcast

I was recently a guest on the Software Misadventures Podcast.

Lorin Hochstein – On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more – #13 Software Misadventures

Contempt for the glue people

The clip below is from a lecture from 2008(?) that then-Google CEO Eric Schmidt gave to a Stanford class.

Here’s a transcript, emphasis mine.

When I was at Novell, I had learned that there were people who I call “glue people”. The glue people are incredibly nice people who sit at interstitial boundaries between groups, and they assist in activity. And they are very, very loyal, and people love them, and you don’t need them at all.

At Novell, I kept trying to get rid of these glue people, because they were getting in the way, because they slowed everything down. And every time I get rid of them in one group, they’d show up in another group, and they’d transfer, and get rehired and all that.

I was telling Larry [Page] and Sergey [Brin] this one day, and Larry said, “I don’t understand what your problem is. Why don’t we just review all of the hiring?” And I said, “What?” And Larry said, “Yeah, let’s just review all the hiring.” And I said, “Really?” He said, “Yes”.

So, guess what? From that moment on, we reviewed every offer packet, literally every one. And anybody who smelled or looked like a glue person, plus the people that Larry and Sergey thought had backgrounds that I liked that they didn’t, would all be fired.

I first watched this lecture years ago, but Schmidt’s expressed contempt for the nice and loyal but useless glue people just got lodged in my brain, and I’ve never forgotten it. For some reason, this tweet about Google’s various messaging services sparked my memory about it, hence this post.

Remembering the important bits when you need them

I’m working my way through the Cambridge Handbook of Expertise and Expert Performance, which is a collection of essays from academic researchers who study expertise.

Chapter 6 discusses the ability of experts to recall information that’s relevant to the task at hand. This is one of the differences between experts and novices: a novice might answer questions about a subject correctly on a test, but when faced with a real problem that requires that knowledge, they aren’t able to retrieve it.

The researchers K. Anders Ericsson and Walter Kintsch had an interesting theory about how experts do better at this than novices. The theory goes like this: when an expert encounters some new bit of information, they have the ability to encode that information into their long-term memory in association with a collection of cues of when that information would be relevant.

In other words, experts are able to predict the contexts in which that information might be relevant in the future, and they can use that contextual information as a kind of key for retrieving the information later on.

Now, think about reading an incident write-up. You might learn about a novel failure mode in some subsystem your company uses (say, a database), as well as the details that led up to it happening, including some of the weird, anomalous signals that were seen earlier on. If you have expertise in operations, you’ll encode information about the failure mode into your long term memory and associate it with the symptoms. So, the next time you see those symptoms in production, you’ll remember this failure mode.

This will only work if the incident write-up has enough detail to provide the cues you’ll need to encode in your memory. That’s another reason to provide a rich description of the incident: the people reading it, if they’re good at operations, will encode the details of the failure mode into their memory, and if that failure mode shows up again, they’ll remember it.

The power of framing a problem

I’m enjoying Marianne Bellotti’s book Kill It With Fire, which is a kind of guidebook for software modernization projects (think: migrating legacy systems). In Chapter Five, she talks about the importance of momentum for success, and how a crisis can be a valuable tool for creating a sense of urgency. This is the passage that really resonated with me (emphasis in the original):

Occasionally, I went as far as looking for a crisis to draw attention to. This usually didn’t require too much effort. Any system more than five years old will have at least a couple major things wrong with it. It didn’t mean lying, and it didn’t mean injecting problems where they didn’t exist. Instead, it was a matter of storytelling—taking something that was unreported and highlighting its potential risks. These problems were problems, and my analysis of their potential impact was always truthful, but some of them could have easily stayed buried for months or years without triggering a single incident.

Kill It With Fire, p88

This is a great explanation of how describing a problem is a form of power in an organization. Bellotti demonstrates how, by telling a story, she was able to make problems real for an organization, even to the point of creating a crisis. And a crisis receives attention and resources. Crises get resolved.

It’s also a great example of the importance of storytelling in technical organizations. Tell a good story, and you can make things happen. It’s a skill that’s worth getting better at.

The local nature of culture

I’m really enjoying Turn the Ship Around!, a book by David Marquet about his experiences as commander of a nuclear submarine, the USS Santa Fe, and how he worked to improve its operational performance.

One of the changes that Marquet introduced is something he calls “thinking out loud”, where he encourages crew members to speak aloud their thoughts about things like intentions, expectations, and concerns. He notes that this approach contradicted naval best practices:

As naval officers, we stress formal communications and even have a book, the Interior Communications Manual, that specifies exactly how equipment, watch stations, and evolutions are spoken, written, and abbreviated …

This adherence to formal communications unfortunately crowds out the less formal but highly important contextual information needed for peak team performance. Words like “I think…” or “I am assuming…” or “It is likely…” that are not specific and concise orders get written up by inspection teams as examples of informal communications, a big no-no. But that is just the communication we need to make leader-leader work.

Turn the Ship Around! p103

This change did improve the ship operations, and this improvement was recognized by the Navy. Despite that, Marquet still got pushback for violating norms.

[E]ven though Santa Fe was performing at the top of the fleet, officers steeped in the leader-follower mind-set would criticize what they viewed as the informal communications on Santa Fe. If you limit all discussion to crisp orders and eliminate all contextual discussion, you get a pretty quiet control room. That was viewed as good. We cultivated the opposite approach and encouraged a constant buzz of discussions among the watch officers and crew. By monitoring that level of buzz, more than the actual content, I got a good gauge of how well the ship was running and whether everyone was sharing information.

Turn the Ship Around! p103

Reading this reminded me how local culture can be. I shouldn’t be surprised, though. At Netflix, I’ve worked on three teams (and six managers!), and each team had a very different local culture, despite all of them being in the same organization, Platform Engineering.

I used to wonder, “how does a large company like Google write software?” But I no longer think that’s a meaningful question. It’s not Google as an organization that writes software, it’s individual teams that do. The company provides the context that the teams work in, and the teams are constrained by various aspects of the organization, including the history of the technology they work on. But, there’s enormous cultural variation from one team to the next. And, as Marquet illustrates, you can change your local culture, even cutting against organizational “best practices”.

So, instead of asking, “what is it like to work at company X”, the question you really want answered is, “what is it like to work on team Y at company X?”

What do you work on, anyway?

I often struggle to describe the project that I work on at my day job, even though it’s an open-source project that even has its own domain name: managed.delivery. I’ll often mumble something like, “it’s a declarative deployment system”. But that explanation does not yield much insight.

I’m going to use Kubernetes as an analogy to explain my understanding of Managed Delivery. This is dangerous, because I’m not a Kubernetes user(!). But if I didn’t want to live dangerously, I wouldn’t blog.

With Kubernetes, you describe the desired state of your resources declaratively, and then the system takes action to bring the current state of the system to the desired state. In particular, when you use Kubernetes to launch a pod of containers, you need to specify the container image name and version to be deployed as part of the desired state.

When a developer pushes new code out, they need to change the desired state of a resource, specifically, the container image version. This means that a deployment system needs some mechanism for changing the desired state.
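Here’s a minimal sketch of that reconciliation idea. The field names and the reconcile function are made up for illustration; this isn’t the Kubernetes API, just the shape of the declarative model.

```python
# Made-up desired/current state for a hypothetical "checkout" service;
# not the Kubernetes API, just an illustration of the declarative model.
desired = {"image": "checkout:v24", "replicas": 3}
current = {"image": "checkout:v23", "replicas": 3}


def reconcile(current, desired):
    """Figure out what actions would move the current state toward the desired state."""
    actions = []
    if current["image"] != desired["image"]:
        actions.append(f"roll out image {desired['image']}")
    if current["replicas"] != desired["replicas"]:
        actions.append(f"scale to {desired['replicas']} replicas")
    return actions


# "Deploying" is just editing the desired state and letting the system converge.
print(reconcile(current, desired))  # ['roll out image checkout:v24']
```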

A common pattern we see is that service owners have a notion of environments (e.g., test, staging, prod). For example, they’ll deploy the code to test and maybe run some automated tests against it; if it looks good, they’ll promote to staging and maybe do some manual testing; and if they’re happy, they’ll promote out to prod.

Example of deployment environments

Imagine test, staging, and prod all have version v23 of the code running in it. After version v24 is cut, it will first be deployed in test, then staging, then prod. That’s how each version will propagate through these environments, assuming it meets the promotion constraints for each environment (e.g., tests pass, human makes a judgment).

You can think of this kind of promoting-code-versions-through-environments as a pattern for describing how the desired states of the environments change over time. And you can describe this pattern declaratively, rather than imperatively like you would with traditional pipelines.

And that’s what Managed Delivery is. It’s a way of declaratively describing how the desired state of the resources should evolve over time. To use a calculus analogy, you can think of Managed Delivery as representing the time-derivative of the desired state function.

If you think of Kubernetes as a system for specifying desired state, Managed Delivery is a system for specifying how desired state evolves over time

With Managed Delivery, you can express concepts like:

  • for a code version to be promoted to the staging environment, it must
    • be successfully deployed to the test environment
    • pass a suite of end-to-end automated tests specified by the app owner

and then Managed Delivery uses these environment promotion specifications to shepherd the code through the environments.

And that’s it. Managed Delivery is a system that lets users describe how the desired state changes over time, by letting them specify environments and the rules for promoting changes from one environment to the next.
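To make that concrete, here’s a hypothetical sketch of what an environment-promotion spec and its constraint check might look like. The structure and field names are invented for illustration; this is not Managed Delivery’s actual configuration format.

```python
# A hypothetical, simplified environment-promotion spec
# (made-up structure, not Managed Delivery's actual format).
environments = {
    "test":    {"constraints": []},
    "staging": {"constraints": ["deployed-to-test", "e2e-tests-pass"]},
    "prod":    {"constraints": ["deployed-to-staging", "manual-approval"]},
}


def can_promote(version_checks, env):
    """A version may be promoted to `env` once all of that environment's
    constraints have been satisfied for it."""
    return all(c in version_checks for c in environments[env]["constraints"])


# v24 has been deployed to test and its end-to-end tests passed,
# so it's eligible for staging but not yet for prod.
checks_for_v24 = {"deployed-to-test", "e2e-tests-pass"}
print(can_promote(checks_for_v24, "staging"))  # True
print(can_promote(checks_for_v24, "prod"))     # False
```

The point isn’t the particular check function; it’s that the promotion rules live in a declarative description rather than in an imperative pipeline script.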

Modernists trapped in a post-modern universe

There’s a wonderful book by the political philosopher Marshall Berman called All That is Solid Melts Into Air. The subtitle of the book is the experience of modernity, and, indeed, the book tries to capture the feeling of what it is like to live in the modern period, as illustrated through the writings of famous modernist authors, both fiction and non.

Berman demonstrates how the modern era, particularly in the late 19th and early 20th century, was a time period of great ferment. The world was seen as turbulent, dynamic. People believed that the world we lived in was not a fixed entity, but that it could be reshaped, remade entirely. The title of the book is a quote from Karl Marx, who was alluding to the idea that all of the structures we see in the world are ephemeral.

In contrast, in the post-modernist view of the world that came later, we can never cast off our history and start from scratch. Every complex system has a history, and that history continues to constrain the behavior of the system, even though it undergoes change.

We software engineers are modernists at heart. We see the legacy systems in our organizations and believe that, when we have the opportunity to work on replacement systems, we will remake our little corner of the world anew. Alas, on this point, the post-modernists were right. While we can change our systems, even replace subsystems wholesale, we can never fully escape the past. We ignore the history of the system at our peril.