Chasing down the blipperdoodles

To a first approximation, there are two classes of automated alerts:

  1. A human needs to look at this as soon as possible (page the on-call!)
  2. A human should eventually investigate, but it isn’t urgent (email-only alert)

This post is about the second category. These are events like the error spikes that happened at 2am that can wait until business hours to look into.

When I was on the CORE1 team, one of the responsibilities of team members was to investigate these non-urgent alert emails. The team colorfully referred to them as blipperdoodles2, presumably because they look like blips on the dashboard.

I didn’t enjoy this part of the work. Blipperdoodles can be a pain to track down, are often not actionable (e.g., networking transient), and, in the tougher cases, are downright impossible to make sense of. This means that the work feels largely unsatisfying. As a software engineer, I’ve felt a powerful instinct to dismiss transient errors, often with a joke about cosmic rays.

But I’ve really come around on the value of chasing down blipperdoodles. Looking back, they gave me an opportunity to practice doing diagnostic work, in a low-stakes environment. There’s little pressure on you when you’re doing this work, and if something more urgent comes up, the norms of the team allow you to abandon your investigation. After all, it’s just a blipperdoodle.

Blipperdoodles also tend to be a good mix of simple and difficult. Some of them are common enough that experienced engineers can diagnose them by the shape of the graphs. Others are so hard that an engineer has to admit defeat once they reach their self-imposed timebox for the investigation. Most are in between.

Chasing blipperdoodles is a form of operational training. And while it may be frustrating to spend your time tracking down anomalies, you’ll appreciate the skills you’ve developed when the heat is on, which is what happens when everything is on fire.

1 CORE stands for Critical Operations & Reliability Engineering. They’re the centralized incident management team at Netflix.

2I believe Brian Trump coined the term.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s