September 2020 – Surfing Complexity

You shouldn’t write up your own incident if you can avoid it. To write up an incident well, you need to be able to capture the perspectives of the different people who were involved. If the write-up author was also one of the responders, then the writeup will be biased towards their perspective, at the expense of capturing the perspectives of the other engineers who were engaged.

Unfortunately, most organizations haven’t committed the resources to support doing independent incident investigations. I happen to privileged enough to work at a company that has hired specialists who are skilled at doing independent incident investigations (J. Paul Reed and Jessica DeVita), Once upon a time (last year, to be precise), I was one of those independent incident investigators, before I transitioned back to being a software engineer.

However, even at my employer, we don’t have the resources to do an independent investigation for every single operational surprise that happens, and so the common case is still that a team has to investigate its own operational surprises.

Recently, I was one of the responders to one of these operational surprises. And, since I’m an advocate of teams putting in the effort to write up their operational surprises and share them with the org, I committed to doing that for my team.

During the operational surprise, we identified that certain database rows weren’t being updated, but we struggled to identify why they weren’t being updated. In the moment, We suspected the problem was somehow related to this function, which is responsible for updating those database rows. We believed (correctly, in hindsight) that the function was being called, because a log statement immediately preceding that function invocation appeared in the logs. But, somehow, the database updates weren’t taking effect.

In the moment, I was looking into whether there was something about the database itself that was preventing writes: perhaps some sort of database lock that was blocking updates? To investigate that, I manually translated the code from jOOQ library calls to raw SQL so I could run the queries directly against the database and see what happened.

In the end, it turned out that the problem was not related to the database itself, but to Kotlin code inside that function that was throwing an exception. It was erroring because the code made certain assumptions about the format of version strings, and those assumptions had become invalid over time. When this code hit a version string it couldn’t process, it threw an exception and triggered a transaction rollback.

After we remediated, when I looked back on the events of the day, I thought “Boy, I sure did waste a big chunk of time manually translating that code to SQL, when the problem wasn’t related to the database at all.“

Later on, when I put my incident investigator hat on and pored over the Slack messages, I discovered something. While I was working to understand the code to translate it, I discovered that one of the queries in that function was too broad. Under normal circumstances, the broadness of the query wasn’t impacting the correctness of the function (the query after it was narrower) or the performance, but during the operational surprise it was increasing the blast radius of the issue. Narrowing the scope of that query was an important part of remediating the incident.

The thing is, until I was investigating the incident, I didn’t realize that I had learned about the broad query issue because I was working to translate the code into SQL. That work I did had real value: it helped us resolve the issue.

Ever since I’ve been bitten by the learning from incidents bug, I’ve been a believer in the value of using an independent investigator. But this is the first time I really had this first-hand experience of I learned something new about my own work in resolving the incident, even though I was there, because of the post-incident investigation work. It was quite a visceral realization.

And so, while you really should take advantage of independent investigators if resources permit, if you’ve worked as an independent investigator and then transition to a role which includes incident response, I recommend trying to write up one of your own incidents, at least once. It really reinforces how much more can be learned from an incident by doing a good investigation.

Month: September 2020

Why you should write up your own incident