“What could we have done differently?”

During incident retrospective meetings, I’ve often heard someone ask: “What could we have done differently?” I don’t like this question, and so I never ask it.

A world that never was

I am a firm believer in the idea that the best way to get better at dealing with incidents is to understand how incidents actually happen. After an incident happens, I focus all of my energies on the understanding aspect, because the window of opportunity for studying the incident closes quickly.

Asking “what could we have done differently?” can’t teach us anything about how the incident happened, because it’s asking us to imagine an alternate reality where events unfolded differently. You can’t get a better understanding of why an incident responder took action X by imagining a world where the responder took action Y.

Instead of asking how it could have unfolded differently, you’ll learn a lot more about the incident if you try to understand the frame of mind of the incident responders. What did they see? What did they know at the time? What was confusing to them?

The future, not the past

I believe the question is well-intended: it is meant to help us prevent the incident from recurring. In that case, I think a better question would be something along the lines of: “If we encounter similar symptoms in a future incident, what actions should we take?” This sounds like the same question, but it’s not:

“If we encounter similar symptoms” introduces uncertainty into the exercise – the future incident may look like the last one, but it might be a different problem that presents with the same symptoms! When we ask about doing things differently in the past, it’s all too easy to forget about this uncertainty.

Uncertainty is one of the defining characteristics of an incident. The system is behaving in an unexpected way, and we don’t understand why! When we look back on an incident, we should focus on this uncertainty rather than elide it.

Another reason that imagining future scenarios is better than constructing counterfactuals about past scenarios is that our system in the future is different from the one in the past. For example:

  • You may have made changes to the system in the wake of the last incident that prevent the incident from recurring in exactly the same way as before, so the question turns out to be moot.
  • You may have improved the operability of your system in some way (e.g., added an admin interface so you can make an API call instead of poking at the database), so that you have new actions you can take in the future that you couldn’t take in the past.

While I still probably wouldn’t ask this question (I want to spend all of my energy understanding the incident), I think it’s a much better question, because it gives us practice at anticipating future incidents.
