Once upon a time, whenever I was involved in responding to an incident, and a teammate ended up diagnosing the failure mode, I would kick myself afterwards. How come I couldn’t figure out what was wrong? Why hadn’t I thought to do what they had done?
However, after enough exposure to the cognitive systems engineering literature, something finally clicked in my mind. When a group of people respond to an incident, it’s never the responsibility of a single individual to remediate. It can’t be, because we each know our own corners of the system better than our teammates. Instead, it is the responsibility of the group of incident responders as a whole to resolve the incident.
The group of incident responders, that ad-hoc team that forms in the moment, is what’s referred to as a joint cognitive system. It’s the responsibility of the individual responders to coordinate effectively so that the cognitive system can solve the problem. Often that involves dynamically distributing the workload so that individuals can focus on specific tasks.
Resolving incidents is a team effort. Go team!
Here’s a little story about something that happened last year.
A paging alert fires for a service that a sibling team manages. I’m the support on-call, meaning that I answer support questions about the delivery engineering tooling. That means my only role here is to communicate with internal users about an ongoing issue. Since I don’t know this service at all, there isn’t much else for me to do: I’m just a bystander, watching the Slack messages from the sidelines.
The operations on-call acknowledges the page and starts digging to figure out what’s gone wrong. As he investigates, he posts progress updates to the on-call Slack channel. At one point, he types this message:
Anyway… we’re dead in the water until this figures itself out.
I’m… flabbergasted. He’s just going to sit there and hope that the system becomes healthy again on its own? He’s not even going to try to remediate? Much to my relief, after a few minutes, the service recovers.
Talking to him the next day, I discovered that he had taken a remediation action: he failed over a supporting service from the primary to the secondary. His comment was referring to the fact that the service was going to be down until the failover completed. Once the secondary became the new primary, things went back to normal.
When I looked back at the Slack messages, I noticed that he had written messages to communicate that he was failing over the primary. But he had also mentioned that his initial attempt at failover didn’t work, as the operational UX was misleading. What happened was that I had misinterpreted the Slack message. I thought his attempt to fail over had simply failed entirely, and he was out of ideas.
Communicating effectively over Slack during a high-tempo event like an incident is challenging. It can be especially difficult if you don’t have a prior working relationship with the people in the ad-hoc incident response team, which can happen when an incident spans multiple teams. Communicating well during an incident is a skill, both for individuals and for organizations as a whole. It’s one I think we don’t pay enough attention to.