You’re just going to sit there???

Here’s a little story about something that happened last year.

A paging alert fires for a service that a sibling team manages. I’m the support on-call, meaning that I answer support questions about the delivery engineering tooling. That means my only role here is to communicate with internal users about an ongoing issue. Since I don’t know this service at all, there isn’t much else for me to do: I’m just a bystander, watching the Slack messages from the sidelines.

The operations on-call acknowledges the page and starts digging to figure out what’s gone wrong. As he investigates, he posts updates about his progress to the on-call Slack channel. At one point, he types this message:

Anyway… we’re dead in the water until this figures itself out.

I’m… flabbergasted. He’s just going to sit there and hope that the system becomes healthy again on its own? He’s not even going to try and remediate? Much to my relief, after a few minutes, the service recovers.

Talking to him the next day, I discovered that he had taken a remediation action: he failed over a supporting service from the primary to the secondary. His comment was referring to the fact that the service was going to be down until the failover completed. Once the secondary became the new primary, things went back to normal.

When I looked back at the Slack messages, I noticed that he had written messages to communicate that he was failing over the primary. But he had also mentioned that his initial attempt at failover didn’t work, as the operational UX was misleading. What happened was that I had misinterpreted the Slack message. I thought his attempt to fail over had simply failed entirely, and he was out of ideas.

Communicating effectively over Slack during a high-tempo event like an incident is challenging. It can be especially difficult if you don’t have a prior working relationship with the people in the ad-hoc incident response team, which can happen when an incident spans multiple teams. Communicating well during an incident is a skill, both for individuals and for organizations as a whole. It’s one I think we don’t pay enough attention to.
