I was involved in an operational surprise a few weeks ago where some of my colleagues, while not directly involved in handling the incident, nudged us in directions that helped with quick remediation.
In one case, a colleague suggested moving the discussion into a different Slack channel, and in another case, a colleague focused the attention on a potential trigger: some newly inserted database records.
I also remember another operational surprise where an experienced engineer asked someone in Slack, “Hey, there’s a new person on our team, can you explain what X means?” The response kicked off a series of events that brought in someone else who had more context, which led to the surprise being remediated much more quickly.
These sorts of nudges fly under our radar, so they’re easy to miss. But they can make the difference between an operational surprise with no customer impact and a multi-hour outage, and they can be contingent on the right person happening to be in the right Slack channel at the right time, seeing the right message.
Unless we treat this sort of activity as first-class when looking at incidents, we won’t really understand why some incidents get resolved so quickly and others take much longer.