Consider the following scenario:
You’re on a team that owns and operates a service. It’s the weekend, and you’re not on call, and you’re out and about in the world, without your laptop. To alleviate boredom, you pick up your phone and look at your service dashboard, because you’re the kind of person that checks the dashboard every so often, even when you’re not on-call. You notice something unusual: the rate of blocked database transactions has increased significantly. This is the kind of signal you would look into further if you were sitting in front of a computer. Since you don’t have a computer handy, you open up Slack on your phone with the intention of pinging X, your colleague who is on-call for the day. When you enter your team channel, you notice that X is already in the process of investigating an issue with the service.
The question is: do you send X a Slack message about the anomalous signal you saw?
If you asked me this question (say, in an interview), I’d answer “it depends“. If I believed that X was already aware there were elevated blocked transactions, then I wouldn’t send them the message. On the other hand, if I thought they were not aware that blocked transactions was one of the symptoms, and I thought this was useful information to help them figure out what the underlying problem was, then I would send the message.
Now imagine you’re implementing automated alerts for your system. You need to make the same sort of decision: when do you redirect the attention (i.e., page) the on-call?
One answer is: “only page when you can provide information that needs to be acted on imminently, that the on-call doesn’t already have”. But this isn’t a very practical requirement, because the person implementing the alert is never going to have enough information to make an accurate judgment about what the on-call knows. You can use heuristics (e.g., suppress a page if a related page recently fired), but you can never know for certain whether that other page was “related” or was actually a different problem that happens to be coincident.
And therein lies the weakness of automation. For the designer implementing the automation, they will never have enough information at design time to implement the automation to make optimal decisions, because we can’t build automation today that has access to information like “what the on-call currently believes about the state of the world” or “disable the cluster if that’s what the user genuinely intended”. Even humans can’t build perfectly accurate mental models of the mental models of their colleagues, but we can do a lot better than software. For example, we can interpret Slack messages written by our colleagues to get a sense of what they believe.
Since the automation systems we build won’t ever have access to all of the inputs they need to make optimal decisions, when we are designing for cases that are ambiguous, we have to make a judgment call about what to do. We need to use heuristics such as “what is the common case” or “how can we avoid the worst case?” But if statements in our code never allow us to express “it depends on information inaccessible to the program”.