If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.
All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.
Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”
Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.
If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”
How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)
Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.
If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools that have to use?”
Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).
If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”
Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.