
Modern software systems contain within them a mind-boggling level of complexity. As software engineers, we make this complexity manageable through techniques like decomposition, information hiding, and abstraction. We endeavor to break our systems up into components that interact over well-defined interfaces. By doing this, the surface exposed to individual software engineers is dramatically reduced: no individual has to understand how the entire complex system works in order to contribute to their system. Instead, each software engineer needs to understand only the individual component that they work on, along with the interfaces of the other components that they interact with. Decomposition is synonymous with analysis, where you study a larger thing by breaking it up into smaller pieces that are more amenable to understanding.
You can see this strategy of complexity management in action in microservice architectures. An engineer needs to understand the service that their team owns, and the interfaces of the services that their team calls out to. This architecture bounds the information that an engineer needs in order to work effectively. Microservice architectures aren’t there for scaling the software itself; they’re there for scaling the software organization.
Unfortunately, when the system breaks down, this complexity management strategy itself breaks down. Just as hurricanes don’t respect political boundaries, system failures don’t respect component boundaries. Yes, sometimes the problem in a software system is limited to the failure of a single component. Those are the easiest cases to diagnose and mitigate. However, the hairy incidents are the ones that arise due to unexpected interactions across components. Maybe you have several services that are throwing errors, or maybe none of the services are throwing errors but customers are still seeing incorrect behavior. There’s no obvious change that correlates with the start of impact, or maybe you don’t even know when the impact started because the customer impact isn’t reflected in your existing metrics.
When you’re in the throes of an incident that involves an unexpected interaction, this architecture that was built for managing complexity now works against you, because you’ve built an analysis solution but you’re now faced with a synthesis problem. You need to understand how the pieces all normally fit together to function in order to determine what is going wrong with the system right now. You’ve optimized to avoid requiring anybody to understand how the whole thing works, but now the whole thing isn’t working, and no one person knows how the whole thing works.
The job of the incident responders is to collectively figure out how to do that synthesis. You’ve brought together a group of people who each understand the functions of different components of the system, and you need to work together to build enough of an understanding of how the system functions to debug what’s going wrong. As an ad hoc team, the incident responders have to move up and down the abstraction hierarchy to figure this out.
This sort of in-the-moment reconstruction of system function from component parts is an essential part of incident response for the most complex incidents, but it’s rarely treated as first-class work that’s worthy of study and support. The recent book Crisis Engineering by Marina Nitze, Matthew Weaver, and Mikey Dickerson is the exception that proves the rule: they do discuss the work of building a model of the system during a crisis to help figure out what’s gone wrong. But I struggle to recall any other guidance I’ve read about incident response that talks about how to prepare for doing this sort of work. It’s important work, and it’s difficult, and the ability to do it well can have a huge impact on the time it takes to mitigate the hardest incidents. This is stuff that even the best individual humans struggle with, because it involves a group of humans working together effectively, with each person having a partial model of the system. And if the best humans struggle with it, I don’t think AI SRE tools are going to save us here: the AIs will struggle too. We need to figure out how to get better at this collectively. Like so many things, it’s a coordination problem.