(This post was inspired by a conversation I had with a colleague).
On the evening before the launch of the Challenger Space Shuttle, representatives from NASA and the engineering contractor Thiokol held a telecon where they were concerned about the low overnight temperatures at the launch site. The NASA and Thiokol employees discussed whether to proceed with the launch or cancel it. On the call, there’s an infamous exchange between two Thiokol executives:
It’s time to take off your engineering hat and put on your management hat.Senior Vice President Jerry Mason to Vice President of Engineering Robert Lund
The quote implies a conflict between the prudence of engineering and management’s reckless indifference to risk. The story is more complex than this quote suggests, as the sociologist Diane Vaughan discovered in her study of NASA’s culture. Here’s a teaser of her research results:
Contradicting conventional understandings, we find that (1) in every [Flight Readiness Review], Thiokol engineers brought forward recommendations to accept risk and fly and (2) rather than amoral calculation and misconduct, it was a preoccupation with rules, norms, and conformity that governed all facets of controversial managerial decisions at Marshall during this period.Diane Vaughan, The Challenger Launch Decision
But this blog post isn’t about the Challenger, or the contrasts between engineering and management. It’s about the times when we need to change hats.
I’m a fan of the you-build-it-you-run-it approach to software services, where the software engineers are responsible for operating the software they write. But solving ops problems isn’t like solving dev problems: the tempo and the skills involved are different, and they require different mindsets.
This difference is particularly acute for a software engineer when a change that they made contributed to an ongoing incident. Incidents are high pressure situations, and even someone in the best frame of mind can struggle with the challenges of making risky decisions under uncertainty. But if you’re thinking, “Argh, the service is down, and it’s all my fault!“, then your effectiveness is going to suffer. Your head’s not going to be in the right place.
And yet, these moments are exactly when it’s most important to be able to make the context switch between dev work and ops work. If someone took an action that triggered an outage, chances are good that they’re the person on the team who is best equipped to remediate, because they have the most context about the change.
Being the one who pushed the change that takes down the service sucks. But when we are most inclined to spend mental effort blaming ourselves is exactly when we can least afford to. Instead, we have to take off the dev hat, put on the ops hat, and do we can to get our head in the game. Because blaming ourselves in the moment isn’t going to make it any easier to get that service back up.