At the time, Nicholas was new to the company. He had just discovered a security vulnerability, and then (unrelated to that security vulnerability), an incident happened and, well, I encourage you to read the whole story first, and then come back to this post.
In the end, the engineers were able to leverage the security vulnerability to help resolve the incident. As is my wont, I made a snarky comment.
But I did want to make a more serious comment about what this story illustrates. In a narrow sense, this security vulnerability helped the New Relic engineers remediate before there was severe impact. But in a broader sense, the following aspects helped them remediate:
- they had useful knowledge of some aspect of the the system (port 22 was open to the world)
- they could leverage that knowledge to improvise a solution (they could use this security hole to log in and make changes to the kafka configuration)
The irony here is that it was a new employee that had the useful knowledge. Typically, it’s the tenured engineers who have this sort of knowledge, as they’ve accumulated it with experience. In this case, the engineer discovered this knowledge right before it was needed. That’s what make this such a great story!
I do think that how Nicholas found it, by “poking around”, is a behavior that comes with general experience, even though he didn’t have much experience at the company.
But being in possession of useful knowledge isn’t enough. You also need to be able to recognize when the knowledge is useful and bring it to bear.
These two attributes: having useful knowledge about the system and the ability to apply that knowledge to improvise a solution, are critical for being able to deal effectively with incidents. Applying these are resilience in action.
It’s not a focus of this particular story, but, in general, this sort of knowledge is distributed across individuals. This means that it’s the ad-hoc team that forms during an incident that needs to possess these attributes.