Unfortunately, the first few minutes were lost due to technical issues. You’ll just have to take my word for it that the missing part of my talk was truly astounding, a veritable tour de force.
Starting after World War II, the idea was culture is accelerating. Like the idea of an accelerated culture was just central to everything. I feel like I wrote about this in the nineties as a journalist constantly. And the internet seemed like, this is gonna be the ultimate accelerant of this. Like, nothing is going to accelerate the acceleration of culture like this mode of communication. Then when it became ubiquitous, it sort of stopped everything, or made it so difficult to get beyond the present moment in a creative way.
We software developers are infamous for our documentation deficiencies: the eternal lament is that we never write enough stuff down. If you join a new team, you will inevitably discover that, even if some important information is written down, there’s also a lot of important information that is tacit knowledge of the team, passed down as what’s sometimes referred to as tribal lore.
But writing things down has a cost beyond the time and effort required to do the writing: written documents are durable, which means that they’re harder to change. This durability is a strength of documentation, but it’s also a weakness. Writing things down tends to ossify the content, because documents are much more expensive to update than tacit knowledge. Tacit knowledge is much more fluid: it adapts to changing circumstances much more quickly and easily than documentation does, as anybody who has dealt with out-of-date written procedures can attest.
Back when I was an engineering student, I wanted to know “How do the big companies develop software? How does it happen in the real world?”
Now that I work at a company that has to do large-scale software development, I understand better why it’s not something you can really teach effectively in a university setting. It’s not that companies doing large-scale software development are somehow better at writing software than companies that work on smaller-scale software projects. It’s that large-scale projects face challenges that small-scale projects don’t.
The biggest challenge at large scale is coordination. My employer provides a single service, which means that, in theory, any project that anyone is working on inside the company could potentially impact what anybody else is working on. In my specific case, I work on delivery tools, so we might be called upon to support some new delivery workflow.
You can take a top-down, command-and-control style approach to the problem, where the people at the top attempt to filter all of the information down to just what they need and then coordinate everyone hierarchically. However, this structure isn’t effective in dynamic environments: as the facts on the ground change, it takes too long for information to work its way up the hierarchy, for plans to adapt, and for new orders to flow back down.
You can take a bottom-up approach to the problem, where you have a collection of teams that work autonomously. But the challenge there is getting them aligned. In theory, you hire people with good judgment and provide them with the right context. The problem is that there’s too much context! You can’t just firehose all of the available information at everyone; that doesn’t scale: everyone will spend all of their time reading docs. The grand challenge in this context becomes: how do you get the information into the heads of the people who need it?
It’s hard to convey the nature of this problem in a university classroom if you’ve never worked in a setting like this before. The flurry of memos and planning documents, the misunderstandings, the sync meetings, the work towards alignment, the “One X” initiatives: these are all things that I had to experience viscerally, first-hand, to really get a sense of the nature of the problem.
Chris Pruett – On deciding to leave LinkedIn and co-founding Jam, values based decision making and compassionate leadership – #19 –
Software Misadventures
Chris Pruett is the CTO and Co-founder of Jam – a new way to share and listen to bite-sized audio. Prior to Jam, Chris spent 9+ years at LinkedIn, growing from an engineering manager to VP of Engineering. During his tenure at LinkedIn, he worked on almost all aspects of the app and, towards the end, led an org of 500+ engineers working on Feed, Messaging, Identity, and Search. In this episode, we discuss how he made the decision to leave his leadership position at LinkedIn and co-found Jam. We also spoke about his time at LinkedIn and how he developed the practice of making values-based decisions in both his professional and personal life.
Alex Kessinger (Stitch Fix) and David Noël-Romas (Stripe) –
StaffEng
This episode is a celebration of the journey we have been on as this podcast comes to a close. We have had such a great time bringing you these interviews and we are excited about a new chapter, taking the lessons we have learned forward into different spaces. It's been a lot of work putting this show together, but it has also been such a pleasure doing it. And, as we all know, nothing good lasts forever! So to close the circle in a sense, we decided to host a conversation between the two of us where we interview each other as we have with our guests in the past, talking about mentorship, resources, coding as a leader, and much more! We also get into some of our thoughts on continuous delivery, prioritizing work, our backgrounds in engineering, and how to handle disagreements. As we enter new phases in our lives, we want to thank everyone for tuning in and supporting us and we hope to reconnect with you all in the future!
Links: David Noël-Romas on Twitter, Alex Kessinger on Twitter, Stitch Fix, Stripe, JavaScript: The Good Parts, Douglas Crockford, Monkeybrains, Kill It With Fire, Trillion Dollar Coach, Martha Acosta, Etsy Debriefing Facilitation Guide, High Output Management, How to Win Friends & Influence People, Influence
Jens Rasmussen was a giant in the field of safety science research. You can still see his influence on the field in the writings of safety researchers such as Sidney Dekker, Nancy Leveson, and David Woods.
Reproduction of Fig. 3. The original caption reads: Under the presence of strong gradients behaviour will very likely migrate toward the boundary of acceptable performance
This model looks like it views the state of the system as a point in a state space. But Rasmussen described it as a model of the humans working within the system, and he used the term “work space” rather than “state space”. In addition, Rasmussen used the metaphor of a gas particle undergoing local random movements, a phenomenon also known as Brownian motion.
Along with the random movements, Rasmussen envisioned different forces (he called them gradients) that influence how the work system moves within the work space. One of these forces is pressure from management to get more work done in order to make the company more profitable. Woods refers to this phenomenon as “faster/better/cheaper pressure”. This is the arrow labeled Management Pressure toward Efficiency, which pushes away from the Boundary to Economic Failure.
One way to get more work done is to give people increasing loads of work. But people don’t like having more and more work piled on them, and so there is opposing pressure from the workforce to reduce the amount of work they have to do. This is the arrow labeled Gradient toward Least Effort which pushes away from the Boundary to Unacceptable Work Load.
The result of those two pressures is movement towards what the diagram labels “the Boundary of functionally acceptable performance”. This is the safety boundary, and we don’t know exactly where it is, which is why there’s a second boundary in the diagram labelled “Resulting perceived boundary of acceptable performance.” Accidents happen when we cross the safety boundary.
Boundary according to Woods
David Woods also writes about the role of boundaries in system safety, but despite this surface similarity, his model isn’t the same as Rasmussen’s.
Instead of a work space, Woods refers to an envelope. He uses terms like competence envelope or design envelope or envelope of performance. Woods has done safety research in aviation, and so I suspect he was influenced by the concept of a flight envelope in aircraft design.
Diagram captioned Altitude envelope from the Wikipedia flight envelope page
The flight envelope defines a region in a state space that the aircraft is designed to function properly within. You can see in the diagram above that the envelope’s boundaries are defined by the stall speed, top speed, and maximum altitude. Bad things happen if you try to operate an aircraft outside of the envelope (hence the phrase pushing the envelope).
Woods’s competence envelope is a generalization of the idea of flight envelope to other types of systems. Any system has a range of inputs that it can handle: if you go outside that range, bad things happen.
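To make the envelope idea concrete, here is a minimal sketch of an envelope check; the structure and the numbers are invented for illustration and aren’t taken from Woods or from the Wikipedia diagram. The point is simply that the system is only “inside” the envelope while every boundary is satisfied at once, and crossing any single boundary puts you outside it.

```python
# Toy flight-envelope check. The bounds are invented for illustration; a real
# envelope varies with weight, air density, aircraft configuration, and so on.
from dataclasses import dataclass


@dataclass
class Envelope:
    stall_speed_kts: float   # flying slower than this risks a stall
    max_speed_kts: float     # flying faster than this risks structural limits
    max_altitude_ft: float   # above this, lift/thrust are insufficient

    def contains(self, speed_kts: float, altitude_ft: float) -> bool:
        """True only if the operating point is inside every boundary at once."""
        return (self.stall_speed_kts <= speed_kts <= self.max_speed_kts
                and altitude_ft <= self.max_altitude_ft)


envelope = Envelope(stall_speed_kts=120, max_speed_kts=470, max_altitude_ft=41_000)
print(envelope.contains(speed_kts=300, altitude_ft=35_000))  # True: inside the envelope
print(envelope.contains(speed_kts=100, altitude_ft=35_000))  # False: below stall speed
```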
Summarizing the differences
To Rasmussen, there is only one boundary in the work space related to accidents: the safety boundary. The other boundaries in the space generally aren’t even reachable, because of the natural pressure away from them. To Woods, the competence envelope is defined by multiple boundaries, and crossing any of them can result in an accident.
Both Rasmussen and Woods identified the role of faster/better/cheaper pressure in accidents. To Rasmussen, this pressure resulted in pushing the system towards the safety boundary. But to Woods, this pressure changes the behavior at the boundary. Woods sees this pressure as contributing to brittleness: systems that don’t perform well as they get close to the boundary of the performance envelope. Woods’s current work focuses on how systems can avoid being brittle by having the ability to move the boundary as they get closer to it: expanding the competence envelope. He calls this graceful extensibility.
Here’s a question that all of us software developers face: How can we best use our knowledge about the past behavior of our system to figure out where we should be investing our time?
One approach is to use a technique from the SRE world called error budgets. Here are a few quotes from the “How to Use Error Budgets” chapter of Alex Hidalgo’s book, Implementing Service Level Objectives:
Measuring error budgets over time can give you great insight into the risk factors that impact your service, both in terms of frequency and severity. By knowing what kinds of events and failures are bad enough to burn your error budget, even if just momentarily, you can better discover what factors cause you the most problems over time. p71 [emphasis mine]
The basic idea is straightforward. If you have error budget remaining, ship new features and push to production as often as you’d like; once you run out of it, stop pushing feature changes and focus on reliability instead. p87
Error budgets give you ways to make decisions about your service, be it a single microservice or your company’s entire customer-facing product. They also give you indicators that tell you when you can ship features, what your focus should be, when you can experiment, and what your biggest risk factors are. p92
The goal is not to only react when your users are extremely unhappy with you—it’s to have better data to discuss where work regarding your service should be moving next. p354
That sounds reasonable, doesn’t it? Look at what’s causing your system to break, and if it’s breaking too often, use that as a signal to address those issues that are breaking it. If you’ve been doing really well reliability-wise, an error budget gives you margin to do some riskier experimentation in production like chaos engineering or production load testing.
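To make the mechanics concrete, here’s a minimal sketch of the error-budget arithmetic; the SLO target, window size, and request counts are made-up numbers, not anything from Hidalgo’s book.

```python
# Minimal error-budget arithmetic, with made-up numbers.
SLO_TARGET = 0.999            # 99.9% of requests in the window should succeed
WINDOW_REQUESTS = 10_000_000  # total requests observed in the SLO window
FAILED_REQUESTS = 6_200       # failed requests observed in the same window

# The error budget is the number of failures the SLO allows in the window.
allowed_failures = (1 - SLO_TARGET) * WINDOW_REQUESTS   # 10,000 failures allowed
budget_consumed = FAILED_REQUESTS / allowed_failures    # 0.62 -> 62% of budget burned

if budget_consumed < 1.0:
    print(f"{budget_consumed:.0%} of the error budget used: keep shipping features")
else:
    print(f"{budget_consumed:.0%} of the error budget used: focus on reliability work")
```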
I have two issues with this approach, a smaller one and a larger one. I’ll start with the smaller one.
First, I think that if you work on a team where the developers operate their own code (you-build-it, you-run-it), and where the developers have enough autonomy to say, “We need to focus more development effort on increasing robustness”, then you don’t need the error budget approach to help you decide when and where to spend your engineering effort. The engineers will know where the recurring problems are because they feel the operational pain, and they will be able to advocate for addressing those pain points. This is the kind of environment that I am fortunate enough to work in.
I understand that there are environments where the developers and the operators are separate populations, or the developers aren’t granted enough autonomy to be able to influence where engineering time is spent, and that in those environments, an error budget approach would help. But I don’t swim in those waters, so I won’t say any more about those contexts.
To explain my second concern, I need to digress a little bit to talk about Herbert Heinrich.
Herbert Heinrich worked for the Travelers Insurance Company in the first half of the twentieth century. In the 1920s, he did a study of workplace accidents, examining thousands of claims made by companies that held insurance policies with Travelers. In 1931, he published his findings in a book: Industrial Accident Prevention: A Scientific Approach.
Heinrich’s work showed a relationship between the rates of near misses (no injury), minor injuries, and major injuries. Specifically: for every major injury, there are 29 minor injuries, and 300 no-injury accidents. This finding of 1:29:300 became known as the accident triangle.
One implication of the accident triangle is that the rate of minor issues gives us insight into the rate of major issues. In particular, if we reduce the rate of minor issues, we reduce the risk of major ones. Or, as Heinrich put it: Moral—prevent the accidents and the injuries will take care of themselves.
Heinrich’s work has since been criticized, and subsequent research has contradicted his findings. I won’t repeat the criticisms here (see Foundations of Safety Science by Sidney Dekker for details, along with the counterexamples that Dekker cites).
So, what does any of this have to do with error budgets? At a glance, error budgets don’t seem related to Heinrich’s work at all. Heinrich was focused on safety, where the goal is to reduce injuries as much as possible, in some cases explicitly having a zero goal. Error budgets are explicitly not about achieving zero downtime (100% reliability), they’re about achieving a target that’s below 100%.
Here are the claims I’m going to make:
Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
Error budgets don’t help reduce risk of large incidents.
Here’s Heinrich’s triangle redrawn:
An error-budget-based approach only provides information on the nature of minor incidents, because those are the ones that happen most often. Near misses don’t impact the reliability metrics, and major incidents blow them out of the water.
Heinrich’s work assumed a fixed ratio between minor accidents and major ones: reduce the rate of minor accidents and you’d reduce the rate of major ones. By focusing on reliability metrics as a primary signal for providing insight into system risk, you only get information about these minor incidents. But, if there’s no relationship between minor incidents and major ones, then maintaining a specific reliability level doesn’t address the issues around major incidents at all.
An error-budget-based approach to reliability implicitly assumes there is a connection between reliability metrics and the risk of a large incident. This is the thread that connects to Heinrich: the unstated idea that doing the robustness work to address the problems exposed by the smaller incidents will decrease the risk of the larger incidents.
In general, I’m skeptical about relying on predefined metrics, such as reliability, for getting insight into the risks of the system that could lead to big incidents. Instead, I prefer to focus on signals, which are not predefined metrics but rather some kind of information that has caught your attention that suggests that there’s some aspect of your system that you should dig into a little more. Maybe it’s a near-miss situation where there was no customer impact at all, or maybe it was an offhand remark made by someone in Slack. Signals by themselves don’t provide enough information to tell you where unseen risks are. Instead, they act as clues that can help you figure out where to dig to get more details. This is what the Learning from Incidents in Software movement is about.
I’m generally skeptical of metrics-based approaches, like error budgets, because they reify. The things that get measured are the things that get attention. I prefer to rely on qualitative approaches that leverage the expert judgment of engineers. The challenge with qualitative approaches is that you need to expose the experts to the information they need (e.g., putting the software engineers on-call), and they need the space to dig into signals (e.g., allowing time for incident analysis).
Over the past few weeks, I’ve had the experience multiple times where I’m communicating with someone in a semi-synchronous way (e.g., Slack, Twitter), and I respond to them without having properly understood what they were trying to communicate.
In one instance, I figured out my mistake during the conversation, and in another instance, I didn’t fully get it until after the conversation had completed, and I was out for a walk.
In these circumstances, I find that I’m primed to respond based on my expectations, which makes me likely to misinterpret. The other person is rarely trying to communicate what I’m expecting them to. Too often, I’m simply waiting for my turn to talk instead of really listening to what they are trying to say.
It’s tempting to blame this on Slack or Twitter, but I think this principle applies to all synchronous or semi-synchronous communication, including face-to-face conversations. I’ve certainly experienced this in technical interviews, where my brain is always primed to think, “What answer is the interviewer looking for?”
John Allspaw uses the term soak time to refer to the additional time it takes us to process the information we’ve received in a post-incident review meeting, so we can make better decisions about what the next steps are. I think it describes this phenomenon well.
Whether you call it soak time, l’esprit de l’escalier, or hammock-driven development, keep in mind that it takes time for your brain to process information. Give yourself permission to take that time. Insist on it.
One technique to help with making a decision is to compute a single metric for each of the options being considered, and then compare the values of those metrics. A common choice of metric is dollars, or ROI (return on investment, which is a unitless ratio of dollars). Are you trying to decide between two internal software development projects? Estimate the ROI for each one and pick the larger one. OKRs (objectives and key results) and error budgets are two other examples of using individual metrics to drive decisions such as “where should we focus our effort now?” or “can we push this new feature to production?”
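As a concrete sketch of what that looks like for the two-project example, here is the comparison collapsed to a single number; the cost and benefit figures are invented.

```python
# Single-metric decision making, sketched: estimate ROI for two hypothetical
# projects and pick whichever number is bigger. All figures are invented.

def roi(benefit_dollars: float, cost_dollars: float) -> float:
    """Return on investment as a unitless ratio of dollars."""
    return (benefit_dollars - cost_dollars) / cost_dollars

project_a = roi(benefit_dollars=500_000, cost_dollars=200_000)  # 1.5
project_b = roi(benefit_dollars=300_000, cost_dollars=100_000)  # 2.0

# Everything the estimates left out has already vanished; the decision is now
# just a comparison of two scalars.
print("Fund project B" if project_b > project_a else "Fund project A")
```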
A single-metric-based approach has the virtue of simplifying the final stage in the decision-making process: we simply compare two numbers (either two metrics or a metric against a threshold) in order to make our decision. Yes, it requires mapping the different factors under consideration onto the metric, but it’s tractable, right?
The problem is that the process of mapping the relevant factors into the single metric always involves subjective judgments that ultimately discard information. For example, for ROI calculations, consider the work involved in considering the various different kinds of costs and benefits and mapping those into dollars. Information that should be taken into account when making the final decision vanishes from this process as these factors get mapped into an anemic scalar value.
The problem here isn’t the use of metrics. Rather, it’s the temptation to squeeze all of the relevant information into a form that is representable in a single metric. A single metric frees the decision maker from having to make a subjective judgment that involves very different-looking factors. That’s a hard thing to do, and it can make people uncomfortable.
W. Edwards Deming was famous for railing against numerical targets. Note that he wasn’t opposed to metrics (he advocated for the value of professional statisticians and control charts). Rather, he was opposed to decisions that were made based on single metrics. Here are some quotes from his book Out of the Crisis on this topic:
Focus on outcome (management by numbers, MBO, work standards, meet specifications, zero defects, appraisal of performance) must be abolished, leadership put in place.
Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.
[M]anagement by numerical goal is an attempt to manage without knowledge of what to do, and in fact is usually management by fear.
Deming uses the term “leadership” as the alternative to the decision-by-single-metric approach. I interpret that term as the ability of a manager to synthesize information from multiple sources in order to make a decision holistically. It’s a lot harder than mapping all of the factors into a single metric. But nobody ever said being an effective leader is easy.