Back when I was an engineering student, I wanted to know “How do the big companies develop software? How does it happen in the real world?”
Now that I work at a company that has to do large-scale software development, I understand better why it’s not something you can really teach effectively in a university setting. It’s not that companies doing large-scale software development are somehow better at writing software than companies that work on smaller-scale software projects. It’s that large-scale projects face challenges that small-scale projects don’t.
The biggest challenge at large-scale is coordination. My employer provides a single service, which means that, in theory, any project that anyone is working on inside of the company could potentially impact what anybody else is working on. In my specific case, I work on delivery tools, so we might be called upon to support some new delivery workflow.
You can take a top-down command-and-control style approach to the problem, by having the people at the top attempting to filter all of the information to just what they need, and them coordinating everyone hierarchically. However, this structure isn’t effective in dynamic environments: as the facts on the ground change, it takes too long for information to work its way up the hierarchy, adapt, and then change the orders downwards.
You can take a bottoms-up approach to the problem where you have a collection of teams that work autonomously. But the challenge there is getting them aligned. In theory, you hire people with good judgment, and provide them with the right context. But the problem is that there’s too much context! You can’t just firehose all of the available information to everyone, that doesn’t scale: everyone will spend all of their time reading docs. How do you get the information into the heads of the people that need it? becomes the grand challenge in this context.
It’s hard to convey the nature of this problem in a university classroom, if you’ve never worked in a setting like this before. The flurry of memos, planning documents, the misunderstandings, the sync meetings, the work towards alignment, the “One X” initiatives, these are all things that I had to experience viscerally, on a first-hand basis, to really get a sense of the nature of the problem.
Chris Pruett – On deciding to leave LinkedIn and co-founding Jam, values based decision making and compassionate leadership – #19 –
Chris Pruett is the CTO and Co-founder of Jam – a new way to share and listen to bite-sized audio. Prior to Jam, Chris spent 9+ years at LinkedIn growing from an engineering manager to VP of Engineering. During his tenure at LinkedIn, he worked on almost all aspects of the app and towards the end, led an org of 500+ engineers working on Feed, Messaging, Identity and Search. In this episode, we discuss how he made the decision to leave his leadership position at LinkedIn and co-found Jam. We also spoke about his time at LinkedIn and how he developed the practice to make value based decisions both in professional and personal life.
Alex Kessinger (Stitch Fix) and David Noël-Romas (Stripe) –
Jens Rasmussen was a giant in the field of safety science research. You can see still his influence on the field, in the writings of safety researchers such as Sidney Dekker, Nancy Leveson, and David Woods.
This model looks like it views the state of the system as a point in a state space. But, Rasmussen described it as a model of the humans working within the system. He used the term “work space” rather than “state space”. In addition, Rasmussen used the metaphor of a gas particle undergoing local random movements, a phenomenon known as Brownian also.
Along with the random movements, Rasmussen saw envisioned different forces (he called them gradients) that influenced how the work system would move within the work space. One of these forces was pressure from management to get more work done in order to make the company more profitable. Woods refers to this phenomenon as “faster/better/cheaper pressure“. This is the arrow labeled Management Pressure toward Efficiency, which pushes away from the Boundary to Economic Failure.
One way to get more work done is to give people increasing loads of work. But people don’t like having more and more work piled on them, and so there is opposing pressure from the workforce to reduce the amount of work they have to do. This is the arrow labeled Gradient toward Least Effort which pushes away from the Boundary to Unacceptable Work Load.
The result of those two pressures is movement towards what the diagram labels “the Boundary of functionally acceptable performance”. This is the safety boundary, and we don’t know exactly where it is, which is why there’s a second boundary in the diagram labelled “Resulting perceived boundary of acceptable performance.” Accidents happen when we cross the safety boundary.
Boundary according to Woods
In David Woods’s work, he also writes about the role of boundaries in system safety, but despite this surface similarity, his model isn’t the same as Rasmussen’s.
Instead of a work space, Woods refers to an envelope. He uses terms like competence envelope or design envelope or envelope of performance. Woods has done safety research in aviation, and so I suspect he was influenced by the concept of a flight envelope in aircraft design.
The flight envelope defines a region in a state space that the aircraft is designed to function properly within. You can see in the diagram above that the envelope’s boundaries are defined by the stall speed, top speed, and maximum altitude. Bad things happen if you try to operate an aircraft outside of the envelope (hence the phrase pushing the envelope).
Woods’s competence envelope is a generalization of the idea of flight envelope to other types of systems. Any system has a range of inputs that it can handle: if you go outside that range, bad things happen.
Summarizing the differences
To Rasmussen, there is only one boundary in the work space related to accidents: the safety boundary. The other boundaries in the space generally aren’t even reachable, because of the natural pressure away from them. To Woods, the competence envelope is defined by multiple boundaries, and crossing any of them can result in an accident.
Both Rasmussen and Woods identified the role of faster/better/cheaper pressure in accidents. To Rasmussen, this pressure resulted in pushing the system to the safety boundary. But to Woods, this pressure changes the behavior at the boundary. Woods sees this pressure as contributing to brittleness, to systems that don’t perform well as they get close to the boundary of the performance envelope. Woods’s current work focuses on how systems can avoid being brittle by having the ability of moving the boundary as they get closer to it: expanding the competence envelope. He calls this graceful extensibility.
Here’s a question that all of us software developers face: How can we best use our knowledge about the past behavior of our system to figure out where we should be investing our time?
One approach is to use a technique from the SRE world called error budgets. Here are a few quotes from the How to Use Error Budgets chapter of Alex Hidalgo’s book: Implementing Service Level Objectives:
Measuring error budgets over time can give you great insight into the risk factors that impact your service, both in terms of frequency and severity. By knowing what kinds of events and failures are bad enough to burn your error budget, even if just momentarily, you can better discover what factors cause you the most problems over time. p71 [emphasis mine]
The basic idea is straightforward. If you have error budget remaining, ship new features and push to production as often as you’d like; once you run out of it, stop pushing feature changes and focus on relaiability instead. p87
Error budgets give you ways to make decisions about your service, be it a single microservice or your company’s entire customer-facing product. They also give you indicators that tell you when you can ship features, what your focus should be, when you can experiment, and what your biggest risk factors are. p92
The goal is not to only react when your users are extremely unhappy with you—it’s to have better data to discuss where work regarding your service should be moving next. p354
That sounds reasonable, doesn’t it? Look at what’s causing your system to break, and if it’s breaking too often, use that as a signal to address those issues that are breaking it. If you’ve been doing really well reliability-wise, an error budget gives you margin to do some riskier experimentation in production like chaos engineering or production load testing.
I have two issues with this approach, a smaller one and a larger one. I’ll start with the smaller one.
First, I think that if you work on a team where the developers operate their own code (you-build-it, you-run-it), and where the developers have enough autonomy to say, “We need to focus more development effort on increasing robustness”, then you don’t need the error budget approach to help you decide when and where to spend your engineering effort. The engineers will know where the recurring problems are because they feel the operational pain, and they will be able to advocate for addressing those pain points. This is the kind of environment that I am fortunate enough to work in.
I understand that there are environments where the developers and the operators are separate populations, or the developers aren’t granted enough autonomy to be able to influence where engineering time is spent, and that in those environments, an error budget approach would help. But I don’t swim in those waters, so I won’t say any more about those contexts.
To explain my second concern, I need to digress a little bit to talk about Herbert Heinrich.
Herbert Heinrich worked for the Travelers Insurance Company in the first half of the twentieth century. In the 1920s, he did a study of workplace accidents, examining thousands of claims made by companies that held insurance policies with Travelers. In 1931, he published his findings in a book: Industrial Accident Prevention: A Scientific Approach.
Heinrich’s work showed a relationship between the rates of near misses (no injury), minor injuries, and major injuries. Specifically: for every major injury, there are 29 minor injuries, and 300 no-injury accidents. This finding of 1:29:300 became known as the accident triangle.
One implication of the accident triangle is that the rate of minor issues gives us insight into the rate of major issues. In particular, if we reduce the rate of minor issues, we reduce the risk of major ones. Or, as Heinrich put it: Moral—prevent the accidents and the injuries will take care of themselves.
Heinrich’s work has since been criticized, and subsequent research has contradicted Heinrich’s findings. I won’t repeat the criticisms here (see Foundations of Safety Science by Sidney Dekker for details), but I will cite counterexamples mentioned in Dekker’s book:
So, what does any of this have to do with error budgets? At a glance, error budgets don’t seem related to Heinrich’s work at all. Heinrich was focused on safety, where the goal is to reduce injuries as much as possible, in some cases explicitly having a zero goal. Error budgets are explicitly not about achieving zero downtime (100% reliability), they’re about achieving a target that’s below 100%.
Here are the claims I’m going to make:
Large incidents are much more costly to organizations than small ones, so we should work to reduce the risk of large incidents.
Error budgets don’t help reduce risk of large incidents.
Here’s Heinrich’s triangle redrawn:
An error-budget-based approach only provides information on the nature of minor incidents, because those are the ones that happen most often. Near misses don’t impact the reliability metrics, and major incidents blow them out of the water.
Heinrich’s work assumed a fixed ratio between minor accidents and major ones: reduce the rate of minor accidents and you’d reduce the rate of major ones. By focusing on reliability metrics as a primary signal for providing insight into system risk, you only get information about these minor incidents. But, if there’s no relationship between minor incidents and major ones, then maintaining a specific reliability level doesn’t address the issues around major incidents at all.
An error-budget-based approach to reliability implicitly assumes there is a connection between reliability metrics and the risk of a large incident. This is the thread that connects to Heinrich: the unstated idea that doing the robustness work to address the problems exposed by the smaller incidents will decrease the risk of the larger incidents.
In general, I’m skeptical about relying on predefined metrics, such as reliability, for getting insight into the risks of the system that could lead to big incidents. Instead, I prefer to focus on signals, which are not predefined metrics but rather some kind of information that has caught your attention that suggests that there’s some aspect of your system that you should dig into a little more. Maybe it’s a near-miss situation where there was no customer impact at all, or maybe it was an offhand remark made by someone in Slack. Signals by themselves don’t provide enough information to tell you where unseen risks are. Instead, they act as clues that can help you figure out where to dig to get more details. This is what the Learning from Incidents in Software movement is about.
I’m generally skeptical of metrics-based approaches, like error budgets, because they reify. The things that get measured are the things that get attention. I prefer to rely on qualitative approaches that leverage the experiment judgment of engineers. The challenge with qualitative approaches is that you need to expose the experts to the information they need (e.g., putting the software engineers on-call), and they need the space to dig into signals (e.g., allow time for incident analysis).
Over the past few weeks, I’ve had the experience multiple times where I’m communicating with someone in a semi-synchronous way (e.g., Slack, Twitter), and I respond to them without having properly understood what they were trying to communicate.
In one instance, I figured out my mistake during the conversation, and in another instance, I didn’t fully get it until after the conversation had completed, and I was out for a walk.
In these circumstances, I find that I’m primed to respond based on my expectations, which makes me likely to misinterpret. The other person is rarely trying to communicate what I’m expecting them to. Too often, I’m simply waiting for my turn to talk instead of really listening to what they are trying to say.
It’s tempting to blame this on Slack or Twitter, but I think this principle applies in all synchronous or semi-synchronous communications: including face-to-face conversations. I’ve certainly experienced this when I’ve been in technical interviews, where my brain is always primed to think, “What answer is the interviewer looking for to that question?”
John Allspaw uses the term soak time to refer to the additional time it takes us to process the information we’ve received in a post-incident review meeting, so we can make better decisions about what the next steps are. I think it describes this phenomenon well.
One technique to help with making a decision is to compute a single metric for each of the options being considered, and then compare the value of those two metrics. A common metric for this situation is to use dollars or ROI (return on investment, which is a unitless ratio of dollars). Are you trying to decide between two internal software development projects? Estimate the ROI for each one and pick the larger one. OKRs (objectives and key results) and error budgets are two other examples of driving decisions using individual metrics, like “where should we focus our effort now?” or “can we push this new feature to production?”
A single-metric-based approach has the virtue of simplifying the final stage in the decision-making process: we simply compare two numbers (either two metrics or a metric against a threshold) in order to make our decision. Yes, it requires mapping the different factors under consideration onto the metric, but it’s tractable, right?
The problem is that the process of mapping the relevant factors into the single metric always involves subjective judgments that ultimately discard information. For example, for ROI calculations, consider the work involved in considering the various different kinds of costs and benefits and mapping those into dollars. Information that should be taken into account when making the final decision vanishes from this process as these factors get mapped into an anemic scalar value.
The problem here isn’t the use of metrics. Rather, it’s the temptation to squeeze all of the relevant information into a form that is representable in a single metric. A single metric frees the decision maker from having to make a subjective judgment that involves very different-looking factors. That’s a hard thing to do, and it can make people uncomfortable.
W. Edwards Deming was famous for railing against numerical targets. Note that he wasn’t opposed to metrics. (He advocated for the value of professional statisticians and control charts). Rather, he was opposed to decisions that were made based on single metrics. Here are some quotes from his book Out of the crisis on this topic:
Focus on outcome (management by numbers, MBO, work standards, meet specifications, zero defects, appraisal of performance) must be abolished, leadership put in place.
Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.
[M]anagement by numerical goal is an attempt to manage without knowledge of what to do, and in fact is usually management by fear.
Deming uses the term “leadership” as the alternative to the decision-by-single-metric approach. I interpret that term as the ability of a manager to synthesize information from multiple sources in order to make a decision holistically. It’s a lot harder than mapping all of the factors into a single metric. But nobody ever said being an effective leader is easy.
The New York Times has a story today, Inside VW’s Campaign of Trickery, about how Volkswagon conspired to hide their excessive diesel emissions from California regulators.
What was fascinating to me was that the emission violations were discovered by mechanical engineering researchers at West Virginia University, Dan Carder, Hemanth Kappanna, and Marc Besch (Kappanna and Besch were graduate students at the time).
Productivity tools have always held a special fascination for me. I also tend to futz around with multiple tools, trying to find the perfect match for my workflow. My toolset has been pretty stable for several months now. Here’s what I’m currently using.
I’ve been a fan of Gettings Things Done for a long time. Of the various GTD-supporting tools I’ve found, I like OmniFocus the best. Useful features include:
Syncs well between laptop and phone
Easy to add to the inbox via keyboard shortcut in OS X
Easy to add to the inbox in the iOS app
Integrates with reminders on iOS, which means I can say to my watch “Remind me to do X” and “do X” ends up in my OmniFocus inbox
Per-project support for “serial tasks” (only one next action) and “parallel tasks” (multiple next actions). I use this all of the time.
I can put projects “on hold” and they don’t show up in current context. In particular, I have an on-hold “Someday” task which acts as a catch-all for things I don’t want to forget but that I don’t plan on doing in the near term.
My contexts are:
VoodooPad is a personal wiki. It mainly has two uses: context for each project I’m working on (e.g., pastes of recent error messages), and reference pages for things like urls and commonly used code snippets or commands that I often forget.
I like how it’s free-form, and not just plain text. This means I can paste in images that are rendered inline, and I can render code and terminal output in fixed-width font, and my notes in variable-width font.
That being said, what I’d really like is some content system that lets me organize by a topic and by date, and VoodooPad only does by topic, but it’s the closest I’ve been able to find.
Emergent Task Planner
I use a notebook called the Emergent Task Planner to structure my day. I write down tasks that I’d like to accomplish that day and schedule them in chunks of time. I often don’t follow the specific schedule, but I find it helps if I take some time to think about what I’m going to try to accomplish, as well as explicitly scheduling out time for checking email so I’m less tempted to do that while working.
Ubiquitous capture tools
Getting Things Done has a notion of “ubiquitious capture”: being able to quickly capture content that you can come back to later. In addition to OmniFocus, I use a few other tools for ubiquitious capture:
I keep a stack of index cards in my back pocket with a binder clip and along with a Fisher space pen. It’s often faster to scribble on an index card than to take out my phone. This was inspired by Merlin Mann’s Hipster PDA.
When I encounter a book or academic paper I’d like to read, I clip it to CiteULike.
If I encounter an essay on the web I don’t have time to read, I use Instapaper to capture it for later . It has great Kindle support: every week it automatically emails the content to my Kindle Paperwhite.
I use Pinboard to bookmark reference material. I was a Delicious user for a long time, but Pinboard’s UX is so much better, than I’m happy to pay them for it rather than use Delicious for free.
A Tesla driver was killed in a car crash while the Autopilot system was engaged. According to the news report:
Joshua D. Brown, of Canton, Ohio, died in the accident May 7 in Williston, Florida, when his car’s cameras failed to distinguish the white side of a turning tractor-trailer from a brightly lit sky and didn’t automatically activate its brakes, according to government records obtained Thursday.
These types of automative systems are completely outside my area of expertise. That being said, I imagine that validating this type of control system that relies on complex sensor data must be incredibly challenging. The input space is mind-bogglingly huge, so how do you catch these kinds of corner cases in testing?
The failure here is not due to a “bug” (or “defect” in academic software engineering jargon) in the traditional sense that we use the term. Yet, there clearly was a defect in this system, and the result was a human fatality.
I was also struck by this line:
Harley [an analyst at Kelley Blue Book] called the death unfortunate, but said that more deaths can be expected as the autonomous technology is refined.
I wonder if future deaths will lead to additional regulations on how software engineering work is done in domains like this.