Because coordination is expensive

If you’ve ever worked at a larger organization, stop me if you’ve heard (or asked!) any of these questions:

  • “Why do we move so slowly as an organization? We need to figure out how to move more quickly.”
  • “Why do we work in silos? We need to figure out how to break out of these.”
  • “Why do we spend so much of our time in meetings? We need to explicitly set no-meeting days so we can actually get real work done.”
  • “Why do we maintain multiple solutions for solving what’s basically the same problem? We should just standardize on one solution instead of duplicating work like this.”
  • “Why do we have so many layers of management? We should remove layers and increase span of control.”
  • “Why are we constantly re-org’ing? Re-orgs are so disruptive.”

(As an aside, my favorite “multiple solutions” example is workflow management systems. I suspect that every senior-level engineer has contributed code to at least one home-grown workflow management system in their career).

The answer to all of these questions is the same: because coordination is expensive. It requires significant effort for a group of people to work together to achieve a task that is too large for them to accomplish individually. And the more people that are involved, the higher that coordination effort grows. This is “effort” both in terms of difficulty (effortful as in hard) and in terms of time (engineering effort, as measured in person-hours). This is why you see siloed work and multiple systems that seem to do the same thing: because it requires less effort to work within your organization than to coordinate across organizations, the incentive is to do localized work whenever possible, in order to reduce those costs.

Time spent in meetings is one aspect of this cost, and it’s something people acutely feel, because it deprives them of their individual work time. But the meeting time is still work; it’s just unsatisfying-feeling coordination work. When was the last time you talked about your participation in meetings in your annual performance review? Nobody gets promoted for attending meetings, but we humans need them to coordinate our work, and that’s why they keep happening. As organizations grow, they require more coordination, which means more resources being put into coordination mechanisms, like meetings and middle management. It’s like an organizational law of thermodynamics. It’s why you’ll hear ICs at larger organizations talk about Tanya Reilly’s notion of glue work so much. It’s also why you’ll see larger companies run “One <COMPANY NAME>” campaigns as an attempt to improve coordination; I remember the One SendGrid campaign back when I worked there.

Comic by ex-Googler Manu Cornet, 2021-02-18

Because of the challenges of coordination, there’s a brisk market in coordination tools. Some examples off the top of my head include: Gantt charts, written specifications, Jira, Slack, daily stand-ups, OKRs, kanban boards, Asana, Linear, pull requests, email, Google docs, and Zoom. I’m sure you could name dozens more, including some that are no longer with us. (Remember Google Wave?) Heck, both spoken and written language are the ultimate communication ur-tools.

And yet, despite the existence of all of those tools, it’s still hard to coordinate. Remember back in 2002 when Google experimented with eliminating engineering managers? (“That experiment lasted only a few months.”) And then in 2015 when Zappos experimented with holacracy? (“Flat on paper, hierarchy in practice.”) I don’t blame them for trying different approaches, but I’m also not surprised that these experiments failed. Human coordination is just fundamentally difficult. There’s no one weird trick that is going to make the problem go away.

I think it’s notable that large companies try different strategies to manage ongoing coordination costs. Amazon is famous for using a decentralization strategy: they have historically operated almost like a federation of independent startups, and enforce coordination through software service interfaces, as described in Steve Yegge’s famous internal Google memo. Google, on the other hand, is famous for using an invest-heavily-in-centralized-tooling approach to coordination. But there are other types of coordination that are outside of the scope of these sorts of solutions, such as working on an initiative that involves work from multiple different teams and orgs. I haven’t worked inside of either Amazon or Google, so I don’t know how well things work in practice there, but I bet employees have some great stories!

During incidents, coordination becomes an acute problem, and we humans are pretty good at dealing with acute problems. The organization will explicitly invest in an incident manager on-call rotation to help manage those communication costs. But coordination is also a chronic problem in organizations, and we’re just not as good at dealing with chronic problems. The first step, though, is recognizing the problem. Meetings are real work. That work is frequently done poorly, but that’s an argument for getting better at it, because it’s important work that needs to get done. Oh, and those people doing glue work? They’re providing real value.

Amdahl, Gustafson, coding agents, and you

In the software operations world, if your service is successful, then eventually the load on it is going to increase to the point where you’ll need to give that service more resources. There are two strategies for increasing resources: scale up and scale out.

Scaling up means running the service on a beefier system. This works well, but you can only scale up so much before you run into limits of how large a machine you have access to. AWS has many different instance types, but there will come a time when even the largest instance type isn’t big enough for your needs.

The alternative is scaling out: instead of running your service on a bigger machine, you run your service on more machines, distributing the load across those machines. Scaling out is very effective if you are operating a stateless, shared-nothing microservice: any machine can service any request. It doesn’t work as well for services where the different machines need to access shared state, like a distributed database. A database is harder to scale out because the machines need to share state, which means they need to coordinate with each other.

Once you have to do coordination, you no longer get a linear improvement in capacity based on the number of machines: doubling the number of machines doesn’t mean you can handle double the load. This comes up in scientific computing applications, where you want to run a large computing simulation, like a climate model, on a large-scale parallel computer. You can run independent simulations very easily in parallel, but if you want to run an individual simulation more quickly, you need to break up the problem in order to distribute the work across different processors. Imagine modeling the atmosphere as a huge grid, and dividing up that grid and having different processors work on simulating different parts of the grid. You need to exchange information between processors at the grid boundaries, which introduces the need for coordination. Incidentally, this is why supercomputers have custom networking architectures, in order to try to reduce these expensive coordination costs.
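As a toy illustration of that boundary exchange, here’s a minimal sketch in plain Python (the chunk sizes and the simple averaging rule are made up for the example, standing in for the real physics): the grid is a list split into one chunk per pretend processor, and before each update every chunk has to obtain its neighbors’ edge cells.

    # Toy 1D "climate grid" split into chunks, one per (pretend) processor.
    grid = [float(i) for i in range(12)]
    num_workers = 3
    chunk_size = len(grid) // num_workers
    chunks = [grid[i * chunk_size:(i + 1) * chunk_size] for i in range(num_workers)]

    def step(chunks):
        new_chunks = []
        for w, chunk in enumerate(chunks):
            # The coordination step: fetch "halo" cells from neighboring chunks.
            # On a real parallel machine this is a network message, not a list lookup.
            left = chunks[w - 1][-1] if w > 0 else chunk[0]
            right = chunks[w + 1][0] if w < len(chunks) - 1 else chunk[-1]
            padded = [left] + chunk + [right]
            # The local (parallelizable) step: each cell becomes the average
            # of its two neighbors.
            new_chunks.append([(padded[i - 1] + padded[i + 1]) / 2
                               for i in range(1, len(padded) - 1)])
        return new_chunks

    chunks = step(chunks)

The local averaging parallelizes perfectly; the halo exchange is the part that forces the processors to wait on each other.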

In the 1960s, the American computer architect Gene Amdahl made the observation that the theoretical performance improvement you can get from a parallel computer is limited by the fraction of work that cannot be parallelized. Imagine you have a workload where 99% of the work is amenable to parallelization, but 1% of it can’t be parallelized:

Let’s say that running this workload on a single machine takes 100 hours. Now, if you ran this on an infinitely large supercomputer, the parallelizable part (the green part above) would go from 99 hours to 0. But you are still left with the 1 hour of work that you can’t parallelize, which means that you are limited to a 100x speedup no matter how large your supercomputer is. This upper limit on speedup, based on the fraction of the workload that is parallelizable, is known today as Amdahl’s Law.
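To make that concrete, here’s a small sketch (in Python; the code is just mine for illustration, the formula itself is the standard one) that plugs the 99%/1% workload into Amdahl’s formula, speedup = 1 / ((1 − p) + p/n), where p is the parallelizable fraction and n is the number of processors:

    def amdahl_speedup(p, n):
        """Amdahl's Law: p is the parallelizable fraction, n is the processor count."""
        return 1 / ((1 - p) + p / n)

    p = 0.99  # the 99 hours out of 100 that we can parallelize
    for n in [10, 100, 1_000, 1_000_000]:
        print(f"{n} processors: {amdahl_speedup(p, n):.1f}x speedup")
    # Prints roughly: 10 -> 9.2x, 100 -> 50.3x, 1,000 -> 91.0x, 1,000,000 -> 100.0x.
    # The serial 1% caps the speedup at 1/0.01 = 100x, no matter how big n gets.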

But there’s another law about scalability on parallel computers, and it’s called Gustafson’s Law, named for the American computer scientist John Gustafson. Gustafson observed that people don’t just use supercomputers to solve existing problems more quickly. Instead, they exploit the additional resources available in supercomputers to solve larger problems. The larger the problem, the more amenable it is to parallelization. And so Gustafson proposed scaled speedup as an alternative metric, which takes this into account. As he put it: in practice, the problem size scales with the number of processors.
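Here’s the same kind of sketch for Gustafson’s scaled speedup (the formula is the textbook one; the 1% serial fraction is just carried over from the example above): if s is the fraction of time the parallel machine spends on serial work, the scaled speedup on n processors is n − s·(n − 1), which keeps growing as the problem and the machine grow together.

    def gustafson_scaled_speedup(s, n):
        """Gustafson's Law: s is the serial fraction of time on the parallel machine,
        n is the processor count. The parallel portion of the problem is assumed
        to grow with the machine."""
        return n - s * (n - 1)

    s = 0.01
    for n in [10, 100, 1_000]:
        print(f"{n} processors: {gustafson_scaled_speedup(s, n):.2f}x scaled speedup")
    # Prints roughly: 10 -> 9.91x, 100 -> 99.01x, 1,000 -> 990.01x.
    # No 100x ceiling here, because the bigger machine is used to run a
    # proportionally bigger simulation.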

And that brings us to LLM-based coding agents.

AI coding agents improve programmer productivity: they can generate working code a lot more quickly than humans can. As a consequence, I think we are going to find the same result that Gustafson observed at Sandia National Labs: people will use this productivity increase to do more work, rather than simply doing the same amount of coding work with fewer resources. This is a direct consequence of the law of stretched systems from cognitive systems engineering: systems always get driven to their maximum capacity. If coding agents save you time, you’re going to be expected to do additional work with that newfound time. You launch that agent, then you go off to do other work, and then you context-switch back when the agent is ready for more input.

And that brings us back to Amdahl: coordination still places a hard limit on how much you can actually do. Another finding from cognitive systems engineering is that coordination costs, continually. Coordination work requires continuous investment of effort. The path we’re on feels like the work of software development is shifting from direct coding, to a human coordinating with a single agent, to a human coordinating work among multiple agents working in parallel. It’s possible that we will be able to fully automate this coordination work, by using agents to do the coordination. Steve Yegge’s Gas Town project is an experiment to see how far this sort of automated agent-based coordination can go. But I’m pessimistic on this front. I think that we’ll need human software engineers to coordinate coding agents for the foreseeable future. And the law of stretched systems teaches us that these multi-coding-agent systems are going to keep scaling up the number of agents until the human coordination work becomes the fundamental bottleneck.

My favorite developer productivity research method that nobody uses

You’ve undoubtedly heard of the psychological concept called flow state. This is the feeling you get when you’re in the zone: you’re doing some sort of task, and you’re just really into it, and you’re focused, and it’s challenging but not frustratingly so. It’s a great feeling. You might experience this with a work task, or a recreational one, like when playing a sport or a video game. The pioneering researcher on the phenomenon of flow was the Hungarian-American psychologist Mihaly Csikszentmihalyi, who wrote a popular book on the subject back in 1990 with the title Flow: The Psychology of Optimal Experience, which I read many years ago. But the thing that stuck with me most from Csikszentmihalyi’s book was the research method that he used to study flow.

One of the challenges of studying people’s experiences is that it’s difficult for researchers to observe them directly. This problem comes up when an organization tries to do research on the current state of developer productivity within the organization. I harp on “make work visible” a lot because so much of the work we do in the software world is so hard for others to see. There are different data collection techniques that developer productivity researchers use, including surveys, interviews, and focus groups, as well as automatic collection of metrics, like the DORA metrics. Of those, only the automatic collection of metrics captures in-the-moment data, and it’s a very thin type of data at that. Those metrics can’t give you any insights into the challenges that your developers are facing.

My preferred technique is the case study, which I try to apply to incidents. I like the incident case study technique because it gives us an opportunity to go deep into the nature of the work for a specific episode. But incident-as-case-study only works for, well, incidents, and while a well-done incident case study can shine a light on the nature of the development work, there’s also a lot that it will miss.

Csikszentmihalyi used a very clever approach which was developed by his PhD student Suzanne Prescott, called experience sampling. He gave the participants of his study pagers, and he would page them at random times. When paged, the participants would write down information about their experiences in a journal in-the-moment. In this way, he was able to collect information about subjective experience, without the problems you get when trying to elicit an account retrospectively.

I’ve never read about anybody trying to use this approach to study developer productivity, and I think that’s a shame. It’s something I’ve wanted to try myself, except that I have not worked in the developer productivity space for a long, long time.

These days, I’d probably use Slack rather than a pager and journal to randomly reach out to the volunteers during the study and collect their responses, but the principle is the same. I’ve long wanted to capture an “are you currently banging your head against a wall” metric from developers, but with experience sampling, you could capture “what are you currently banging your head against the wall about?”
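For what it’s worth, the mechanics are simple enough to sketch. Here’s a rough illustration in Python of what a Slack-based sampler might look like, assuming you’ve configured a Slack incoming webhook for the study participants (the webhook URL, the working-hours window, and the prompt wording below are all made up for the example):

    import json
    import random
    import time
    import urllib.request

    # Hypothetical incoming webhook for the study channel (placeholder URL).
    WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"
    PROMPT = ("Experience sampling ping: what are you working on right now, "
              "and what (if anything) are you banging your head against?")

    def ping():
        # Slack incoming webhooks accept a JSON payload with a "text" field.
        body = json.dumps({"text": PROMPT}).encode("utf-8")
        request = urllib.request.Request(
            WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)

    def run_one_workday(pings_per_day=3, workday_seconds=8 * 60 * 60):
        # Pick a few random moments during the workday, then sleep until each one.
        offsets = sorted(random.uniform(0, workday_seconds)
                         for _ in range(pings_per_day))
        start = time.time()
        for offset in offsets:
            time.sleep(max(0, start + offset - time.time()))
            ping()

    run_one_workday()

The interesting part of the study isn’t the plumbing, of course; it’s what you do with the answers.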

Would this research technique actually work for studying developer productivity issues within an organization? I honestly don’t know. But I’d love to see someone try.


Note: I originally had the incorrect publication date for the Flow book. Thanks to Daniel Miller for the correction.

Tradeoff costs in communication

If you work in software, and I say the word server to you, which do you think I mean?

  • Software that responds to requests (e.g., http server)
  • A physical piece of hardware (e.g., a box that sits in a rack in a data center)
  • A virtual machine (e.g., an EC2 instance)

The answer, of course, is it depends on the context. The term server could mean any of those things. The term is ambiguous; it’s overloaded to mean different things in different contexts.

Another example of an overloaded term is service. From the end user’s perspective, the service is the system they interact with:

From the end user’s perspective, there is a single service

But if we zoom in on that box labeled service, it might be implemented by a collection of software components, where each component is also referred to as a service. This is sometimes referred to as a service-oriented architecture or a microservice architecture.

A single “service” may be implemented by multiple “services”. What does “service” mean here?

Amusingly, when I worked at Netflix, people referred to microservices as “services”, but people also referred to all of Netflix as “the service”. For example, instead of saying, “What are you currently watching on Netflix?”, a person would say, “What are you currently watching on the service?”

Yet another example is the term “client”. This could refer to the device that the end-user is using (e.g., web browser, mobile app):

Or it could refer to the caller service in a microservice architecture:

It could also refer to the code in the caller service that is responsible for making the request, typically packaged as a client library.

The fact that the meaning of these terms is ambiguous and context-dependent makes it harder to understand what someone is talking about when the term is used. While the person speaking knows exactly what sense of server, service or client they mean, the person hearing it does not.

The ambiguous meaning of these terms creates all sorts of problems, especially when communicating across different teams, where the meaning of a term used by the client team of a service may be different from the meaning of the term used by the owner of that service. I’m willing to bet that you, dear reader, have experienced this problem at some point in the past when reading an internal tech doc or trying to parse the meaning of a particular Slack message.

As someone who is interested in incidents, I’m acutely aware of the problematic nature of ambiguous language during incidents, where communication and coordination play an enormous role in effective incident handling. But it’s not just an issue for incident handling. For example, Eric Evans advocates the use of ubiquitous language in software design. He pushes for consistent use of terms across different stakeholders to reduce misunderstandings.

In principle, we could all just decide to use more precise terminology. This would make it easier for listeners to understand the intent of speakers, and would reduce the likelihood of problems that stem from misunderstandings. At some level, this is the role that technical jargon plays. But client, server and service are technical jargon, and they’re still ambiguous. So, why don’t we just use even more precise language?

The problem is that expressing ourselves unambiguously isn’t free: it costs the speaker additional effort to be more precise. As a trivial example, microservice is more precise than service, but it takes twice as long to say, and it takes an additional five letters to write. Those extra syllables and letters are a cost to the speaker. And, all other things being equal, people prefer to expend less effort rather than more. This is also why we don’t like being on the receiving end of ambiguous language: we have to put more effort into resolving the ambiguity through context clues.

The cost of precision to the speaker is clear in the world of computer programming. Traditional programming languages require an extremely high degree of precision on the part of the coder. This sets a very high bar for being able to write programs. On the other hand, modern generative AI tools are able to take natural language inputs as specifications, which are orders of magnitude less precise, and turn them into programs. These tools are able to process ambiguous inputs in ways that regular programming languages simply cannot. The cost in effort is much lower for the vibe programmer. (I will leave evaluation of the outcomes of vibe programming as an exercise for the reader.)

Ultimately, the degree of precision in communication is a tradeoff: an increase in precision means less effort for the listener and less risk of misunderstanding, at the cost of more effort for the speaker. Because of this tradeoff, we shouldn’t expect the equilibrium point to be at maximal precision. Instead, it’s somewhere in the middle. Ideally, it would be where we minimize the total effort. Now, I’m not a cognitive scientist, but this is a theory that has been advanced by cognitive scientists. For example, see the paper The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily, and Edward Gibson. I touched on the topic of ambiguity more generally in a previous post, the high cost of low ambiguity.

We often ask “why are people doing X instead of the obviously superior Y?” This is an example of how we are likely missing the additional costs of choosing to do Y over X. Just because we don’t notice those costs doesn’t mean they aren’t there. It means we aren’t looking closely enough.

The carefulness knob

A play in one act

Dramatis personae

  • EM, an engineering manager
  • TL, the tech lead for the team
  • X, an engineering manager from a different team

Scene 1: A meeting room in an office. The walls are adorned with whiteboards with boxes and arrows.

EM: So, do you think the team will be able to finish all of these features by the end of Q2?

TL: Well, it might be a bit tight, but I think it should be possible, depending on where we set the carefulness knob.

EM: What’s the carefulness knob?

TL: You know, the carefulness knob! This thing.

TL leans over and picks a small box off of the floor and places it on the table. The box has a knob on it with numerical markings.

EM: I’ve never seen that before. I have no idea what it is.

TL: As the team does development, we have to make decisions about how much effort to spend on testing, how closely to hew to explicitly documented processes, that sort of thing.

EM: Wait, aren’t you, like, careful all of the time? You’re responsible professionals, aren’t you?

TL: Well, we try our best to allocate our effort based on what we estimate the risk to be. I mean, we’re a lot more careful when we do a database migration than we do when we fix a typo in the readme file!

EM: So… um… how good are you at actually estimating risk? Wasn’t that incident that happened a few weeks ago related to a change that was considered a low risk at the time?

TL: I mean, we’re pretty good. But we’re definitely not perfect. It certainly happens that we misjudge the risk sometimes. I mean, isn’t every incident in some sense a misjudgment of risk? How many times do we really say, “Hoo boy, this thing I’m doing is really risky, we’re probably going to have an incident!” Not many.

EM: OK, so let’s turn that carefulness knob up to the max, to make sure that the team is careful as possible. I don’t want any incidents!

TL: Sounds good to me! Of course, this means that we almost certainly won’t have these features done by the end of Q2, but I’m sure that the team will be happy to hear…

EM: What, why???

TL picks up a marker off of the table and walks up to the whiteboard. She draws an x-axis and a y-axis. She labels the x-axis “carefulness” and the y-axis “estimated completion time”.

TL: Here’s our starting point: the carefulness knob is currently set at 5, and we can probably hit the end of Q2 if we keep it at this setting.

EM: What happens if we turn up the knob?

TL draws an exponential curve.

EM: Woah! That’s no good. Wait, if we turn the carefulness knob down, does that mean that we can go even faster?

TL: If we did that, we’d just be YOLO’ing our changes, not doing validation. Which means we’d increase the probability of incidents significantly, which end up taking a lot of time to deal with. I don’t think we’d actually end up delivering any faster if we chose to be less careful than we normally are.

EM: But won’t we also have more incidents at a carefulness setting of 5 than at higher carefulness settings?

TL: Yes, there’s definitely more of a risk that a change that we incorrectly assess as low risk ends up biting us at our default carefulness level. It’s a tradeoff we have to make.

EM: OK, let’s just leave the carefulness knob at the default setting.


Scene 2: An incident review meeting, two and a half months later.

X: We need to be more careful when we make these sorts of changes in the future!

Fin


Coda

It’s easy to forget that there is a fundamental tradeoff between how careful we can be and how much time it will take us to perform a task. This is known as the efficiency-thoroughness trade-off, or ETTO principle.

You’ve probably hit a situation where it’s particularly difficult to automate the test for something, and doing the manual testing is time-intensive: you developed the feature and tested it, but then there was a small issue that you needed to resolve, and now do you go through all of the manual testing again? We make these sorts of time tradeoffs in the small. They’re individual decisions, but they add up, and we’re always under schedule pressure to deliver.

As a result, we try our best to adapt to the perceived level of risk in our work. The Human and Organizational Performance folks are fond of the visual image of the black line versus the blue line to depict the difference between how the work is supposed to be done and how workers adapt to get their work done.

But sometimes these adaptations fail. And when this happens, inevitably someone says “we need to be more careful”. But imagine if you explicitly asked that person at the beginning of a project where they wanted to set that carefulness knob, and they had to accept that increasing the setting would increase the schedule significantly. If an incident happened, you could then say to them, “well, clearly you set the carefulness knob too low at the beginning of this project”. Nobody wants to explicitly make the tradeoff between being less careful and having a time estimate that’s seen as excessive. And so the tradeoff gets made implicitly. We adapt as best we can to the risk. And we do a pretty good job at that… most of the time.

Safety first!

I’m sure you’ve heard the slogan “safety first”. It is a statement of values for an organization, but let’s think about how to define what it should mean explicitly. Here’s how I propose to define safety first, in the context of a company. I’ll assume the company is in the tech (software) industry, since that’s the one I know best. So, in this context, you can think of “safety” as being about avoiding system outages, rather than about, say, avoiding injuries on a work site.

Here we go:


A tech company is a safety first company if any engineer has the ability to extend a project deadline, provided that the engineer judges in the moment that they need additional time in order to accomplish the work more safely (e.g., by following an onerous procedure for making a change, or doing additional validation work that is particularly time-intensive).

This ability to extend the deadline must be:

  1. automatic
  2. unquestioned
  3. consequence-free

Automatic. The engineer does not have to explicitly ask someone else for permission before extending the deadline.

Unquestioned. Nobody is permitted to ask the engineer “why did you extend the deadline?” after-the-fact.

Consequence-free. This action cannot be held against the engineer. For example, it cannot be a factor in a performance review.


Now, anyone who has worked in management would say to me, “Lorin, this is ridiculous. If you give people the ability to extend deadlines without consequence, then they’re just going to use this constantly, even if there isn’t any benefit to safety. It’s going to drastically harm the organization’s ability to actually get anything done”.

And, the truth is, they’re absolutely right. We all work under deadlines, and we all know that if there was a magical “extend deadline” button that anyone could press, that button would be pressed a lot, and not always for the purpose of improving safety. Organizations need to execute, and if anybody could introduce delays, this would cripple execution.

But this response is exactly the reason why safety first will always be a lie. Production pressure is an unavoidable reality for all organizations. Because of this, the system will always push back against delays, and that includes delays for the benefit of safety. This means engineers will always face double binds, where they will feel pressure to execute on schedule, but will be punished if they make decisions that facilitate execution but reduce safety.

Safety is never first in an organization: it’s always one of a number of factors that trade off against each other. And those sorts of tradeoff decisions happen day-to-day and moment-to-moment.

Remember that the next time someone is criticized for “not being careful enough” after a change brings down production.

When there’s no gemba to go to

I’m finally trying to read through some Toyota-related books to get a better understanding of the lean movement. Not too long ago, I read Shigeo Shingo’s Non-Stock Production: The Shingo System of Continuous Improvement, and sitting on my bookshelf for a future read is James Womack, Daniel Jones, and Daniel Roos’s The Machine That Changed the World: The Story of Lean Production.

The Toyota-themed book I’m currently reading is Mike Rother’s Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results. Rother often uses the phrase “go and see”, as in “go to the shop floor and observe how the work is actually being done”. I’ve often heard lean advocates use a similar phrase: go to the gemba, although Rother himself doesn’t use it in his book. There’s a good overview at the Lean Enterprise Institute’s web page for gemba:

Gemba (現場) is the Japanese term for “actual place,” often used for the shop floor or any place where value-creating work actually occurs. It is also spelled genba. Lean Thinkers use it to mean the place where value is created. Japanese companies often supplement gemba with the related term “genchi gembutsu” — essentially “go and see” — to stress the importance of empiricism.

The idea of focusing on understanding work-as-done is a good one. Unfortunately, in software development in particular, and knowledge work in general, the place that the work gets done is distributed: it happens wherever the employees are sitting in front of their computers. There’s no single place, no shop floor, no gemba that you can go to in order to go and see the work being done.

Now, you can observe the effects of the work, whether it’s artifacts generated (pull requests, docs) or communication (Slack messages, emails). And you can talk to people about the work that they do. But it’s not like going to the shop floor. There is no shop floor.

And it’s precisely because we can’t go to the gemba that incident analysis can bring so much value: it allows you to essentially conduct a miniature research project to try to achieve the same goal. You get granted some time (a scarce resource!) to reconstruct what happened, by talking to people and looking at those work products generated over time. If we’re good at this, and we’re lucky, we can get a window into how the real work happens.

Making peace with the imperfect nature of mental models

We all carry with us in our heads models about how the world works, which we colloquially refer to as mental models. These models are always incomplete, often stale, and sometimes they’re just plain wrong.

For those of us doing operations work, our mental models include our understanding of how the different parts of the system work. Incorrect mental models are always a factor in incidents: incidents are always surprises, and surprises are always discrepancies between our mental models and reality.

There are two things that are important to remember. First, our mental models are usually good enough for us to do our operations work effectively. Our human brains are actually surprisingly good at enabling us to do this stuff. Second, while a stale mental model is a serious risk, none of us have the time to constantly verify that all of our mental models are up to date. This is the equivalent of popping up an “are you sure?” modal dialog box before taking any action. (“Are you sure that pipeline that always deploys to the test environment still deploys to test first?”)

Instead, because our time and attention are limited, we have to get good at identifying cues that indicate our models have gotten stale or are incorrect. But since we won’t always get these cues, our mental models will inevitably go out of date. That’s just part of the job when you work in a dynamic environment. And we all work in dynamic environments.

Incident categories I’d like to see

If you’re categorizing your incidents by cause, here are some options for causes that I’d love to see used. These are all taken directly from the field of cognitive systems engineering research.

Production pressure

All of us are so often working near saturation: we have more work to do than time to do it. As a consequence, we experience pressure to get that work done, and the pressure affects how we do our work and the decisions we make. Multi-tasking is a good example of a symptom of production pressure.

Ask yourself “for the people whose actions contributed to the incident, what was their personal workload like? How did it shape their actions?”

Goal conflicts

Often we’re trying to achieve multiple goals while doing our work. For example, you may have a goal to get some new feature out quickly (production pressure!), but you also have a goal to keep your system up and running as you make changes. This creates a goal conflict around how much time you should put into validation: the goal of delivering features quickly pushes you towards reducing validation time, and the goal of keeping the system up and running pushes you towards increasing validation time.

If someone asks “Why did you take action X when it clearly contravenes goal G?”, you should ask yourself “was there another important goal, G1, that this action was in support of?”

Workarounds

How do you feel about the quality of the software tools that you use in order to get your work done? (As an example: how are the deployment tools in your org?)

Often the tools that we use are inadequate in one way or another, and so we resort to workarounds: getting our work done in a way that works but is not the “right” way to do it (e.g., not how the tool was designed to be used, against the official process of how to do things). Using workarounds is often dangerous because the system wasn’t designed with that type of work in mind. But if the dangerous way of doing work is the only way that the work can get done, then you’re going to end up with people taking dangerous actions.

If an incident involves someone doing something they weren’t “supposed to”, you should ask yourself, “did they do it this way because they are working around some deficiency in the tools that they have to use?”

Automation surprises

Software automation often behaves in ways that people don’t expect: we have incorrect mental models of why the system is doing what it is, often because the system isn’t designed in a way to make it easy for us to form good mental models of behavior. (As someone who works on a declarative deployment system, I acutely feel the pain we can inflict on our users in this area).

If someone took the “wrong” action when interacting with a software system in some way, ask yourself “what was their understanding of the state of the world at the time, and what was their understanding of what the result of that action would be? How did they form their understanding of the system behavior?”


Do you find this topic interesting? If so, I bet you’ll enjoy attending the upcoming Learning from Incidents Conference taking place on Feb 15-16, 2023 in Denver, CO.

Writing docs well: why should a software engineer care?

Recently I gave a guest lecture in a graduate level software engineering course on the value of technical writing for software engineers. This post is a sort of rough transcript of my talk.

I live-sketched my slides as I went.

I talked about three goals of doing technical writing.

The first one is about building shared understanding among the stakeholders of a document. One of the hardest problems in software engineering is getting multiple people to have a sufficient understanding of some technical aspect, like the actual problem being solved, or a proposed solution. This is ostensibly the only real goal of technical writing.

Shared understanding is related to the idea of common ground that you’ll sometimes hear the safety folks talk about.

If you’re a programmer who works completely alone, then this is a problem you generally don’t have to solve, because there’s only one person involved in the software project.

But as soon as you are working in a team, then you have to address the problem of shared understanding.

When we work on something technical, like software, we develop a much deeper understanding because we’re immersed in it. This can make communication hard when we’re talking to someone who hasn’t been working in the same area and so doesn’t have the same level of technical understanding of that particular bit.

If you’re working only with a small, co-located group (e.g., in a co-located startup), then having a discussion in front of a whiteboard is a very effective mechanism for building shared understanding. In this type of environment, writing effective technical docs is much less important.

The problem with the discuss-in-front-of-the-whiteboard approach is that it doesn’t scale up, and it also doesn’t work for distributed environments.

And this is where technical documents come in.

I like to say that the hardest problem in software engineering is getting the appropriate information into the heads of the people who need to have that information in order to do their work effectively.

In large organizations, a lot of the work is interconnected, which means that some work that somebody else is doing can affect your work. If you’re not aware of that, you can end up working at cross-purposes.

The challenge is that there’s so much potential information that might be useful. Everyone could potentially spend all of their working hours reading docs, and still not read everything that might be relevant.

Writing a doc well means getting people to a sufficient level of understanding so that you can coordinate work effectively.

The second goal of writing I talked about was using writing to help with your own thinking.

The cartoonist Richard Guindon has a famous quote: “writing is nature’s way of letting you know how sloppy your thinking is.” You might have an impression that you understand something well, but that sense of clarity is often an illusion, and when you go to explicitly capture your understanding in a document, you discover that you didn’t understand things as well as you thought. There’s nowhere to hide in your own document.

When writing technical docs, I always try hard to work explicitly through examples to demonstrate the concepts. This is one of the biggest weaknesses I see in practice in technical docs, that the author has not described a scenario from start to finish. Conceptually, you want your doc to have something like a storyboard that’s used in the film industry, to tell the story. Writing out a complete example will force you to confront the gaps in your understanding.

The third goal is a bit subversive: it’s how to use effective technical writing to have influence in a larger organization when you’re at the bottom of the hierarchy.

If you want influence, you likely have some sort of vision of where you want the broader organization to go, and the challenge is to persuade people of influence to move things closer to your vision.

Because senior leadership, like everyone else in the organization, only has a finite amount of time and attention, their view of reality is shaped by the interactions they do have, which are largely meetings and documents. Effective technical documents shape the view of reality that leadership has, but only if they’re written well.

If you frame things right, you can make it seem as if your view is reality rather than simply your opinion. But this requires skill.

Software engineers often struggle to write effective docs. And that’s understandable, because writing effective technical docs is very difficult.

Anyone who has sat down at a computer to write a doc and stared at the blinking cursor in an empty doc knows how difficult it can be to just get started.

Even the best-written technical docs aren’t necessarily easy to read.

Poorly written docs are hard to read. However, just because a doc is hard to read, doesn’t mean it’s poorly written!

This talk is about technical writing, but technical reading is also a skill. Often, we can’t understand a paragraph in a technical document without having a good grasp of the surrounding context. But we also can’t understand the context without reading the individual paragraphs, not only of this document, but of other documents as well!

This means we often can’t understand a technical document by reading from beginning to end. We need to move back and forth between working to understand the text itself and working to understand the wider context. This pattern is known as the hermeneutic circle, and it is used in Biblical studies.

Finally, some pieces of advice on how to improve your technical writing.

Know explicitly in advance what your goal is in doing the writing. Writing to improve your own understanding is different from writing to improve someone else’s understanding, or to persuade someone else.

Make sure your technical document has concrete examples. These are the hardest to write, but they are most likely to help achieve your goals in your document.

Get feedback on your drafts from people that you trust. Even the best writers in the world benefit from having good editors.