I can’t bring myself to read text if I believe it to be AI-generated.
Now, I ask LLMs questions all of the time, and I do read those answers. I frequently use tools like ChatGPT and Claude as replacements for Google for answering specific questions; that’s not what I’m talking about here. I’m also not talking about reading LLM-generated code. What I mean is, if I’m reading some sort of a document, if I suspect that the document was AI-generated, my motivation to read through it drops down to approximately zero. If I was browsing non-fiction books in a bookstore, and a book was marked as having been AI-generated, I wouldn’t pick it up.
Being honest with myself, I think this point of view is irrational. My personal primary goal for reading any sort of non-fiction document is to advance my understanding of a topic, or put new ideas into my head. In principle, it shouldn’t matter whether the words on the page emerged from the thoughts of a human being or via an autoregressive stochastic process. In addition, I’m very far from being a perfect detector of AI-generated text, so there isn’t even a way for me to know whether a particular document I’m reading came from a human or a machine. Also, AI generation is a spectrum. No document is completely AI-generated: they all start with a prompt that was written by a human. Some texts will have been iteratively generated by a collaboration of human and AI. If I knew that an author had used AI as a copyeditor, or to tighten up some of their sentences, that wouldn’t bother me at all. There’s not some magical threshold in my head about how much AI assistance I would consider to be OK, nor would it ever be possible for me to know whether that threshold was exceeded unless the author explicitly told me.
And yet, despite knowing this, I’m just turned off by reading anything that strikes me as being AI-generated. If I’m asked to read a design document, and I suspect the doc was written by AI, I need to fight myself to actually get through it. I feel like a writer should always spend more time generating a document than a reader should spend consuming it, and asking me to spend more time on understanding something that someone else didn’t put the effort into writing feels like a violation of an implicit contract.
As I said, I think this is an irrational response. And I expect the quality of LLM writing to continue to improve over time, so that we stop referring to it as slop. But there’s just something, well, soulless about the idea of writing generated by a machine.
This past week at SREcon 2026 Americas, I gave a plenary talk titled The Power of Stories. I referenced several books and papers in that talk, which are linked below.
I was listening to Todd Conklin’s Pre-Accident Investigation Podcast the other day, to the episode titled When Normal Variability Breaks: The ReDonda Story. The name ReDonda in the title refers to ReDonda Vaught, an American registered nurse. In 2017, she was working at the Vanderbilt University Medical Center in Nashville when she unintentionally administered the wrong drug to a patient under her care, a patient who later died. Vaught was fired, then convicted by the state of Tennessee of criminally negligent homicide and abuse of an impaired adult. It’s a terrifying story, really a modern tale of witch-burning, but it’s not what this post is about. Instead, I want to home in on a term from the podcast title: normal variability.
In the context of the field of safety, the term variability refers to how human performance is, well, variable. We don’t always do the work the exact same way. This variation happens between humans, where different people will do work in different ways. And the variation also happens within humans: the same person will perform a task differently over time. The sources of variation in human performance are themselves varied: level of experience, external pressures being faced by the person, number of hours of sleep the night before, and so on.
In the old view of safety, there is an explicitly safe way to perform the work, as specified in documented procedures. Follow the procedures, and incidents won’t happen. In the software world, these procedures might be: write unit tests for new code, have the change reviewed by a peer, run end-to-end tests in staging, and so on. Under this view of the world, variability is necessarily a bad thing. Since variability means people do work differently, and since safety requires doing work the prescribed way, human variability is a source of incidents. Traditional automation doesn’t have this variability problem: it always does the work the same way. Hence you get the old joke:
The factory of the future will have only two employees: a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.
In the new view of safety, normal variability is viewed as an asset rather than a liability. In this view, the documented procedures for doing the work are always inadequate: they can never capture all of the messy details of real work. It is the human ability to adapt, to change the way that they do the work based on circumstances, that creates safety. That’s why you’ll hear resilience engineering folks use the (positive) term adaptive capacity rather than the (more neutral) human variability, to emphasize that human variability is, quite literally, adaptive. This is why tech companies still staff on-call rotations even though they have complex automation that is supposed to keep things up and running. It’s because the automation can never handle all of the cases that the universe will throw at it. Even sophisticated automation always eventually proves too rigid to be able to handle some particular circumstance that was never foreseen by the designers. This is the perfect-storm, weird-edge-case stuff that post-incident write-ups are made of.
This, again, brings us back to AI.
My own field of software development is being roiled by the adoption of AI-based coding tools like Anthropic’s Claude Code, OpenAI’s Codex, and Google’s Gemini Code Assist. These AI tools are rapidly changing the way that software is being developed, and you can read many blog posts of early adopters who are describing their experiences using these new tools. Just this week, there was a big drop in the market value of multiple software companies; I’ve already seen references to the beginning of the SaaS-Pocalypse, the idea being that companies will write bespoke tools using AI rather than purchasing software from vendors. The field of software development has seen a lot of change in terms of tooling in my own career, but one thing that is genuinely different about these AI-based tools is that they are inherently non-deterministic. You interact with these tools by prompting them, but the same prompt yields different results.
Non-determinism in software development tools is seen as a bad thing. The classic example of non-determinism-as-bad is flaky tests. A flaky test is non-deterministic: the same input may lead to a pass or a fail. Nobody wants non-determinism like this in our test suite. On the build side of things, we hope that our compiler emits the same instructions given the same source file and arguments. There’s even a whole movement around reproducible builds, the goal of which is to stamp out all of the non-determinism in the process of producing binaries from the original source code, where the ideal is achieving bit-for-bit identical binaries. Unsurprisingly, then, the non-determinism of the current breed of AI coding tools is seen as a problem. Here’s a quote from a recent article in the Wall Street Journal by Chip Cutter and Sebastian Herrera: Here’s Where AI Is Tearing Through Corporate America:
Satheesh Ravala is chief technology officer of Candescent, which makes digital technology used by banks and credit unions. He has fielded questions from employees about what innovations like Anthropic’s new features mean for the company, and responded by telling them banks rely on the company for software that does exactly what it’s supposed to every time—something AI struggles with.
“If I want to transfer $10,” he said, “it better be $10 not $9.99.”
I believe the AI coding tools are only going to improve with time, though I don’t feel confident in predicting whether future improvements will be orders-of-magnitude or merely incremental. What I do feel confident in predicting is that the non-determinism in these tools isn’t going away.
At their heart, these tools are sophisticated statistical models: they are prediction machines. When you’re chatting with one, it is predicting the next word to say, and then it feeds back the entire conversation so far, predicts the next word to say again, and so on. Because they are statistical models, there is some probability distribution of next word to predict. You could build the system to always choose the most likely word to say next. Statistical models aren’t just an AI thing, and many statistical models do use such a maximum likelihood approach. But that’s not what LLMs do in general. Instead, there’s some randomness that is intentionally injected into the system so that it doesn’t always just pick the most likely next word, but instead does a biased random selection of the next word, based on the statistical model of what’s most likely to come next, and based on a parameter called temperature, drawing an analogy to physics. If the temperature is zero, then the system always outputs the most likely next word. The higher the temperature, the more random the selection is.
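To make the temperature idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the three-word vocabulary, the scores, and the function name sample_next_word are my own assumptions, and real LLMs work over tokens rather than words, with vocabularies far larger than this. It shows only the general technique of dividing scores by a temperature before doing a weighted random pick, not how any particular vendor implements it.

```python
import math
import random


def sample_next_word(scores, temperature, rng):
    """Pick the next word from a dict mapping word -> model score.

    `scores` stands in for the model's raw output; the values used
    below are invented for illustration.
    """
    if temperature == 0:
        # Temperature zero: always take the single most likely word.
        return max(scores, key=scores.get)

    # Divide each score by the temperature. A higher temperature
    # flattens the differences between words, so the pick is more random.
    scaled = {word: score / temperature for word, score in scores.items()}

    # Softmax: turn scaled scores into positive weights.
    # Subtracting the maximum keeps math.exp from overflowing.
    top = max(scaled.values())
    weights = {word: math.exp(s - top) for word, s in scaled.items()}

    # Biased random selection: likelier words are picked more often.
    words = list(weights)
    return rng.choices(words, weights=[weights[w] for w in words], k=1)[0]


# Hypothetical scores for the word that follows "I adopted a ..."
scores = {"cat": 2.0, "dog": 1.5, "zebra": 0.1}
rng = random.Random(0)

greedy = [sample_next_word(scores, 0, rng) for _ in range(5)]
sampled = [sample_next_word(scores, 1.0, rng) for _ in range(1000)]

print(greedy)
print({word: sampled.count(word) for word in scores})
```

With the temperature at zero, every call returns the same word. With the temperature at 1.0, the same input yields different words across calls, with the likelier words showing up more often.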
What’s fascinating to me about this is that the deliberate injection of randomness improved the output of the models, as judged qualitatively by humans. In other words, increasing the variability of the system improved outcomes.
Now, these LLMs haven’t achieved the level of adaptability that humans possess, though they can certainly perform some impressive cognitive tasks. I wouldn’t say they have adaptive capacity, and I firmly believe that humans will still need to be on-call for software systems for the remainder of my career, despite the proliferation of AI SRE solutions. But what I am saying instead is that the ability of LLMs to perform cognitive tasks well depends upon them being able to leverage variability. And my prediction is that this dependence on variability isn’t going to go away. LLMs will get better, and they might even get much better, but I don’t think they’ll ever be deterministic. I think variability is an essential ingredient for a system to be able to perform these sorts of complex cognitive tasks.
Here’s another blog post on gathering some common threads from reading recent posts. Today’s topic is about the unassuming nature of talented software engineers.
The first thread was a tweet by Mitchell Hashimoto about how his best former colleagues are ones where you would have no signal about their skills based on their online activities or their working hours.
One of the most impressive people I've ever worked with was a guy who spent a decade prior working on the same team at the same company iterating on a kernel driver for a single specific network card. He clocked in at 9 and out at 5. Predictable promotions. Nothing crazy.
The second thread was a blog post written a week later by Nikunj Kothari titled The Quiet Ones: Working within the seams. In this post, Kothari wasn’t writing about a specific engineer per se, but rather a type of engineer, one whose contributions aren’t captured by the organization’s performance rubric (emphasis mine):
They don’t hit your L5 requirements because they’re doing L3 and L7 work simultaneously. Fixing the deploy pipeline while mentoring juniors. Answering customer emails while rebuilding core systems. They can’t be ranked because they do what nobody thought to measure.
One of the best staff-level engineers I worked with is on the market. … What you need to know about this person: every team he’s ever worked on, he did standout work, in every situation. He got stuff done with high quality, helped others, is not argumentative but is firm in holding up common sense and practicality, and is very curious and humble to top all of this off. … And still, from the outside, this engineer is near completely invisible.
He has no social media footprint. His LinkedIn lists his companies he worked at, and nothing else: no technologies, no projects, nothing. His GitHub is empty for the last 5 years, and has perhaps a dozen commits throughout the last 10.
The reason that Mitchell Hashimoto, Nikunj Kothari, and Gergely Orosz were able to identify these talented colleagues is that they worked directly with them. People making hiring decisions don’t have that luxury. For promotions, there are organizational constraints that push organizations to define a formal process with explicit criteria.
For both hiring and promotion, decision-makers have a legibility problem. This problem will inevitably lead to a focus on details that are easier to observe directly precisely because they are easier to observe directly. This is how fields like graphology and phrenology come about. But just because we can directly observe someone’s handwriting or the shapes of the bumps on their head doesn’t mean that those are effective techniques for learning something about that person’s personality.
I think it’s unlikely the industry will get much better at identifying and evaluating candidates anytime soon. And so I’m sure we’ll continue to see posts about the importance of your LinkedIn profile, or your GitHub, or your passion project. But you neglect at your peril the engineers who are working nine-to-five days at boring companies.
There are software technologies that work really well in-the-small, but they don’t scale up well. The challenge here is that the problem size grows incrementally, and migrating off of them requires significant effort, and so locally it makes sense to keep using them, but then you reach a point where you’re well into the size where they are a liability rather than an asset. Here are some examples.
Shell scripts
Shell scripts are fantastic in the small: throughout my career, I’ve written hundreds and hundreds of bash scripts that are twenty lines or less, typically closer to ten, frequently fewer than even five lines. But, as soon as I need to write an if statement, that’s a sign to me that I should probably write it in something like Python instead. Fortunately, I’ve rarely encountered large shell scripts in the wild these days, with DevStack being a notable exception.
Makefiles
I love using makefiles as simple task runners. In fact, I regularly use just, which is like an even simpler version of make, and has similar syntax. And I’ve seen makefiles used to good effect for building simple Go programs.
But there’s a reason technologies like Maven, Gradle, and Bazel emerged, and it’s because large-scale makefiles are an absolute nightmare. Someone even wrote a paper called Recursive Make Considered Harmful.
YAML
I’m not a YAML hater, I actually like it for configuration files that are reasonably sized, where “reasonably sized” means something like “30 lines or fewer”. I appreciate support for things like comments and not having to quote strings.
However, given how much of software operations runs on YAML these days, I’ve been burned too many times by having to edit very large YAML files. What’s human-readable in the small isn’t human-readable in the large.
Spreadsheets
The business world runs on spreadsheets: they are the biggest end-user programming success story in human history. Unfortunately, spreadsheets sometimes evolve into being de facto databases, which is terrifying. The leap required to move from using a spreadsheet as your system of record to a database is huge, which explains why this happens so often.
The late science fiction author Arthur C. Clarke had a great line: Any sufficiently advanced technology is indistinguishable from magic. (This line inspired the related observation: any sufficiently advanced technology is indistinguishable from a rigged demo). Clarke was referring to scenarios where members of a civilization encounter technology developed by a different civilization. The Star Trek: The Next Generation episode titled Who Watches The Watchers is an example of this phenomenon in action. The Federation is surreptitiously observing the Mintakans, a pre-industrial alien society, when Federation scientists accidentally reveal themselves to the Mintakans. When the Mintakans witness Federation technology in action, they come to the conclusion that Captain Picard is a god.
LLMs are the first time I’ve encountered a technology that was developed by my own society where I felt like it was magic. Not magical in the “can do amazing things” sense, but magical in the “I have no idea how it even works” sense. Now, there’s plenty of technology that I interact with on a day-to-day basis that I don’t really understand in any meaningful sense. And I don’t just mean sophisticated technologies like, say, cellular phones. Heck, I’d be hard pressed to explain to you precisely how a zipper works. But existing technology feels in principle understandable to me, that if I was willing to put in the effort, I could learn how it works.
But LLMs are different, in the sense that nobody understands how they work, not even the engineers who designed them. Consider the human brain as an analogy for a moment: at some level, we understand how the human brain works, how it’s a collection of interconnected neuron cells arranged in various structures. We have pretty good models of how individual neurons behave. But if I asked you “how is the concept of the number two encoded in a human brain?”, nobody today could give a satisfactory answer to that. It has to be represented in there somehow, but we don’t quite know how.
Similarly, at the implementation level, we do understand how LLMs work: how words are encoded as vectors, how they are trained using data to do token prediction, and so on. But these LLMs perform cognitive tasks, and we don’t really understand how they do that via token prediction. Consider this blog post from Anthropic from last month: Tracing the thoughts of a large language model. It talks about two research papers published by Anthropic where they are trying to understand how Claude (which they built!) performs certain cognitive tasks. They are trying to essentially reverse-engineer a system that they themselves built! Or, to use the analogy they use explicitly in the post, they are doing AI biology: they are approaching the problem of how Claude performs certain tasks the way that a biologist would approach the problem of how a particular organism performs a certain function.
Now, engineering researchers routinely study the properties of new technologies that humans have developed. For example, engineering researchers had to study the properties of solid-state devices like transistors; they didn’t know what those properties were just because they created them. But that’s different from the sort of reverse engineering kind of research that the Anthropic engineers are doing. We’ve built something to perform a very broad set of tasks, and it works (for various values of “works”), but we don’t quite know how. I can tell you exactly how a computer encodes the number two in either integer form (using two’s complement encoding) or in floating point form (using IEEE 754 encoding). But, just as I could not tell you how the human brain encodes the number two as a concept, I could not tell you how Claude encodes the number two as a concept. I don’t even know if “concept” is a meaningful, well, concept, for LLMs.
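For contrast with the "concept" question, here is a small Python sketch of the part we do understand: the bit patterns a computer uses for the number two. The helper names twos_complement_bits and float32_bits are my own, and the 8-bit integer width and 32-bit (single-precision) float are choices I made for readability; real machines typically use wider integers.

```python
import struct


def twos_complement_bits(value, width=8):
    """Return the two's complement bit pattern of an integer as a string."""
    # Masking to `width` bits maps negative numbers onto their
    # two's complement representation.
    mask = (1 << width) - 1
    return format(value & mask, f"0{width}b")


def float32_bits(value):
    """Return the IEEE 754 single-precision bit pattern of a float."""
    # Pack as a big-endian 32-bit float, then reinterpret those same
    # four bytes as an unsigned 32-bit integer to read off the bits.
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    return format(as_int, "032b")


print(twos_complement_bits(2))   # 00000010
print(twos_complement_bits(-2))  # 11111110
print(float32_bits(2.0))         # 01000000000000000000000000000000
```

In the floating point pattern, the first bit is the sign, the next eight bits are the exponent (128, which is 1 after removing the bias of 127), and the remaining 23 bits are the fraction, all zero here. Every bit is accounted for, which is exactly what we can't say about how an LLM represents two.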
There are two researchers who have won both the Turing Award and the Nobel Prize. The most recent winner is Geoffrey Hinton, who did foundational work in artificial neural networks, which eventually led to today’s LLMs. The other dual winner was also an AI researcher: Herbert Simon. Simon wrote a book called The Sciences of the Artificial, about how we should study artificial phenomena.
And LLMs are certainly artificial. We can argue philosophically about whether concepts in mathematics (e.g., the differential calculus) or theoretical computer science (e.g., the lambda calculus) are invented or discovered. But LLMs are clearly a human artifact; I don’t think anybody would argue that we “discovered” them. LLMs are a kind of black-box model of human natural language. We examine just the output of humans in the form of written language, and try to build a statistical model of it. Model here is a funny word. We generally think of models as a simplified view of reality that we can reason about: that’s certainly how scientists use models. But an LLM isn’t that kind of model. In fact, their behavior is so complex that we have to build models of the model in order to do the work of trying to understand it. Or as the authors of one of the Anthropic papers put it in On the Biology of a Large Language Model: Our methods study the model indirectly using a more interpretable “replacement model,” which incompletely and imperfectly captures the original.
As far as I’m aware, we’ve never had to do this sort of thing before. We’ve never engineered systems in such a way that we don’t fundamentally understand how they work. Yes, our engineered world contains many complex systems where nobody really understands how the entire system works, I write about that frequently in this blog. But I claim that this sort of non-understanding of LLMs on our part is different in kind from our non-understanding of complex systems.
Unfortunately, the economics of AI obscures the weirdness of the technology. There’s a huge amount of AI hype going on as VCs pour money into AI-based companies, and there’s discussion of using AI to replace humans for certain types of cognitive work. These trends, along with the large power consumption required by these AI models, have, unsurprisingly, triggered a backlash. I’m looking forward to the end of the AI hype cycle, where we all stop talking about AI so damned much, when it finally settles in to whatever the equilibrium ends up being.
But I think it’s a mistake to write off this technology as just a statistical model of text. I think the word “just” is doing too much heavy lifting in that sentence. Our intuitions break down when we encounter systems beyond the scales of everyday human life, and LLMs are an example of that. It’s like saying “humans are just a soup of organic chemistry” (c.f. Terry Bisson’s short story They’re Made out of Meat). Intuitively, it doesn’t seem possible that evolution by natural selection would lead to conscious beings. But, somehow we humans are an emergent property of long chains of amino acids recombining, randomly changing, reproducing, and being filtered out by nature. The scale of evolution is so unimaginably long that our intuition of what evolution can do breaks down: we probably wouldn’t believe that such a thing was even possible if the evidence in support of it wasn’t so damn overwhelming. It’s worth noting here that one of the alternative approaches to AI was inspired by evolution by natural selection: genetic algorithms. However, this approach has proven much less effective than artificial neural networks. We’ve been playing with artificial neural networks on computers since the 1950s, and once we scaled up those artificial neural networks with large enough training sets and a large enough set of parameters, and we hit upon effective architectures, we achieved qualitatively different results.
Here’s another example of how our intuitions break down at scales outside of our immediate experience, this one borrowed from the philosophers Paul and Patricia Churchland in their criticism of John Searle’s Chinese Room argument. The Churchlands ask us to imagine a critic of James Clerk Maxwell’s electromagnetic theory taking a magnet, shaking it back and forth, seeing no light emerge from the shaken magnet, and concluding that Maxwell’s theory is incorrect. Understanding the nature of light is particularly challenging for us humans: because it behaves at scales outside of the typical human ones, our intuitions are a hindrance rather than a help.
Just look at this post by Simon Willison about Claude’s system prompt. Ten years ago, if you had told me that a software company was configuring the behavior of their system with a natural language prompt, I would have laughed at you and told you, “that’s not how computers work.” We don’t configure conventional software by guiding it with English sentences and hoping that pushes it in a direction that results in more desirable outcomes. This is much closer to Isaac Asimov’s Three Laws of Robotics than it is to setting fields in a YAML file. According to my own intuitions, telling a computer in English how to behave shouldn’t work at all. And yet, here we are. It’s like the old joke about the dancing bear: it’s not that it dances well, but that it dances at all. I am astonished by this technology.
So, while I’m skeptical of the AI hype, I’m also skeptical of the critics that dismiss the AI technology too quickly. I think we just don’t understand this new technology well enough to know what it is actually capable of. We don’t know whether changes in LLM architecture will lead to only incremental improvements or could give us another order of magnitude. And we certainly don’t know what’s going to happen when people attempt to leverage the capabilities of this new technology.
The only thing I’m comfortable predicting is that we’re going to be surprised.
Postscript: I don’t use LLMs for generating the texts in my blog posts, because I use these posts specifically to clarify my own thinking. I’d be willing to use it as a copy-editor, but so far I’ve been unimpressed with WordPress’s “AI assistant: show issues & suggestions” feature. Hopefully that gets better over time.
I do find LLMs to often give me better results than search engines like Google or DuckDuckGo, but it’s still hit or miss.
For doing some of the research for this post, Claude was great at identifying the episode of Star Trek I was thinking of:
But it failed to initially identify either Herb Simon or Geoffrey Hinton as dual Nobel/Turing winners:
If I explicitly prompted Claude about the winners, it returned details about them.
Claude was also not useful at helping me identify the “shaking the magnet” critique of Searle’s Chinese Room. I originally thought that it came from the late philosopher Daniel Dennett (who was horrified at how LLMs can fool people into believing they are human). It turns out the critique came from the Churchlands, but Claude couldn’t figure that out; I ultimately found it through a DuckDuckGo search.
I don’t know anything about your organization, dear reader, but I’m willing to bet that the amount of time and attention your organization spends on post-incident work is a function of the severity of the incidents. That is, your org will spend more post-incident effort on a SEV0 incident compared to a SEV1, which in turn will get more effort than a SEV2 incident, and so on.
This is a rational strategy if post-incident effort could retroactively prevent an incident. SEV0s are worse than SEV1s by definition, so if we could prevent that SEV0 from happening by spending effort after it happens, then we should do so. But no amount of post-incident effort will change the past and stop the incident from happening. So that can’t be what’s actually happening.
Instead, this behavior means that people are making an assumption about the relationship between past and future incidents, one that nobody ever says out loud but everyone implicitly subscribes to. The assumption is that post-incident effort for higher severity incidents is likely to have a greater impact on future availability than post-incident effort for lower severity incidents. In other words, an engineering-hour of SEV1 post-incident work is more likely to improve future availability than an engineering-hour of SEV2 post-incident work. Improvement in future availability refers to either prevention of future incidents, or reduction of the impact of future incidents (e.g., reduction in blast radius, quicker detection, quicker mitigation).
Now, the idea that post-incident work from higher-severity incidents has greater impact than post-incident work from lower-severity incidents is a reasonable theory, as far as theories go. But I don’t believe the empirical data actually supports this theory. I’ve written before about examples of high severity incidents that were not preceded by related high-severity incidents. My claim is that if you look at your highest severity incidents, you’ll find that they generally don’t resemble your previous high-severity incidents. Now, I’m in the no root cause camp, so I believe that each incident is due to a collection of factors that happened to interact.
But don’t take my word for it, take a look at your own incident data. When you have your next high-severity incident, take a look at N high-severity incidents that preceded it (say, N=3), and think about how useful the post-incident work of those previous incidents actually was in helping you to deal with the one that just happened. That earlier post-incident work clearly didn’t prevent this incident. Which of the action items, if any, helped with mitigating this incident? Why or why not? Did those other incidents teach you anything about this incident, or was this one just completely different from those? On the other hand, were there sources of information other than high-severity incidents that could have provided insights?
I think we’re all aligned that the goal of post-incident work should be to reduce the risks associated with future incidents. But the idea that the highest ROI for risk reduction work is in the highest severity incidents is not a fact; it’s a hypothesis that simply isn’t supported by data. There are many potential channels for gathering signals of risk, and some of them come from lower severity incidents, and some of them come from data sources other than incidents. Our attention budget is finite, so we need to be judicious about where we spend our time investigating signals. We need to figure out which threads to pull on that will reveal the most insights. But the proposition that the severity of an incident is a proxy for the signal quality of future risk is like the proposition that heavier objects fall faster than lighter ones. It’s intuitively obvious; it just so happens to also be false.