Field notes from AI-native engineering: harness, XP, judgment
On harness engineering, XP rediscovered, and what AI is genuinely changing about how we build software.
Introduction
Most enterprise product-development and delivery machinery, however well-intentioned, has a way of converting bold visions into faster horses.1 The feature factory,2 the Victorian-industrial line of thinking about software as production rather than as the flow of information, remains the default shape of how software gets built in large organisations, and the cost of that default is precisely what a generation of writing from Martin Fowler, Kent Beck, Don Reinertsen and others has been documenting since the late 1990s. What I want to share in this piece is a synthesis of observations that have been recurring across several pieces of work I have been building over the last year, some in my professional life and some as personal projects, all of them disciplined enough to function as real artefacts rather than as throwaway demonstrations.
These are field notes from that synthesis, written in the spirit of a diary kept openly, in the complex domain where probe-sense-respond is the appropriate posture rather than analyse-and-prescribe,3 and deliberately abstracted from the specifics of any single project. The universal observations are the ones worth sharing; the particulars belong to the people I work with, not to a blog post.
The shift
Rob Bowley’s recent essay Sixty Years of Learning the Same Lesson4 makes the case, more elegantly than I will here, that each technological advance in software reduces friction just enough for the underlying truth about information flow to become visible, before the dominant organisational forms grow new defences and go back to mistaking process for progress. I take his argument as my premise and want to add what I think this current advance does differently.
Fred Brooks’s 1986 essay No Silver Bullet drew the distinction between essential and accidental complexity,5 and warned that no single development in technology or management technique would, by itself, deliver an order-of-magnitude improvement in productivity, reliability, or simplicity within a decade. He was right for nearly forty years, so the burden of proof on any “this time it is different” claim is high. My claim is therefore modest: the lesson has not changed, and we will relearn it in 2046 in some other vocabulary; but the cost of generating options has collapsed by perhaps two orders of magnitude, and that collapse has a particular consequence for the practitioner who knows what to do with it.
The consequence is that the loop tightens. What used to take a quarter takes a week; what used to take a week takes a day; what used to take a day takes a conversation. When the loop tightens to the point where the artefact arrives before the meeting, the meeting that was load-bearing for option-generation starts to misfire visibly, because that part of its load has now been lifted. The meeting still does other work — alignment, accountability, distributing risk, surfacing the people the work depends on — and the organisation that mistakes one of those loads for all of them will mistake the symptom for the cause. The AI is not doing the work; it is changing the economics of one particular activity, and the practitioner who learns to use that change with discipline finds themselves operating at a speed that the dominant organisational forms have not yet built defences against. The defences will come. They always do. The window is now.
One caveat the framing demands before it goes any further: the option-generation cost has collapsed for the practitioner, not for the enterprise. Every query is metered, every agentic workflow can spiral, and cost-per-decision is a new operating metric most finance functions are not yet equipped to read. The collapse is real and the direction of consequence is right, but the unit economics at scale are a serious open question, and an honest piece on AI-native engineering ought to name it before it gets used as a counter-argument by people who have not yet seen the practitioner-level shift.
Across the projects, the move I find myself making earliest, and the one I think matters most, is to think about the harness before the build. Writing code is now cheap; what determines whether the code compounds into something useful rather than burning brightly and leaving nothing behind is the scaffolding around the work — the documentation, the evals, the prompts versioned in files, the architecture decisions captured alongside the code that depends on them. The use of a language model in those opening conversations is not as a code generator but as a thinking partner, used in long-form conversation to surface what you already half-knew, to interrogate your assumptions, to map options against trade-offs, and to be quizzed in return rather than only to be queried. Thinking about the scaffolding before the first commit, with the model as collaborator rather than as typist, is the single highest-leverage move available to a practitioner working this way.
The paradigm we are working in is Andrej Karpathy’s. Software 1.0 was handwritten rules; Software 2.0 reframed neural-network weights as a kind of program in their own right;6 Software 3.0, his 2025 extension, adds a third layer in which natural-language prompts become the program, the model the runtime, and the developer’s role shifts from authoring rules to directing intelligence.7 His admission that he is “mostly programming in English now, a bit sheepishly”8 gives the rest of us permission to admit the same shift without feeling we are exposing weakness.
Figure 1. Karpathy’s three software paradigms, with the agent harness wrapping the model and the organisational harness wrapping the team.
Field notes
Eight observations from the practice so far, each carrying a name, an observation, a tension, and an open question. The structure is borrowed from Martin Fowler’s patterns essays;9 the hedged posture from Bowley, because I do not want to overstate what I know.
Workflows, not tasks
The observation is that the enterprise unlock is the reimagining of processes rather than the acceleration of individual activities. McKinsey’s 2025 State of AI report identifies workflow redesign as the practice most correlated with realised value in enterprise AI deployment,10 alongside the sobering counter-point that fewer than ten per cent of organisations have agents at scale in any single business function despite seventy-two per cent reporting some generative-AI use.11 The lever exists; very few are pulling it. The practical handle is the distinction Anthropic draws between workflow and agent: a workflow is “a system where LLMs and tools are orchestrated through predefined code paths”, and an agent is “a system where LLMs dynamically direct their own processes and tool usage”,12 and the consequence is that most enterprise problems benefit from workflows with agentic seasoning rather than from autonomous agents. The workflow is where the value compounds, because the workflow is what crosses team boundaries and exposes the seams.
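The workflow/agent distinction can be made concrete in a few lines. What follows is a minimal sketch, not Anthropic's implementation: `fake_model`, the ticket-handling steps, and the tool names are all hypothetical, and the model is a deterministic stub so the example runs without any provider. In the workflow the code fixes the path and the model only fills in a step; in the agent the model chooses the next tool and the code merely bounds the loop.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call; a real system would call an LLM API.
Model = Callable[[str], str]


def fake_model(prompt: str) -> str:
    """Deterministic stub so the sketch runs offline."""
    if prompt.startswith("classify:"):
        return "refund" if "refund" in prompt else "other"
    if prompt.startswith("next:"):
        # The agent-style decision: pick the next tool, or stop.
        history = prompt.split(":", 1)[1]
        if "lookup_order" not in history:
            return "lookup_order"
        if "draft_reply" not in history:
            return "draft_reply"
        return "stop"
    return ""


def run_workflow(ticket: str, model: Model) -> List[str]:
    """Workflow: predefined code path; the model only labels the ticket."""
    steps = ["classify"]
    label = model("classify: " + ticket)
    if label == "refund":  # the branch is chosen by code, not by the model
        steps.append("lookup_order")
    steps.append("draft_reply")
    return steps


def run_agent(
    ticket: str,
    model: Model,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[str]:
    """Agent: the model directs its own tool usage; code only caps the loop."""
    steps: List[str] = []
    for _ in range(max_steps):
        choice = model("next: " + ",".join(steps))
        if choice == "stop" or choice not in tools:
            break
        tools[choice](ticket)
        steps.append(choice)
    return steps


if __name__ == "__main__":
    tools = {"lookup_order": lambda t: "order found", "draft_reply": lambda t: "reply"}
    print(run_workflow("I want a refund", fake_model))
    print(run_agent("I want a refund", fake_model, tools))
```

"Workflows with agentic seasoning" then reads as `run_workflow` with one step handed to something like `run_agent`, keeping the cross-team path explicit and auditable in code.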
The tension is that workflows are organisationally harder than tasks precisely because they cross those boundaries, and often domains and functions as well. The people who own task-level metrics are not the people who own workflow-level outcomes, and the steering structures in most organisations are designed to manage tasks rather than redesign workflows. The reasonable question, asked by reasonable people, is “whose workflow is this”, and that question has no clean answer in most current operating models.
The open question is whether organisations that lack a culture of cross-team workflow ownership can adopt AI-native engineering at all, or whether they will be forced first to undertake a domain-and-capability rewiring of their own structure that is as hard as the AI work itself. The honest answer is that I do not know, and I suspect the AI adoption dashboards are systematically over-reporting progress because they measure tooling deployment and individual-task automation, not workflow rewiring. To borrow the Every editorial team’s term in After Automation,13 the dashboards are full of smuggled intelligence — the headline numbers hide the framing labour that produces them.
Three options beats a blank sheet
The observation is that AI has collapsed the cost of producing options to near zero, and the appropriate practitioner move is to bring three (or at least one) working things to a stakeholder rather than to ask them what they want. People do not know what they do not know; a working artefact is a better prompt than a blank JTBD questionnaire. When the cost of generating an option was a week of design work, the move was to ask first; when the cost is an afternoon, the move is to design three and ask the stakeholder to react against them.
The tension is twofold. The first half is that this can look presumptuous in cultures that prize consensus and stakeholder consultation; the reasonable objection is “you have not asked me what I want yet”, and the careful response is that three concrete options give the stakeholder a better starting point than a blank sheet, and the iteration begins from there. The second half, more dangerous, is that three options are only useful if they span the right neighbourhood of possibility. Three plausible-looking variants of the same wrong idea will anchor the stakeholder more efficiently than questions would, in exactly the wrong place. The discipline is therefore not just “generate three” but “generate three that disagree with each other on the underlying assumption”, and at least one elicitation round before generation is usually the way to make sure the disagreement is the right one.
The open question is whether the discipline can be taught at scale or whether it requires a kind of practitioner confidence14 more easily described than transmitted. I lean toward teachable: the cost-collapse is the lever, and once a practitioner experiences it the move follows naturally. But I am aware that XP, TDD, and BDD have been taught for twenty-five years and adoption remains patchy, so the transmission problem is probably harder than the economic argument alone suggests.
Demo is the spec
The observation is that working software at the cadence of conversation, with stakeholders reacting to artefacts in days rather than to documents in months, is a faster path to truth than any specification-led process. Don Reinertsen’s Principles of Product Development Flow establishes the underlying mechanics very clearly: the cost of delayed feedback is non-linear, queues compound, and fast feedback at small batch size produces non-linearly better outcomes than slow feedback at large batch size.15 The legacy operating model produces large batches with delayed feedback by design; the AI-native one produces small batches with fast feedback by design; the difference in compounded outcomes is exactly what Reinertsen’s queueing models predict.
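The non-linearity is easy to see in the simplest textbook queue. The sketch below assumes an M/M/1 queue (random arrivals, a single server), chosen here purely for illustration and not taken from this essay; the function name and the one-day service time are hypothetical. In that model the mean time an item spends in the system is the service time divided by one minus utilisation.

```python
def mean_time_in_system(service_time: float, utilisation: float) -> float:
    """Mean time in an M/M/1 queue: W = S / (1 - rho)."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilisation)


if __name__ == "__main__":
    # Hypothetical reviewer who needs one day per item once they pick it up.
    for rho in (0.5, 0.8, 0.9, 0.95):
        days = mean_time_in_system(service_time=1.0, utilisation=rho)
        print(f"utilisation {rho:.0%}: about {days:.1f} days until feedback")
```

Under these assumptions, loading the reviewer from 50% to 90% multiplies the wait for feedback by five, which is the sense in which queues compound.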
The tension is that demo-as-spec displaces a lot of organisational machinery that is currently load-bearing in most enterprises. Some of that apparatus (specifications as discrete pre-build phases, steering committees as gates, analysis as a separable step) starts to look mis-sized for a cost-of-options regime that has shifted under it. The underlying activities, though, do not disappear: elicitation, regulatory mapping, process modelling, data lineage, risk attestation, change-impact analysis. They migrate into the harness, into the demo cycle, into eval design. The risk in the current moment is collapsing the activity along with the apparatus, and ending up with brittle features that work for the demonstrator’s mental model and fail the broader user base. Two further qualifications matter. First, in regulated estates (financial services, healthcare, defence, pharma) the specification is not a gate but a regulatory artefact, and demo-as-spec supplies none of the audit trails that SOX, the FDA, the PRA, or the MHRA require. Second, demos systematically bias the work toward what is demoable, which under-invests in resilience, observability, security, and accessibility: qualities whose absence is felt only after the demo has been clapped at.
Figure 2. Two loops, same arc, radically different timescale.
The open question I want to name rather than smooth over is whether some of that machinery exists for reasons we have not yet rediscovered. The pattern in software has been to remove a piece of process, find out three years later that it was load-bearing for something we had forgotten about, and put a different version of it back. I expect we will do that again.
No-PoC posture
The observation is that the discipline most likely to determine whether an AI experiment becomes a foundation or a throwaway is the posture taken at the first commit. Treat the proof as production from day one: capture architectural decisions alongside the code that depends on them, run evaluations from the first agent call, write documentation as code, store prompts in files rather than string literals, refuse the dummy-data shortcut even when the tooling suggests it. Of these, evals are the single component I would most loudly recommend writing first — the eval is the specification of the behaviour you want, and an agent without an eval suite is an agent with an unfalsifiable claim to working. The experiment that takes this posture accumulates value as it runs; the one that takes the opposite burns brightly and leaves nothing behind, and its post-mortem says “we needed something more solid” when the solidity was an option from the first afternoon.
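Two of those habits, prompts in files and evals from the first agent call, fit in a page. This is a minimal sketch under stated assumptions: the file name `summarise.txt`, the `{audience}` and `{text}` placeholders, and `stub_agent` are all hypothetical, and the agent is a stub so the example runs offline. A real suite would replace the stub with the actual model call and grow the case list with every observed failure.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# One eval case: an input plus a predicate the output must satisfy.
EvalCase = Tuple[str, Callable[[str], bool]]


def load_prompt(path: Path, **values: str) -> str:
    """Read a prompt template from a versioned file and fill its placeholders.

    Uses str.format, so literal braces in the template would need doubling.
    """
    return path.read_text(encoding="utf-8").format(**values)


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of eval cases the agent passes."""
    if not cases:
        raise ValueError("an empty eval suite proves nothing")
    passed = sum(1 for text, check in cases if check(agent(text)))
    return passed / len(cases)


def stub_agent(prompt: str) -> str:
    """Hypothetical stand-in for a model call: upper-cases the first 40 chars."""
    return prompt[:40].upper()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # In a real repository this file lives under version control.
        prompt_file = Path(tmp) / "summarise.txt"
        prompt_file.write_text("Summarise for {audience}: {text}", encoding="utf-8")
        prompt = load_prompt(prompt_file, audience="auditors", text="Q3 incident")
        cases: List[EvalCase] = [
            (prompt, lambda out: out.startswith("SUMMARISE")),
            (prompt, lambda out: len(out) <= 40),
        ]
        print(f"pass rate: {run_evals(stub_agent, cases):.0%}")
```

Because the prompt lives in a file, a change to it shows up in a diff next to the eval result it affected, which is what lets the experiment accumulate value rather than start over.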
The tension is that this posture is more expensive in the first week and therefore systematically under-chosen by teams under delivery pressure. The first-week economics argue against it; the first-quarter economics argue strongly for it; and the gap between those two windows is precisely where the discipline lives or dies. The transmission question from the three-options note sits inside the same tension: whether the discipline can be taught at scale at all.
The open question is whether the discipline can be made cheap enough through tooling and templates that the cost-benefit tilts back, or whether it will remain a discipline only senior practitioners reach for. I suspect the tooling can carry more of the load than it does today, and the template for “treat the proof as production from day one” is itself a piece of harness worth investing in.
The harness pattern recurs at organisational scale
The observation I want to offer, and to flag as an extension of Böckeler’s framing16 rather than a restatement of it, is that the same pattern recurs at a different scale.
The technical harness around an AI agent is now reasonably well-described by four contributions: Vivek Trivedy’s working equation “agent equals model plus harness” and his line “if you’re not the model, you’re the harness”;17 Addy Osmani’s discipline of ratcheting each agent failure into a permanent rule and his framing that AGENTS.md should read like a pilot’s checklist;18 Birgitta Böckeler’s organisation of the harness as feedforward guides and feedback sensors across maintainability, architecture fitness, and behavioural dimensions, with the load-bearing line that “a good harness should not necessarily aim to fully eliminate human input, but to direct it to where our input is most important”; and Ben Sigelman’s most recent proposal for fitness functions and an automated natural-selection controller that closes the loop on agent-generated variants by promoting survivors back to main.19
My observation: the team using agents is itself an intelligent core that needs scaffolding to be operationally useful at the scale of a programme, and that scaffolding has the same shape as the technical harness and ratchets in similar ways, though the components differ in substance because the foundational soil is socio-technical rather than computational. Six components in the organisational ring have begun to clarify themselves through practice:
- Documentation as code — architectural decisions, prompts, skill descriptions captured as artefacts versioned with the system and readable by both humans and agents. It carries the role system prompts play in the inner harness: durable context for the choices that have been made.
- Demo-as-spec rituals — weekly demos with working artefacts, the three-options-before-the-conversation posture, the no-PoC discipline that treats every proof as production from day one. The outer-ring equivalent of the inner harness’s feedback sensors; they tell the team when the work has stopped landing, at the cadence of conversation rather than of the steering meeting.
- The practices of extreme programming — pair programming, small releases, collective ownership, test-first development, continuous integration. The outer-ring equivalent of the hooks and middleware that hold the inner harness within tolerances, and the subject of the next field note.
- Conventions for the boundary between human and agent — explicit ratification protocols for what an agent may decide autonomously and what requires a human signature, the discipline of marking the provenance of every artefact, the deliberate placement of the human at the points where judgment is needed rather than where toil can be absorbed. The outer-ring equivalent of the orchestration logic that routes work between sub-agents; it is governance for the new mode, and most enterprises have not yet built it.
- Change management as a continuing practice — the rituals by which the team and its stakeholders absorb the continuous evolution of the operating model itself, including the operating model’s evolution between this week and next. The outer-ring equivalent of the ratchet pattern Osmani describes for the inner harness; in the outer ring, each organisational misfire becomes a refinement to how the team works together. Production-incident review for agentic systems sits here too, and is more important than its current literature suggests, because agents fail in non-deterministic and often unobservable ways and the post-incident discipline for them is still nascent.
- Judgment enablement — the deliberate practice of developing, exposing, and protecting the practitioner’s capacity to make the calls AI cannot. It includes the apprenticeship of less-experienced practitioners by more-experienced ones, deliberate exposure to consequential decisions early in development, and training people in how to evaluate AI outputs rather than how to produce them. This component has no clean inner-ring equivalent, which is part of why I call this pattern recurrence rather than isomorphism: the outer ring serves both the agent system and the humans operating within it, and the inner ring cannot.
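The fourth component above, the human/agent boundary, is the one most readily expressed as code. The following is a minimal sketch of a ratification rule, assuming a hypothetical policy in which agents may merge tests, docs, and refactors on their own and anything else needs a named human signature. The artefact kinds, field names, and `may_merge` are illustrative, not a standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Author(Enum):
    HUMAN = "human"
    AGENT = "agent"


@dataclass(frozen=True)
class Artefact:
    name: str
    kind: str                          # hypothetical kinds, e.g. "test", "schema_migration"
    author: Author                     # provenance, recorded on every artefact
    ratified_by: Optional[str] = None  # human signature, if one was given


# Hypothetical policy: kinds an agent may merge without a human signature.
AUTONOMOUS_KINDS = frozenset({"test", "docs", "refactor"})


def may_merge(artefact: Artefact) -> bool:
    """Apply the ratification rule to a single artefact."""
    if artefact.author is Author.HUMAN:
        return True
    if artefact.kind in AUTONOMOUS_KINDS:
        return True
    # Agent-authored and outside the autonomous set: needs a human signature.
    return artefact.ratified_by is not None


if __name__ == "__main__":
    migration = Artefact("add_index", "schema_migration", Author.AGENT)
    print(may_merge(migration))
    signed = Artefact("add_index", "schema_migration", Author.AGENT, ratified_by="alice")
    print(may_merge(signed))
```

The point of writing the policy down this way is less the check itself than the fact that the boundary becomes a versioned artefact the team can review and ratchet, like the rest of the harness.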
The tension is that most enterprises are currently investing in tooling for the inner ring — the Copilot licences, the RAG infrastructure, the vector databases — while leaving the outer ring exactly as it was. The result is predictable: the inner ring misfires because the outer ring is wrong, the experiments produce brittle demos with a credibility problem, and the conclusion drawn from the failure is that “AI is not ready” when the AI is fine and the operating model around it is mis-sized. This pattern is not unique to AI; the socio-technical systems literature has been making versions of this claim for decades,20 and the recent Stanford Enterprise AI Playbook and the OpenAI guide to agentic governance restate it for the present moment.21 22 What I am adding is not the observation that the organisation must change beyond tooling, which is well-established, but the framing that the change has the shape of a harness and ratchets in similar ways, and that framing tells practitioners what to build next rather than just naming a gap.
Figure 3. The harness pattern at two scales. Components map by function rather than by structure.
The open question is whether the pattern-recurrence claim survives contact with a broader practitioner base than I have so far been in conversation with, and whether the six components are sufficient or whether something else has to be invented to sit alongside them. The right amount of confidence in this section is medium, not high. I am proposing an extension to a frame only weeks old, on a socio-technical substrate whose rate-of-change properties differ from the inner ring’s, based on practitioner experience that is so far my own and a small number of colleagues’. Pattern recurrence is the strongest claim the evidence currently supports; the stricter isomorphism claim I am stopping short of deliberately. The mechanisms also differ in ways worth naming: the inner harness ratchets via deterministic, repeatable artefact-and-eval cycles, while the outer ring ratchets via human meaning-making, politics, and slow trust formation, and any practitioner trying to transfer technique between rings should be alert to which mechanism they are relying on.
Boundaries dissolve, practices remain
The observation is that traditional role boundaries (the precise demarcation between this domain and that, the careful definition of who owns which step in the workflow) may begin to collapse, because the work itself increasingly crosses those boundaries and because the AI substrate makes redundant many of the handoffs that existed to manage information friction. In their place the team needs and deserves a different kind of clarity, which is the clarity of practices: pair programming, small releases, the on-site customer, collective code ownership, test-first (in other words behaviour-first) development, continuous integration, the planning game, and the others Kent Beck described twenty-five years ago in Extreme Programming Explained.23 Beck’s deepest line, that “practices by themselves are barren. Unless given purpose by values, they become rote”,24 is the anchor. AI-native engineering is, in this sense, XP’s second hearing.
The tension has three parts. First, not all handoffs existed to manage information friction; many existed for compliance, regulatory specialisation, risk distribution, and cognitive-load management, and AI dissolves none of those. Second, the barriers to adopting XP-style practices today are still mostly non-economic, and AI does not address them. Pair programming requires some form of presence — synchronous or otherwise — and the will to actually do it. Collective ownership requires management to give up command-and-control. The whole package requires high mutual trust between people who are increasingly distributed and asynchronous by default. AI makes the practices economically more attractive but does not make them organisationally more available, and the distinction matters. Third, the requests for clarity I am hearing from people three years into their careers are real and not satisfied by “we have collective ownership now”; some of them are requests that the new operating model has not yet answered, and the right response is to acknowledge the gap rather than wave it away with the word “agile”.
The open question is what we still have to invent that sits alongside XP. Career paths look different in a world of collective ownership and pair programming; specialisation looks different when the substrate is general-purpose; and we have not worked out how to recognise and reward deep technical craft inside a practice-led structure. I do not think anyone has settled answers yet. The mistake the enterprise is currently making with AI is the reverse of the XP move: it is reaching for AI tools as new practices without examining the values they are meant to serve. The five values Beck names (communication, simplicity, feedback, courage, respect) are the test against which any new practice has to be measured. An AI tool that improves typing speed but degrades the team’s communication has failed the values test, however good the metric on its dashboard.
The constraint is judgment
The observation is that in knowledge work, the constraint is judgment, and once you see it, the rest of the AI-native architecture starts to organise itself around it. Sigelman makes the same diagnosis from the engineering side in the piece I cited earlier, observing that the bottleneck has shifted from typing-speed to “needing to be correct about the next thing we’re going to build”. His vocabulary is different from mine — decision-making quality where I have written judgment — but the observation is the same: cheaper generation has not eliminated the constraint, it has relocated it. A compatible argument arrives from a different vantage point in Every’s After Automation,13 which converges on the diagnosis from consumer-startup practice and gives it a sharper structural mechanism: “the frame is not the framer” — models can move fluidly between frames, but the framing work that selects which problem matters remains human, and the gap reappears one level up even at AGI. Eliyahu Goldratt’s The Goal, and the broader Theory of Constraints, gave us the Five Focusing Steps for any system whose throughput is limited by a single binding constraint: identify the constraint, exploit it, subordinate everything else to it, elevate it, and refuse to let inertia become the new constraint.25 In knowledge work the constraint is not machinery or material; it is the quality of the human’s attention, the depth of their judgment, their capacity to weigh competing considerations and to bear the accountability for the call. Everything else is subordinate.
Figure 4. Goldratt’s Five Focusing Steps, applied to knowledge work.
The tension is twofold. First, Theory of Constraints proper requires the constraint to be measurable to be exploitable; judgment is not measurable in any rigorous way, and the Five Steps mapping is therefore more rhetorical than operational — Goldratt gives the practitioner a slogan but not a literal next action. The mapping still earns its keep as a discipline: identify the constraint as judgment; exploit it by removing toil from around it; subordinate the harness, inner and outer, to feeding clean and well-contextualised work into the human’s attention; elevate it by improving the quality of the questions and the context; and refuse to let inertia become the new constraint. Second, and more dangerous, the legitimate fear about AI in knowledge work is that it will be used not to absorb toil but to shortcut judgment itself, with the practitioner outsourcing the call rather than the preparatory work. The Brooks distinction comes back here as the discipline we need: AI is genuinely good at absorbing accidental complexity, the formatting and drafting and synthesising of well-known material, and it is not yet reliable on the essential complexity of deciding-what-matters, weighing competing considerations, holding the practitioner’s accountability. The wrong posture, the one starting to produce visible harm in some fields, is the one in which the practitioner pastes a document into a model and accepts its summary as the verdict.
The open question is whether tooling and training can be designed to keep the essential/accidental boundary visible to the practitioner in the moment, rather than relying on senior practitioner discipline alone. The corollary is worth saying explicitly: the practitioner has more responsibility in this mode, not less. The toil is absorbed, the constraint is exposed, and the human now has to spend the freed time on the part only they can do. That is harder than the old job, and also more meaningful.
The talent stratification underneath
The observation is that the practitioner who learns the harness compounds; the practitioner who does not is left doing toil that AI absorbs faster than they can upskill. This bifurcation is the single demographic consequence of AI-native engineering that an enterprise transformation programme has to plan for explicitly, and most do not: they plan for tooling rollout and an attestation that “everyone has been trained”, which is not the same thing.
The tension is that the middle-skill knowledge worker is the most exposed layer, and they are also the layer the enterprise has built around for forty years: business analysts, mid-level developers, requirements managers, project coordinators, programme managers. Their work is not disappearing — somebody still has to elicit, model, coordinate, attest — but the cost-shape of that work has shifted, and the people who do not redesign their own role around the harness will find the work re-priced underneath them. This is not a tooling problem and a six-week course does not fix it; it is a question of role identity and professional development, and enterprises that treat it as a training-budget line will be surprised by the rate at which their middle layer either leaves or stagnates.
The open question is who in the enterprise is accountable for this. HR? L&D? The engineering line? The CTO? In most organisations it falls in the gaps between, which is exactly the territory in which workforce-scale problems first go untreated. It is also the question that the AI Council literature is most reliably silent on, and the one I expect to dominate the next eighteen months of enterprise conversation, displacing the current debate about model selection.
Where this leaves us
These notes land in a conversation moving fast. Sigelman has just proposed his evolutionary-software architecture, the engineering-side discourse on LinkedIn and elsewhere is actively debating who can productively use coding agents and how, and the canonical inner-harness literature, still only weeks old in its current shape, is being extended in real time. My piece is one contribution to that conversation, not a settled position; I read those other contributions as fellow probes in the complex domain, and I hope a reader will read this one the same way.
What I am carrying forward is a continued attention to the outer ring of the harness, because that is where the most useful work for an enterprise practitioner sits and where the contemporary literature is thinnest; a continued discipline on the pattern of thinking-before-typing that opens any new piece of work; and a continued openness to being wrong about the XP turn, because the question of whether the practices are sufficient scaffolding for people without twenty years of context will determine whether AI-native engineering scales beyond senior practitioners.
I am also watching for two structural risks that any honest version of this argument has to acknowledge. The first is vendor and model lock-in: a harness tuned to one model tends to lock the team into that model and its provider ecosystem; the cost collapse that animates the whole piece is true in May 2026 but is held in place by competitive dynamics rather than physical law, and the practitioner whose harness assumes today’s prices is one pricing change away from a different conversation. The second is the workforce question from the talent-stratification note above, which I expect to displace tooling as the central topic of enterprise AI within eighteen months. Neither risk argues against the discipline I have been describing; both argue for holding it lightly enough to evolve as the substrate evolves.
What I would ask of a reader who has got this far is to tell me where this maps onto their own work and where it does not. If the pattern-recurrence claim is right, it ought to be visible in adjacent fields too; if the XP turn is right, it ought to be helping teams find their feet rather than leaving them stranded; if the constraint-is-judgment line is right, it ought to organise the whole architecture cleanly, and if it does not, then the architecture for the reader looks different and we should be able to say so.
These notes will probably be wrong in (many) places. They are written in the spirit of a complex-domain probe rather than a complicated-domain prescription. The lesson, as Bowley reminds us, is sixty years old and we have learned it before; what is different this time is the velocity at which we can test our reading of it, and the responsibility that velocity places on us to test it carefully.
Bibliography and footnotes #
Henry Ford’s apocryphal line, “if I had asked people what they wanted, they would have said faster horses”, is the conventional shorthand for the failure of consensus-led product discovery. Whether or not Ford actually said it, the shorthand has earned its keep. ↩︎
John Cutler’s term for the organisational pattern in which teams measure success by the number of features shipped rather than by outcomes delivered; first used in his 2016 piece 12 Signs You’re Working in a Feature Factory. ↩︎
David J. Snowden and Mary E. Boone, “A Leader’s Framework for Decision Making”, Harvard Business Review, November 2007. The Cynefin framework distinguishes simple, complicated, complex, chaotic, and disorder domains; in the complex domain the appropriate posture is probe-sense-respond. ↩︎
Rob Bowley, “Sixty Years of Learning the Same Lesson”, blog post, 30 January 2026. https://blog.robbowley.net/2026/01/30/sixty-years-of-learning-the-same-lesson/ ↩︎
Frederick P. Brooks Jr., “No Silver Bullet: Essence and Accidents of Software Engineering”, Computer, vol. 20, no. 4, April 1987 (originally presented 1986). https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf ↩︎
Andrej Karpathy, “Software 2.0”, Medium, November 2017. ↩︎
Andrej Karpathy, “Software Is Changing (Again)”, talk delivered at Y Combinator’s AI Startup School, June 2025, extending the 2017 framework with a third paradigm in which natural-language prompting of large language models itself becomes a form of programming. ↩︎
Karpathy, quoted in coverage of the same talk and circulated widely thereafter. ↩︎
Martin Fowler and collaborators, Emerging Patterns in Building GenAI Products, martinfowler.com, ongoing. https://martinfowler.com/articles/gen-ai-patterns/ ↩︎
McKinsey and Company, QuantumBlack, The State of AI in 2025: Agents, Innovation, and Transformation, November 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai ↩︎
Ibid. The report notes that fewer than ten per cent of organisations have agents at scale in any single business function despite seventy-two per cent reporting some generative-AI use, and identifies workflow redesign as the practice most correlated with realised value. ↩︎
Erik Schluntz and Barry Zhang, Building Effective Agents, Anthropic, December 2024. https://www.anthropic.com/research/building-effective-agents ↩︎
Every editorial, “After Automation”, Every, May 2026. https://every.to/p/after-automation. The piece’s load-bearing observations — that “the frame is not the framer”, that benchmarks are trivially zeroed by reframing, that “smuggled intelligence” hides in dashboards — converge from consumer-startup practice on a closely related diagnosis to the one I am developing here. ↩︎ ↩︎
Arrogance? ↩︎
Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development (Celeritas Publishing, 2009). ↩︎
Birgitta Böckeler, “Harness engineering for coding agent users”, martinfowler.com, 2 April 2026. https://martinfowler.com/articles/harness-engineering.html ↩︎
Vivek Trivedy, “The Anatomy of an Agent Harness”, LangChain blog, 10 March 2026. https://www.langchain.com/blog/the-anatomy-of-an-agent-harness ↩︎
Addy Osmani, “Agent Harness Engineering”, 19 April 2026. https://addyosmani.com/blog/agent-harness-engineering/ ↩︎
Ben Sigelman, “Natural Selection, but in Production: Envisioning Evolutionary Software Development”, LinkedIn Pulse, 21 May 2026. https://www.linkedin.com/pulse/natural-selection-production-envisioning-evolutionary-ben-sigelman-paplf/ ↩︎
The socio-technical systems tradition, originating with the Tavistock Institute studies of the 1950s and developed through the computer-supported cooperative-work literature since, has consistently held that the design of work cannot be separated from the design of the technology that mediates it. The recent arXiv working paper Socio-technical Aspects of Agentic AI (2026) restates the position for the present moment. https://arxiv.org/pdf/2601.06064 ↩︎
Pereira, Graylin, and Brynjolfsson, The Enterprise AI Playbook: Lessons from 51 Successful Deployments, Stanford Digital Economy Lab, March 2026. https://digitaleconomy.stanford.edu/app/uploads/2026/03/EnterpriseAIPlaybook_PereiraGraylinBrynjolfsson.pdf ↩︎
OpenAI, Building Governed AI Agents: A Practical Guide to Agentic Scaffolding, OpenAI Cookbook, 2026. https://developers.openai.com/cookbook/examples/partners/agentic_governance_guide/agentic_governance_cookbook ↩︎
Kent Beck, Extreme Programming Explained: Embrace Change (Addison-Wesley, 1999); 2nd edition with Cynthia Andres (Addison-Wesley, 2004). ↩︎
Beck and Andres, Extreme Programming Explained: Embrace Change, 2nd edition, op. cit. ↩︎
Eliyahu M. Goldratt and Jeff Cox, The Goal: A Process of Ongoing Improvement (North River Press, 1984); the Five Focusing Steps are codified in Goldratt, What Is This Thing Called Theory of Constraints and How Should It Be Implemented? (North River Press, 1990). ↩︎