The Bitter Lesson of Agentic Coding
Audience: Software engineers building toward autonomous coding loops.
Reading time: ~20 minutes.
In 2019, Rich Sutton published “The Bitter Lesson,” a short essay that became one of the most cited arguments in AI research. His claim: across 70 years of AI and machine learning, general methods that leverage computation, specifically search and learning, consistently beat approaches built on hand-engineered human knowledge. The pattern repeated across chess, Go, vision, and speech: researchers who invested in domain-specific structure were eventually overtaken by those who invested in search and learning at scale. The lesson is bitter because the domain knowledge feels like it should matter. It does matter, briefly, and then computation catches up and passes it.
There is a new bitter lesson, and this one is for software engineers.
The original bitter lesson told ML researchers: your hand-crafted features and domain heuristics will be crushed by learned representations trained on enough data. The new bitter lesson tells us: your hand-crafted implementations, your carefully engineered code, your hard-earned ability to write precise solutions, will increasingly be less valuable than simply defining what you want and letting the model figure out how to build it. The bitterness comes from the same place in both cases: the skill you spent years developing is the exact skill you need to let go of. For ML researchers, it was feature engineering. For software engineers, it is writing code.
Nate B Jones’ video “Claude Mythos Changes Everything” prompted me to revisit Sutton’s lesson in the context of agentic coding. Others have applied it to LLM scaffolding (Lance Martin at LangChain, Daniel Miessler’s “Bitter Lesson Engineering”), arguing that over-engineered frameworks become liabilities as models improve. That is true and important.
But the deeper bitter lesson is not about frameworks. It is about us. The instinct to over-specify, to write the implementation ourselves, to control every step, is not just an architectural mistake. It is a professional reflex that actively prevents us from getting the best results from the tools we now have.
This does not mean “no structure.” It means the right kind of structure. The bitter lesson for agentic coding is: invest in verification and goal-setting, not in implementation control. Define what done looks like. Let the model figure out how to get there. Build the simplest harness that lets you check its work.
I have been building a system around this idea for the past several months, and I think it represents a durable approach to autonomous coding, one that will survive multiple generations of model improvement. This post explains the philosophy. My implementation is open source as zat.env, but the concepts matter more than my particular version.
Two Papers That Changed How I Think
Two Anthropic engineering posts form the intellectual foundation for how I approach agentic coding. They are worth reading in full.
Nicholas Carlini’s “Building a C Compiler with a Team of Parallel Claudes” (February 2026) demonstrated that 16 parallel Claude instances, running in a loop, could produce a 100,000-line C compiler in Rust. The compiler passes 99% of GCC’s torture test suite and successfully compiles Linux on x86, ARM, and RISC-V. Nearly 2,000 sessions. Two weeks. No human-written code.
The achievement itself is a marvel, but the detail that made it real for me was the price tag: approximately $20,000. Not “unknowable resources from a major lab.” Twenty thousand dollars. A number you can put in a budget. A number that, given how fast inference costs are dropping, will be dramatically lower within a year or two. When an achievement like this has a concrete cost, you can reason about it economically, and that changes everything about how you plan.
Carlini’s core insight, which I think of as the Carlini Principle: the quality of your verification loop determines the ceiling of your agent’s output. Not the quality of your prompts, not the sophistication of your orchestration. The test suites. The review mechanisms. The feedback signals.
“Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.”
– Nicholas Carlini
He spent his time building test harnesses, not engineering prompts. That priority ordering is the whole lesson.
Prithvi Rajasekaran’s “Harness Design for Long-Running Application Development” (March 2026) formalized what Carlini demonstrated intuitively. Rajasekaran showed that separating generation from evaluation, even when the same model does both, produces measurably better results. His generator-evaluator architecture is modeled on GANs: the generator proposes, the evaluator critiques, and the tension between them drives quality upward.
One finding stood out:
“When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work, even when, to a human observer, the quality is obviously mediocre.”
– Prithvi Rajasekaran
This is why self-review does not work. You need structural separation. A dedicated evaluator, tuned to be skeptical, with concrete criteria to check against.
The objection writes itself: if you cannot trust the model to implement correctly, why trust it to evaluate? Because it is the structural separation that prevents the bias, not a smarter judge. An evaluator in a fresh context, checking against concrete spec criteria, does not exhibit the same self-congratulatory pattern. The human’s job is writing those criteria.
Rajasekaran also introduced the concept of “context anxiety,” where models prematurely wrap up work as the context window fills, a failure mode that explains a lot of the half-finished output people complain about in long coding sessions.
The fix is not bigger context windows. It is architecture that assumes context will be cleared. If your system writes progress to disk, carries structured handoff artifacts between sessions, and treats each session as a fresh start with full access to prior state, then context anxiety stops being a failure mode and becomes a design constraint you have already solved. Memory and file-based continuity are the answer, not longer conversations. I describe how this works concretely in the Turns section below.
Both papers converge on the same principle from different angles: invest in the checking, not the doing. This maps directly to how I think about building agentic coding systems.
Spec-Driven Development: The Control Mechanism
The term “spec-driven development” has become established in the community. Thoughtworks added it to their Technology Radar, a widely followed industry assessment of emerging practices, in 2025, calling it a key new AI-assisted engineering practice. Amazon built an entire IDE around it (Kiro). GitHub open-sourced a Spec Kit. There is now academic literature defining levels of spec rigor. I arrived at the same place independently. The approach is not new. What is new is the role the spec plays in agentic coding: it is not just a development methodology, it is the control mechanism. The spec is not documentation. It is the verification contract.
The problem SDD solves is drift. Agents without concrete acceptance criteria optimize for making tests pass rather than solving the problem. “Works but not good enough” stays vague indefinitely. An agent will happily generate code that compiles, passes type checks, and satisfies a test suite, while completely missing the actual goal. I have watched this happen repeatedly on complex projects.
This is still a spec in the traditional sense, a document that defines requirements, but it is tuned for a specific purpose: giving an autonomous agent and its review loop a concrete target to verify against. A well-written acceptance criterion is worth more than a well-written prompt, because it tells both the agent and the review loop what to verify. When I write a spec, I define what done looks like in concrete, checkable terms. The spec sits upstream of everything else: code review checks spec alignment, the test strategy checks criteria coverage, architecture review evaluates whether the design serves the spec’s goals. Without the spec, the rest of the system has nothing to anchor to.
This is the Carlini Principle applied at the task level. The spec is the “nearly perfect task verifier” that tells the agent what problem to solve. The review loop is the mechanism that checks whether it solved it.
Turns, Convergence, and the Autonomous Loop
I use the term turn to describe one complete pass through the spec-implement-evaluate cycle. A turn starts with a spec (or an inherited proposal from the previous turn), proceeds through implementation and review, and ends with a retrospective and a proposal for the next turn. The proposal is written to disk, so a fresh agent session can pick it up without depending on conversation memory.
This loop works at every scale. At the small end, a single engineer runs a few dozen turns on a feature, with a human stepping into the EVALUATE or PROPOSE phase whenever judgment is needed. The cycle is tight: implement, review, adjust, repeat, with human feedback keeping the loop honest. At the large end, the same structure scales to fleets of parallel agents pulling from a shared spec and a shared set of PRs. Carlini’s compiler is this loop: 16 agents, 2,000 sessions, each one a turn through spec-implement-evaluate, with the GCC torture suite as the verification contract. The loop did not change. What changed was the number of agents running it, and the fact that a comprehensive enough spec and test suite could stand in for human judgment in the EVALUATE and PROPOSE phases. That is what full autonomy looks like: not a different loop, but one where the verification contract is strong enough that humans do not need to be in it.
One of the earliest analogs in the community is Geoffrey Huntley’s “Ralph Wiggum Loop,” a pattern where each iteration picks a task, implements, validates, commits if passing, then resets context. Progress lives in files and git, not in the model’s context window. Huntley’s insight about stateless iteration with file-based continuity opened a lot of people’s eyes.
What I add to this pattern is structured handoff: the proposal artifact that carries not just “what to do next” but “what we learned” and “what surprised us” from the current turn. Context loss at turn boundaries is a real failure mode. Without a deliberate handoff mechanism, the next session re-discovers the same dead ends.
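The handoff artifact can be very small. Here is a sketch of the shape I have in mind, in Python; the field names, file format, and helper functions are my illustration, not a prescribed schema:

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class TurnProposal:
    """Handoff artifact written to disk at the end of each turn."""
    next_goal: str                                  # what the next turn should target
    learnings: list = field(default_factory=list)   # what we learned this turn
    surprises: list = field(default_factory=list)   # dead ends, so they are not re-discovered
    spec_ref: str = "SPEC.md"                       # the verification contract to check against

def end_turn(proposal: TurnProposal, path: Path) -> None:
    # Persist the proposal so continuity does not depend on conversation memory.
    path.write_text(json.dumps(asdict(proposal), indent=2))

def start_turn(path: Path) -> TurnProposal:
    # A fresh agent session rebuilds state from disk, not from context.
    return TurnProposal(**json.loads(path.read_text()))
```

The point is not the format; it is that “what we learned” and “what surprised us” travel across the context reset alongside “what to do next.”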
Each turn is designed to tighten quality. The spec prevents drift across sessions, gives review skills a contract to verify against, and makes “improve quality” a concrete, trackable activity rather than a vague aspiration. The turn count for a given feature is a meaningful measure: it tells you how much iterative refinement went in, and whether quality converged or the circuit breaker fired.
Convergence is what happens when the review-fix-review cycle measurably produces fewer issues each iteration. The community talks about “circuit breakers” (max iteration caps) and “termination conditions” (when to stop). Those are necessary but insufficient. A circuit breaker tells you when to give up. Convergence tells you when you are done. The distinction matters: a converging system is producing value with each iteration. A system that hits its circuit breaker has failed to converge, and you need to understand why.
Convergence detection is still partly aspirational in my system. I track machine-readable metadata (issue counts per review, commit hashes for scope) that will eventually enable automated convergence measurement. Today, I check it manually. The point is that the architecture supports it: every review produces structured data that a future orchestrator can consume.
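To make the distinction concrete, here is a minimal sketch of what an automated convergence check over that structured review data could look like. The thresholds and status names are illustrative assumptions, not part of my running system:

```python
def convergence_status(issue_counts, max_iterations=6, min_drop=1):
    """Classify a review-fix-review history.

    issue_counts: issues found at each review iteration, oldest first.
    Returns "converged", "converging", "stalled", or "circuit_breaker".
    """
    if issue_counts and issue_counts[-1] == 0:
        return "converged"            # the latest review found nothing: done
    if len(issue_counts) >= max_iterations:
        return "circuit_breaker"      # give up, then ask why it failed to converge
    if len(issue_counts) >= 2 and issue_counts[-2] - issue_counts[-1] >= min_drop:
        return "converging"           # each iteration is still producing value
    return "stalled"                  # lateral movement: issues are not dropping
```

Note the asymmetry: `converged` and `converging` are positive signals, while `circuit_breaker` is a failure that demands a diagnosis, exactly the distinction drawn above.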
Where This Actually Works
Philosophy is cheap. Carlini and Rajasekaran make strong arguments about verification-first development. I wanted to know if those principles hold up when applied to a real project over multiple iterations.
I have been using this system to build a diffusion-based handwriting style-transfer pipeline: a computer vision project that takes a photograph of someone’s handwriting and generates new text in that style. The pipeline has seven sequential stages (preprocessing, style encoding, generation, post-processing, layout, quality evaluation, composition) and uses multiple ML models (a fine-tuned diffusion model, a MobileNet style encoder, a TrOCR model for OCR validation). It is the kind of project where:
- A threshold change in one stage cascades through downstream stages in non-obvious ways.
- Automated metrics and human perception diverge. The CV metrics said height consistency was fine; human review said sizes varied 2.8x. They were measuring different things.
- You need human-in-the-loop data during the development cycle. I have structured human reviews with ratings, defect categories, and a findings tracker that promotes recurring issues into the codebase as curated test data.
- The test suite has three tiers (logic tests under 1 second, GPU-accelerated quality tests at 2 minutes, full end-to-end pipeline tests at 10 minutes) with 13 automated quality metrics that serve as regression gates.
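A tiered suite like that is easy to drive from a tiny selection table. This sketch assumes pytest with custom `gpu` and `e2e` markers and the pytest-timeout plugin; the tier names and budgets echo this post, but the wiring is illustrative:

```python
# Tier definitions: marker expression plus a per-test time budget in seconds.
TIERS = {
    "logic":   {"markers": "not gpu and not e2e", "budget_s": 1},
    "quality": {"markers": "gpu and not e2e",     "budget_s": 120},
    "e2e":     {"markers": "e2e",                 "budget_s": 600},
}

def pytest_command(tier: str) -> list:
    """Build the pytest invocation for one tier of the suite."""
    cfg = TIERS[tier]
    # --timeout comes from the pytest-timeout plugin (an assumption here).
    return ["pytest", "-m", cfg["markers"], f"--timeout={cfg['budget_s']}"]
```

The cheap tier runs on every turn; the expensive tiers run as gates before a turn is declared done.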
This is a single-engineer project, not a production system at scale. Carlini’s compiler is the evidence that verification-first principles hold at 100,000 lines and 16 parallel agents. What this project tests is the methodology: whether spec-driven turns, adversarial review, and structured handoff hold up across a real ML development cycle where the complexity is in cascading dependencies and metrics ambiguity, not in raw code volume.
The project has gone through multiple complete turns, each driven by a spec with concrete acceptance criteria like “height outlier ratio below 0.15” and “OCR accuracy above 0.7 on curated hard words.” The spec-driven approach forced me to define what “better” means numerically before implementing changes. That discipline, boring as it sounds, is what prevented the kind of drift that kills complex ML projects: where you keep tweaking parameters and think things are improving because the output looks different, not because it is measurably better.
The adversarial review layer caught real problems, like a post-processing defense layer that was clipping ink from the right edge of generated words, visible only in diagnostic output that the review process forced me to create. Without structured review criteria anchored to a spec, that bug would have survived as “sometimes the output looks a little off.”
Effort Selection and the Economics of Agent Compute
The Carlini compiler cost $20,000 in API calls. That number is a landmark because it makes autonomous coding economically tangible. If it costs $20,000 today, the question is not “can we afford it” but “when will it cost $200, and what do we build assuming it will?”
But cost optimization is not the most interesting implication. The more durable insight is about effort selection.
Today, you can choose which model to use for a task. Tomorrow, the spread between the cheapest adequate model and the most capable available model will be enormous. You will have a model that handles routine work for pennies per task. You will have a model so expensive you would not dare point it at a 2,000-session loop. Between those extremes is a sweet spot for steady-state work: the agents grinding through a task list, like Carlini’s 16 parallel instances.
The pattern that emerges is tiered effort. Use the adequate model for routine work. Use the expensive model when it matters: when the review loop detects that the agent is stuck, when the spec calls for architectural decisions, when convergence stalls. This is not speculative: my system already uses adaptive reasoning effort (effort: max on critical review and spec skills, default effort on routine operations). The principle will scale as the model ecosystem stratifies.
This means your harness needs to be effort-aware. It needs to know when to escalate. A system that always uses the best available model is wasteful, or maybe just economically impossible. A system that always uses the cheapest is leaving quality on the table. The interesting engineering is in the boundary: what signals tell you that this task needs more capability? Issue counts plateauing across iterations. Review findings that repeat without resolution. Spec criteria that resist satisfaction. These are the triggers for escalation, and they come from the verification loop, not from the implementation.
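Those escalation triggers can be read straight out of the review metadata. A sketch, with entirely illustrative names and thresholds:

```python
def should_escalate(issue_counts, finding_ids_per_review):
    """Decide whether the next iteration deserves the expensive model.

    issue_counts: issues per review iteration, oldest first.
    finding_ids_per_review: stable identifiers for each review's findings.
    """
    # Signal 1: issue counts plateauing (no drop over three iterations).
    if len(issue_counts) >= 3 and issue_counts[-1] >= issue_counts[-2] >= issue_counts[-3]:
        return True
    # Signal 2: the same findings repeating without resolution.
    if len(finding_ids_per_review) >= 2:
        repeated = set(finding_ids_per_review[-1]) & set(finding_ids_per_review[-2])
        if repeated:
            return True
    return False
```

Both signals come from the verification loop, not from the implementation, which is the point: the checking side of the system is what tells you where to spend capability.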
This connects back to the bitter lesson. The temptation is to encode effort-selection logic in elaborate rules. The durable approach is to give the system metrics and let it learn when to escalate. We are not there yet, but the architecture should assume we will be.
The Harness Problem
“Every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress-testing.”
– Prithvi Rajasekaran
This is the most important sentence in the harness design literature. Every piece of scaffolding you build is a bet against model improvement. Some bets are good (verification will always matter). Some are bad (the model needs to be told how to structure a function).
Daniel Miessler sharpened this into a warning with his concept of the “BLE-hobbled system” (Bitter Lesson Engineering-hobbled): a system where scaffolding has aged past its usefulness and is now actively making the overall system worse.
This is not hypothetical. Lance Martin at LangChain described exactly this experience, watching a carefully designed multi-agent research system become a bottleneck as models improved. The structural constraints he had built around earlier model limitations, from avoiding tool calling to hard-coded agent decomposition, prevented his system from benefiting from newer capabilities as they arrived. The scaffolding he built to help was now the thing holding him back.
This is the bitter lesson applied to harness design specifically. The risk is not just that your scaffolding becomes unnecessary. The risk is that it becomes a liability, and you do not notice because it still “works.” It works worse than doing nothing, but you cannot see that because you never test the counterfactual.
Boris Cherny, the engineer who created Claude Code at Anthropic, arrived at the same conclusion from the practitioner side. He adopted Sutton’s bitter lesson as a core design principle for the Claude Code team: bet on the general model, not on scaffolding around it.
Scaffolding might improve performance 10-20%, but those gains get wiped out with the next model generation.
– Boris Cherny, paraphrased
The reason those gains get wiped out is that model capability does not improve linearly. It arrives in step changes. Cherny identifies the release of Opus 4, Anthropic’s first ASL-3 model, as the discontinuity where Claude Code’s growth went exponential. The scaffolding you built to compensate for the old model’s weaknesses becomes the thing preventing you from benefiting from the new model’s strengths. Not gradually, but all at once when the next model drops.
My approach: build the minimal harness that provides verification, context continuity, and safety gates. Everything else is the model’s job.
Concretely, my harness consists of:
- A spec skill that defines acceptance criteria and manages turn transitions. This is the verification contract.
- Adversarial review skills (code review, security review) that evaluate output against the spec. These are the evaluator in Rajasekaran’s generator-evaluator pattern.
- A pre-push hook that blocks code from leaving the machine until review passes. This is the quality gate.
- Persistent review files (checked into git) that carry context across sessions. This is the inter-session memory.
- Minimal coding conventions that target specific failure modes (revert on regression, stop after two failed fix attempts, write tests in the same increment as functionality).
That is it. No elaborate prompt chains. No multi-step reasoning frameworks. No rigid agent orchestration graphs. The skills are Markdown files. The hooks are bash scripts. The conventions are plain text. If Claude Code gains a serious competitor or a different model pulls ahead, the work to port is swapping invocation syntax, not rethinking my architecture.
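To give a flavor of how thin the quality gate is: my actual pre-push hook is a bash script, but its logic fits in a few lines. Here it is sketched in Python, with an assumed JSON review-summary format (`passed`, `open_issues`) that is illustrative, not my exact schema:

```python
import json
from pathlib import Path

def gate(review_file: Path) -> int:
    """Return 0 if the latest review passed, 1 to block the push."""
    if not review_file.exists():
        print("no review found; run the review skill before pushing")
        return 1
    review = json.loads(review_file.read_text())
    # Block unless the review passed cleanly with zero open issues.
    if not review.get("passed") or review.get("open_issues", 1) > 0:
        print(f"review gate failed: {review.get('open_issues', '?')} open issues")
        return 1
    return 0
```

The hook calls this and exits nonzero to block the push. That is the entire gate: no orchestration, just a file the review skill writes and a check the hook reads.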
The harness is deliberately minimal because the bitter lesson says it should be. Anthropic’s own “Building Effective Agents” guide (December 2024) makes the same argument from the framework side: start with the least complex agent pattern that works, add structure only when earned by real failure modes. That advice has aged well. The models will get better. The verification will still matter. The scaffolding in between should be as thin as possible so you benefit from improvements you did not anticipate.
The One-Shot Ceiling
There is a question this article needs to address directly: if models keep getting better, why bother with specs, turns, and verification loops? Why not just describe what you want and let the model build it?
The honest answer is that you can, up to a point. That point moves outward with every model generation. What Opus 4.6 can one-shot today would have been impossible a year ago. Models are genuinely excellent at generating complete, working software from a description.
But there is always a ceiling, and it is structural, not a temporary limitation waiting to be patched. Five properties of complex projects guarantee it:
Requirements are underspecified. “Add authentication” implies dozens of decisions (session lifetime, token storage, error UX, rate limiting) that the requester has not articulated and may not have opinions on until they see a working version. An agent making all these decisions at once will get some wrong, and the cost of correcting compound errors is higher than making them incrementally.
Errors compound non-linearly. A wrong abstraction in step 3 of 20 does not just make step 3 wrong. It warps steps 4 through 20 to fit the bad abstraction. In an iterative loop, the wrong abstraction is caught at step 4 and corrected. In a one-shot, the agent builds confidently on a flawed foundation because it has no external signal that anything is off.
Context degrades with scale. As generation grows, early decisions (architecture, naming, module boundaries) get diluted by later code. The agent loses coherence with its own earlier choices. This is why Rajasekaran emphasizes context resets: sustained coherence requires periodic re-grounding from persistent artifacts.
Verification requires a different mode than generation. When the same agent generates and evaluates in a single pass, it is biased toward confirming its own choices. This is the self-congratulatory pattern Rajasekaran identified, and it is a property of the task structure, not the model’s capability. Better models do not fix it because the bias comes from the structure of doing both jobs in one pass, not from insufficient intelligence.
Human intent is discovered, not transmitted. Users refine what they want by reacting to what they see. A one-shot denies the user this feedback opportunity. The result may be technically correct and still miss the point, because the point only became clear through iteration.
These are not model limitations. They are properties of complex systems interacting with sequential decision-making. Better models push the ceiling higher, but they do not remove it. What changes is the definition of “complex enough to matter”: yesterday’s hard problem becomes tomorrow’s one-shot, and the frontier of what requires structured verification moves outward.
I call this the state-of-the-one-shot: the upper bound on project complexity that a model can reliably handle in a single pass at any given moment. It is real, it is impressive, and it rises with every model generation. But it is always a bound.
If you work within the state-of-the-one-shot, you can build real things without methodology, without specs, without verification loops. This is vibe coding, and it works, because the model is genuinely capable enough for the complexity you are targeting. The problem is not vibe coding itself. The problem is mistaking the ceiling for the sky. When you hit a project that exceeds the current state-of-the-one-shot, you get the pattern everyone recognizes: 80% done fast, then “make it better” produces lateral movement instead of convergence. The agent generates plausible changes that do not get closer to the goal, because there is no verification contract defining what the goal is, and no structural separation between generation and evaluation. You are stuck at 80%, and more prompting does not unstick you, because the obstacle is architectural, not linguistic.
The methodology in this post is for the complexity band above the one-shot ceiling: the space between “the model can handle it alone” and “no current approach works.” That band is where the interesting engineering lives, and where the leverage is highest. You apply human judgment to exactly the parts that require it (specs, acceptance criteria, convergence decisions) and let compute handle everything else (implementation, review, iteration).
The bitter lesson says the one-shot ceiling will keep rising. The practical lesson: do not bet your project on it being high enough today.
The Autonomy Spectrum
I think about autonomy as a spectrum, not a binary:
- Supervised: the agent proposes, the human reviews everything.
- Gated: automated review must pass before code leaves the machine. The human reviews outcomes, not individual steps.
- Autonomous: review-fix-review loops run without human intervention per cycle. The human designs the spec and checks convergence.
- Multi-agent: parallel agents across branches with shared verification state.
My system is currently at Gated, moving toward Autonomous. The foundation for autonomous loops is already in place: structured review metadata, convergence-oriented architecture, turn-based iteration with file-based continuity. What remains is the loop orchestrator and the circuit breakers that make it safe to let the system run unattended.
Moving between stages is not a matter of confidence. It requires concrete validation: a telemetry completeness audit (are you capturing everything the system does?), reward-function validation (do your metrics rank outcomes the way a human expert would?), safety harness burn-in (have your circuit breakers been tested against simulated failures?), and a human reviewer capacity check (can reviewers keep up with the change cadence, or has approval latency become the bottleneck?). Skipping these steps is how you get autonomous systems that are confidently wrong.
I do not think fully autonomous coding (hand it a ticket, get a PR) is imminent for complex projects. The “80% problem” that Addy Osmani describes is real: agents rapidly generate 80% of the code, but the remaining 20% requires judgment, context, and architectural taste that current models do not reliably provide. The interesting work is in the middle: how do you structure the handoff between human judgment and agent execution so that the human’s time is spent on the 20% that matters?
Spec-driven development is my answer. The human writes the spec (the 20% that requires judgment). The agent implements, reviews, and iterates (the 80% that benefits from compute). The verification loop is the interface between them.
What I Think Is Durable
Some of what I have described will be obsolete in a year. The specific model choices, the exact cost curves, the particular failure modes that my coding conventions target. Models will improve, costs will drop, and some of today’s failure modes will simply stop occurring.
Here is what I think survives:
Verification as the ceiling. This is the Carlini Principle and it is structural, not contingent on model capability. No matter how good the model gets, you cannot trust output you cannot verify. The investment in test suites, review mechanisms, and quality metrics will compound indefinitely.
Spec as the control mechanism. Agents need concrete acceptance criteria. The form may change, but the need for a human-legible verification contract that defines “done” will not go away. The spec is where human judgment enters the loop.
Turn-based iteration with structured handoff. Context windows will grow, but the fundamental problem of context degradation over long sessions is architectural, not just a capacity limitation. Periodic resets of the model’s context window, with deliberate transfer of what was learned, will remain necessary for complex projects.
Effort-aware compute. The spread between cheap and expensive models will widen, not narrow. Systems that can match effort to task difficulty will outperform systems that use a single tier. This is an economic argument, not a technical one, and economic arguments tend to be durable.
Minimal harness design. The bitter lesson says: do not over-engineer the scaffolding. Build for verification and context continuity. Let the model handle everything else. Stress-test your assumptions regularly, and remove scaffolding when the model no longer needs it.
All of these point in the same direction. I am building toward autonomous coding loops. I believe the step-function in model capability that makes this practical has already happened. For many of us, it arrived with Claude Opus 4.6 for coding, but regardless of which model or moment you point to, the capability is here as of this writing. The architecture you need for autonomous operation is also the architecture that makes supervised and gated operation better. Every investment in verification, spec quality, and structured iteration pays off today, even before the loop closes.
There is a name for where all of this leads. The system I have described – spec-driven goals, verification loops, convergence signals, effort-aware compute – is a human-operated prototype of what I call an agent-hypervisor: a system that designs, deploys, monitors, and self-optimizes entire fleets of LLM agents. Today I play the hypervisor role manually, writing specs, checking convergence, deciding when to escalate effort. The architecture is built so that each of those functions can be automated incrementally, without rethinking the overall design. The destination is self-optimizing agent fleets, and eventually, computational patterns that no human would design. The path runs through everything described in this post.
What It Actually Feels Like
I promised this post was about the bitter lesson, so let me tell you what the bitterness actually feels like.
My journey through AI-assisted coding was probably typical. Copy-pasting code and functions from ChatGPT into my editor. (Wow, this actually works.) Tab-completion in Cursor. Having the Cursor agent modify code for me, then write code, then eventually driving all code creation through conversational prompts. But I was still in an IDE, so at any given moment I could review and even modify the code directly. I was still a programmer, just a faster one.
Then I switched to Claude Code, a TUI. Terminal User Interface. No IDE. No syntax-highlighted editor pane with my code in it. Just a prompt where I type what I want, and an agent that goes and builds it.
That is where it got hard.
The disorienting thing was that I already knew Claude Opus 4.6 was excellent. That is what I had been using in Cursor, same model, same capabilities. How is this any different? But it felt completely different. In the IDE, I had the illusion of control: I could see the code, I could intervene at any line. In the TUI, I was just typing instructions and trusting the agent. One part of me thought: I have been shipping software for thirty years and now I am typing wishes into a terminal. Another, deeper part thought: can this stuff really work?
The answer is yes, with a qualifier: it depends on what you mean by “far.” For simpler projects, ones that would have been too hard before the step-change, raw Opus 4.6 in a TUI can take you remarkably far without much methodology at all. For complex projects like the CV pipeline I described, 11,000 lines of Python with a 7-stage ML pipeline, you need the methodology: specs, acceptance criteria, turns, convergence checking. Without it, the agent produces impressive-looking output that quietly drifts from the goal. With it, the agent builds something that actually works. And this cycle will repeat with every step-function improvement in model capability: yesterday’s “needs methodology” becomes tomorrow’s “just ask the model,” and the frontier of what requires structured verification moves outward.
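Convergence checking, as opposed to just watching the agent work, can be made mechanical. Here is a minimal sketch; the function name and the pass-count heuristic are my illustrative assumptions, not part of any particular tool:

```python
def converging(pass_counts: list[int], window: int = 3) -> bool:
    """Heuristic convergence signal: the run is converging if the
    number of passing acceptance criteria improved within the last
    `window` turns; a flat or falling count means it is spinning,
    and effort should be escalated (or a human should look)."""
    if len(pass_counts) < window + 1:
        return True  # too few turns to call it either way
    recent = pass_counts[-(window + 1):]
    return max(recent[1:]) > recent[0]

# Pass counts per turn for two hypothetical runs:
assert converging([2, 3, 3, 5])      # still improving: keep going
assert not converging([4, 4, 4, 4])  # stuck: time to intervene
```

The point is not this particular heuristic but that “converging versus spinning” is a checkable property of the run, not a feeling you get from reading diffs.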
But the real punchline, the thing I gained that I did not expect, was a view of how this will actually go in the future: more and more complex, end-to-end solutions like Carlini’s compiler. Getting there requires internalizing the bitter lesson: letting go of writing the code, investing in defining the outcomes, and trusting the verification loop instead of your eyes on the diff.
I am a real software engineer. In fact, using these tools beyond the level of a vibe coder requires me to be one. For now.
The spec does not write itself. The acceptance criteria do not emerge from thin air. The judgment about when the system is converging versus spinning, about what the next turn should target, about which architectural bet to make, that is engineering. It is just a different kind of engineering than the one I spent thirty years learning.
That is the bitter lesson. It is bitter because the old skill mattered and was hard-won for me, and it is a lesson because the new skill matters more.
Start Here
If you want to try this approach, here is what I would do Monday morning:
- Write a spec before you start your next feature. Not a design doc. A list of concrete acceptance criteria that define done. Make them checkable: “OCR accuracy above 0.7,” not “improve text recognition.”
- Separate generation from evaluation. Even if you are using a single model, do not let the same session that wrote the code judge the code. Run a review pass with adversarial criteria. Check against the spec.
- Make progress survive across sessions. Write your findings to disk. When the next session starts, it should be able to pick up from a file, not from your memory of what happened.
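Those three steps fit together in a few dozen lines. A sketch, assuming acceptance criteria are expressed as predicates over measured metrics; the file name, criterion names, and thresholds here are illustrative, not from any framework:

```python
import json
from pathlib import Path

STATE = Path("agent_state.json")  # hypothetical state file

# 1. The spec is a set of checkable criteria, not prose.
SPEC = {
    "ocr_accuracy_above_0.7": lambda m: m["ocr_accuracy"] > 0.7,
    "all_tests_pass":         lambda m: m["failed_tests"] == 0,
}

def evaluate(metrics: dict) -> dict:
    """2. Evaluation is a separate pass: it only checks metrics
    against the spec, it never touches generation."""
    return {name: check(metrics) for name, check in SPEC.items()}

def end_turn(turn: int, metrics: dict) -> None:
    """3. Persist findings to disk so the next session resumes
    from a file, not from anyone's memory of what happened."""
    state = json.loads(STATE.read_text()) if STATE.exists() else {"turns": []}
    state["turns"].append({"turn": turn, "results": evaluate(metrics)})
    STATE.write_text(json.dumps(state, indent=2))

end_turn(1, {"ocr_accuracy": 0.74, "failed_tests": 0})
print(json.loads(STATE.read_text())["turns"][-1]["results"])
# → {'ocr_accuracy_above_0.7': True, 'all_tests_pass': True}
```

In practice the metrics come from running the test suite or the pipeline itself; the shape that matters is spec as predicates, evaluation as a separate function, and state on disk.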
If you happen to be on Linux running Claude Code, zat.env implements all of this as a lightweight framework: spec-driven development, adversarial code review, security auditing, and a pre-push quality gate, installed with a single script. It is my personal environment and it is opinionated, but it is open source and you are welcome to use it or steal the parts that make sense for your workflow. The concepts are more important than my particular implementation. Skills are Markdown. Hooks are bash. Conventions are plain text. The whole thing is designed to be replaced by something better, which is kind of the point.
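To make the pre-push quality gate concrete, here is the idea sketched in Python rather than bash (zat.env’s actual hook is bash, and the check commands below are placeholders for your project’s real ones):

```python
import subprocess
import sys

# Placeholder checks: substitute your project's real commands.
CHECKS = [
    ["python", "-m", "pytest", "-q"],  # tests must pass
    ["ruff", "check", "."],            # lint must be clean
]

def main() -> int:
    """Run each check in order; any non-zero exit blocks the push,
    because git aborts when the pre-push hook exits non-zero."""
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-push gate failed: {' '.join(cmd)}", file=sys.stderr)
            return 1
    return 0

# Installed as .git/hooks/pre-push, this would end with: sys.exit(main())
```

The gate is the last line of the verification loop: nothing reaches the remote unless it passes the same checks the agent was steered by.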
References:
- Sutton, R. (2019). The Bitter Lesson. The foundational argument that general methods leveraging computation beat hand-engineered domain knowledge.
- Jones, N. B. (2026). Claude Mythos Changes Everything. The video that prompted me to revisit Sutton’s bitter lesson in the context of agentic coding. Jones argues that step changes in model capability make over-engineered scaffolding a liability.
- Martin, L. (2025). Learning the Bitter Lesson. A detailed case study from LangChain of watching rigid agent scaffolding become a bottleneck as models improved.
- Miessler, D. (2026). Bitter Lesson Engineering. Coined “Bitter Lesson Engineering” and the concept of a “BLE-hobbled system” where scaffolding has aged to the point of making the system worse.
- Carlini, N. (2026). Building a C Compiler with a Team of Parallel Claudes. Anthropic Engineering. 16 parallel Claude agents produce a 100K-line compiler, demonstrating that verification loop quality is the ceiling on autonomous output.
- Rajasekaran, P. (2026). Harness Design for Long-Running Application Development. Anthropic Engineering. Generator-evaluator architecture for extended autonomous sessions, with the key insight that separating generation from evaluation produces measurably better results.
- Cherny, B. (2026). Head of Claude Code: What happens after coding is solved. Lenny’s Podcast. The creator of Claude Code on adopting Sutton’s bitter lesson as a core design principle: bet on the general model, not on scaffolding around it. Identifies Opus 4 as the step change where agentic coding went exponential.
- Anthropic (2024). Building Effective Agents. Foundational taxonomy of agent design patterns, with the key advice: start with the least complex pattern that works.
- Osmani, A. (2026). The 80% Problem in Agentic Coding. The gap between what agents generate rapidly and the remaining work that requires human judgment.
- Huntley, G. (2025). The Ralph Wiggum Loop. Stateless iteration with file-based continuity, the pattern that opened my eyes to turn-based agent architecture.