The Agentic Transition Playbook
Practices that were nice-to-have for humans become load-bearing for agents. The standards, lints, types, specs, and review conventions you used to nudge toward when humans wrote the code become required when agents do. Discipline does not go away in agentic development; it goes up. (I made the longer case in Discipline Is the Point.)
Most organizations will not fail at agentic development. They will buy licenses, get some productivity wins, ship faster PRs, and stop there. What they will fail at is reaching the next plateau. Agentic transformation is not a single event, but a series of plateaus, and each one requires that the engineering organization has built the capacity to absorb what agents produce at that scale. Review is the first place a stalled climb shows up, with PRs arriving faster than humans can evaluate them. That is the visible symptom of a deeper limit. Everything underneath was built for human-paced engineering, and none of it can carry the throughput and decision load the next plateau demands.
Most CTO discussions of this problem are about adoption: pick a tool, get licenses, write a memo. That is not the work. Agentic transformation is a transformation of your entire codebase and organization. Buying a license for Claude Code or Codex does not start or finish it. Your structures change, your people change, your code changes, and the tools around it all change. This playbook is about one slice of that transformation: rebuilding how engineering plans, writes, reviews, and ships code so humans and agents operate inside the same system rather than talking past each other. It is the engineering transition. Engineering then carries the rest of the organization with it, but that broader transformation is a different essay. It is also a playbook for existing organizations, ones with codebases, teams, customers, and constraints. A startup starting fresh has different choices; this is not for them.
Everything below is what we have actually built on a real, brownfield codebase: multiple languages, services with operational history, and the usual accumulated constraints. The playbook so far has produced a 3x productivity improvement across engineering, with the change failure rate held flat. We are aiming for 6x in July. This is not a thought experiment.
The mechanism to keep in mind through every layer below is this: move controls into the automation layer, and put them where the agent can see them. Anything a human enforces by hand becomes a bottleneck the moment agents start writing most of the code. Anything the type system, lint config, test suite, or deploy gate enforces scales. But automation only works if the agent doing the work can reach it. A spec the agent cannot read is not a spec. A custom lint that lives off the agent's path is not enforcement. For us, this means everything load-bearing lives in the repo. Most of what follows is a project of pushing controls downward, into machinery, and inward, into the context the agent already has.
The agentic transition has a definite shape. The shape has layers. The layers have to be built in order. There are four layers and a method.
Substrate
The substrate is the physical plant. Without it, nothing else holds up. It is targeted at four specific failures: context fragmentation across repositories, environment variance across machines, agent context humans cannot verify, and the loss of flow control when work moves between an engineer's desk and the rest of the organization faster than anyone can track.
The unifying principle underneath everything that follows is that the desktop is the working environment for both humans and agents. Anything an engineer needs to do their job should be doable locally, and anything doable locally should be reachable by an agent running there. That includes code and documentation, which the monorepo handles. It includes tooling and dependencies, which Nix handles. It includes production observability (logs, metrics, traces, deploy state), which we pull in through CLIs that the agent can call directly. The exceptions are the things you genuinely cannot put on a laptop, and those should be the only exceptions.
The reason this matters is mechanical. In development, agents work by iterating: write, run, observe, adjust. The shorter the loop, the faster they ship. Local iteration is faster than any iteration that has to round-trip through a remote system, and the difference compounds across the dozens or hundreds of cycles inside a single task. In operations, the more an agent can see directly, the more useful it is to the engineer next to it. An agent that can pull a Prometheus metric or tail a log without escalating to a human is one that completes the diagnostic instead of asking for help.
We are not all the way there. We are moving in that direction deliberately, expanding what an agent can do locally as we build out each capability. The further we get, the better the agents work.
Start with a monorepo, and make it ambitious. Code, services, deploy configuration, yes, but also documentation. Architecture notes, runbooks, design docs, post-mortems, the things that used to live in three different wikis nobody updates. Pull them in. Context locality is one of the highest-leverage moves you can make in this transition. When everything an agent might need is one repo away, the answers it gives are dramatically better, and the same is true for humans. As a bonus, the cost of cleaning up after an agent drops sharply when there is one place to look, and shared automation (lints, scripts, code generators) has one place to live.
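To make "ambitious" concrete, here is an illustrative shape. The paths are invented, not ours; the point is that code, docs, specs, and shared tooling sit one directory away from each other:

```text
repo/
  services/ingest/       # code, tests, and deploy config side by side
  services/billing/
  docs/architecture/     # design notes and decision records
  docs/runbooks/         # operational knowledge agents can actually reach
  docs/postmortems/
  specs/                 # the spec artifacts described under Control
  tools/lints/           # shared custom lints, one place to live
  tools/cli/             # the in-house CLIs described below
```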
This is a real lift in an existing organization, and you should plan for it to take a long time. The good news is that the migration provides its own cadence and feedback loop: each project that lands in the monorepo picks up everything you have already built, including shared linting, shared automation, shared documentation, and the ability for an agent to see across what used to be repository seams. You can read your progress directly from what is in versus what is still out. We are still in the middle of ours. If your organization genuinely will not commit to the monorepo, the floor is to make the same context locally available wherever your agents work: pull in adjacent code, docs, and configuration. But understand what you are accepting. Without the monorepo, context locality stops being enforced and starts being aspirational, and stale context is the most reliable failure mode there is. The monorepo is the right answer. If you will not buy the substrate, you are not doing an agentic transition; you are running an experiment on top of an unchanged organization.
Then solve the environment distribution. We use Nix and Home Manager to deliver skills, tools, and baseline configuration to every engineer's desktop. Every person on the team has the same floor. That matters because agents interact with local tooling constantly, and variance in local tooling is variance in output.
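The mechanism, sketched as a single Home Manager module; the packages are stand-ins for whatever your floor actually contains:

```nix
# One module, delivered to every engineer's desktop, defines the shared
# floor. Package choices here are illustrative, not our baseline.
{ pkgs, ... }:
{
  home.packages = with pkgs; [
    ripgrep   # tools agents call constantly should be identical everywhere
    jq
    git
  ];
  # Baseline configuration ships the same way as tools do.
  programs.direnv.enable = true;
}
```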
Then solve context access. Agents need to read logs, query metrics, pull traces, and look at production state: context that does not live in the repo. MCP servers are the popular answer. We have not had good results with them. They are nondeterministic. Humans cannot easily run them directly to verify what an agent saw. They are hard to test. Failures are inconsistent and expensive to debug. CLIs do not have those problems. A CLI is deterministic; the same command produces the same output. It is inspectable. It is a shared language between humans and agents: anything an agent runs, an engineer can run, with the same syntax and the same result. Agents are extraordinarily capable with command-line tools; they have seen millions of them. A well-designed CLI that fronts your logs, your Prometheus, your deploy system, or your service registry gives an agent a stable, scriptable, composable interface to the parts of your world that don't live in code, and gives a human a way to check the agent's work. The bonus is that you don't wait on vendors to ship MCP integrations for your stack. You build the CLI you need, and your agents pick it up immediately.
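The shape of such a CLI, sketched in Rust assuming clap for argument parsing; the tool name (`obs`) and its subcommands are invented for illustration:

```rust
// A deterministic, inspectable front for production context. Anything an
// agent runs here, an engineer can rerun verbatim and get the same output.
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "obs", about = "Read-only window into production state")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Tail recent log lines for a service.
    Logs {
        service: String,
        #[arg(long, default_value_t = 100)]
        lines: usize,
    },
    /// Evaluate a PromQL expression and print the result as JSON.
    Query { promql: String },
}

fn main() {
    match Cli::parse().command {
        // Bodies elided: each arm would call the backing store's HTTP API
        // and print a stable, scriptable representation to stdout.
        Command::Logs { service, lines } => println!("(would tail {lines} lines from {service})"),
        Command::Query { promql } => println!("(would evaluate {promql} against Prometheus)"),
    }
}
```

The contract is stdout: stable text a human can read and an agent can script against, with the same syntax producing the same result for both.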
Then solve flow into and out of the engineer's desk. Once agents are producing work at scale, the boundary between an individual engineer and the rest of the organization becomes the system's busiest interface. It has two sides. The outflow side is what leaves: shaped, reviewable units of work moving toward the rest of the team. The inflow side is what returns: review demand, prioritization, the queue of things needing attention. We treat these as a paired system. We call ours Stack and TRM. Stack handles outflow. It defines a unit of work bigger than a commit and smaller than a project, the natural shape of what an engineer (or an agent working under one) sends out. TRM handles inflow. It is a dashboard of action items the engineer needs to intake, the instrument that prevents the review side from going blind. The names do not matter; the pair does. Without both, the desk becomes a bottleneck in one direction or the other.
Underneath all of this is CI/CD, and you cannot afford to be casual about it. Once agents are producing code at volume, your CI/CD pipeline is the backbone on which everything else rides. The monorepo's shared automation runs through it. The custom lints fire there. The guardrails execute there. The merge that closes a Stack happens there. A red pipeline is not an inconvenience; it is the entire system going dark. Treat a broken main build as an on-call event, just as you would a production outage, because it is one. So much is moving through the pipeline at once that any time it breaks, you lose throughput across every layer above it. The discipline of keeping it green is what makes the rest of the playbook possible.
Grammar
Substrate is the house. Grammar is the set of rules about what you are allowed to say in it.
Language and stack decisions come first. We chose Rust as the emerging primary backend language. No new TypeScript, no new Kotlin, no new Node. Agents do better work in languages with strong types and clear semantics, and the cost of constraining the stack is far lower than the cost of letting every repo pick its own tools. There is also a deeper reason: Rust pushes more of the control surface into the type system, where the compiler enforces it instead of a reviewer. That is the direction we want every layer of the stack to go. (I made the longer case in Agents Reprice Everything.)
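One generic illustration of what "control surface in the type system" means, with an invented domain type:

```rust
// A reviewer does not have to catch a raw string used where a tenant
// identifier belongs; the compiler refuses it at every call site.
#[derive(Debug)]
pub struct InvalidTenantId;

pub struct TenantId(String);

impl TenantId {
    /// The only way to construct a TenantId, so validation happens
    /// exactly once, at the boundary.
    pub fn parse(raw: &str) -> Result<Self, InvalidTenantId> {
        let ok = !raw.is_empty()
            && raw.chars().all(|c| c.is_ascii_alphanumeric() || c == '-');
        if ok { Ok(TenantId(raw.to_owned())) } else { Err(InvalidTenantId) }
    }
}

// An agent writing this call site cannot hand over unvalidated input:
// the signature is the rule, and the compiler is the reviewer.
pub fn delete_tenant_data(tenant: &TenantId) {
    let _ = tenant; // body elided
}
```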
Lint configuration is the next layer, and you should take it as seriously as language choice. An opinionated lint config is how you teach an agent what good means in your house. Rust makes this especially powerful because the language itself is unusually capable of encoding intent. Deny-by-default rules become a real teaching mechanism, not just style enforcement. The default rules from Clippy and equivalents are necessary but not sufficient. You will need custom lints for the conventions specific to your codebase: your error type, your domain types, and the patterns you actually care about. As an example, we run a custom linter, built on dylint, that enforces exactly these house conventions. The names are unimportant. What matters is that one custom rule, written once, eliminates a whole class of mistakes that an agent would otherwise reliably make. Find the equivalents in your codebase and write the lints. They pay for themselves immediately.
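For flavor, the silhouette of a dylint lint. Rustc's internal APIs drift between toolchains, so take this as shape rather than pinned code, and the rule shown, flagging bare `unwrap`, is a stand-in for the kind of house convention Clippy cannot know about:

```rust
#![feature(rustc_private)]

extern crate rustc_hir;
extern crate rustc_lint;

use rustc_hir::{Expr, ExprKind};
use rustc_lint::{LateContext, LateLintPass};

dylint_linting::declare_late_lint! {
    /// Stand-in rule: forbid bare `unwrap`, pointing at the house
    /// error type instead.
    pub HOUSE_ERROR_CONVENTION,
    Warn,
    "use the house error type instead of unwrapping"
}

impl<'tcx> LateLintPass<'tcx> for HouseErrorConvention {
    fn check_expr(&mut self, cx: &LateContext<'tcx>, expr: &'tcx Expr<'tcx>) {
        if let ExprKind::MethodCall(segment, ..) = expr.kind {
            if segment.ident.name.as_str() == "unwrap" {
                clippy_utils::diagnostics::span_lint(
                    cx,
                    HOUSE_ERROR_CONVENTION,
                    expr.span,
                    "bare `unwrap`: return the house error type instead",
                );
            }
        }
    }
}
```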
Review conventions are the third layer. PRs scoped to logical changes, not arbitrary unit counts. SQL-first over ORM tricks. Mocks treated as an architectural smell. These are not opinions about style. They are the grammar of how code enters the system, and they are enforceable in ways that scale with agent throughput.
Control
Substrate and grammar are static. Control is how you steer.
Control has two sides. The forward side is the spec. A spec names the outcome the thing is supposed to produce, the decisions already made, the constraints, and what is out of scope. Without one, an agent does what it interprets; with one, it does what was intended. That sounds small. It is the difference between work you ship and work you redo.
Specs were always good practice. With humans, you could often get away without one. Tacit understanding, hallway conversation, and a senior engineer's judgment carried the work through ambiguity. With agents, none of that machinery exists. The spec becomes the durable artifact that holds intent, and it is the unit of leadership work in agentic development. We call ours Spec, plainly. The name and the implementation do not matter, but one constraint does: the spec has to live where the agent can read it. A spec sitting in a tool the agent cannot reach is not part of the system. For us that means in the repo, alongside the code it governs. What matters is that the spec becomes a durable, addressable artifact that agents can read and humans can audit.
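The format matters far less than the fields. A hypothetical skeleton, not our actual Spec format:

```markdown
# SPEC-0421: Rate-limit the public ingest API

## Outcome
Tenants cannot push more than 100 requests/second; overflow gets 429s.

## Decisions already made
Token bucket per tenant, enforced at the gateway, not in each service.

## Constraints
No new infrastructure; limits configurable without a deploy.

## Out of scope
Billing-tier-specific limits. Internal service-to-service traffic.

## Links
Stacks: (filled in by the weave as the work decomposes)
```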
The reverse side is guardrails. Guardrails define the shape of acceptable output and catch divergence between intent and result before a human review cycle is spent on it. Tests that encode the spec. Review checks that compare output against intent. Deploy gates that refuse output failing either. Dependency scans, secret detection, SBOM generation, policy-as-code: all the same shape, all more important once agents are producing code at scale. Lint catches grammar violations; guardrails catch intent violations. They are different mechanisms aimed at different failures.
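One of those shapes made concrete: a test that encodes the hypothetical spec sketched above, so divergence between intent and output fails CI before a human review cycle is spent on it. The limiter here is a minimal stand-in so the example runs on its own:

```rust
use std::collections::HashMap;

/// Minimal stand-in limiter so the test below is self-contained.
struct RateLimiter {
    cap: u32,
    used: HashMap<String, u32>,
}

impl RateLimiter {
    fn new(cap: u32) -> Self {
        Self { cap, used: HashMap::new() }
    }
    fn try_acquire(&mut self, tenant: &str) -> bool {
        let n = self.used.entry(tenant.to_owned()).or_insert(0);
        if *n < self.cap { *n += 1; true } else { false }
    }
}

// The test names the spec it encodes. If an agent's change breaks the
// intent, the gate catches it, not a reviewer.
#[test]
fn spec_0421_caps_tenant_bursts_at_100() {
    let mut limiter = RateLimiter::new(100);
    for _ in 0..100 {
        assert!(limiter.try_acquire("tenant-a"));
    }
    assert!(!limiter.try_acquire("tenant-a"), "SPEC-0421: the 101st request must be refused");
}
```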
Most shops have one or the other. Spec without guardrails is aspirational. Guardrails without spec is whack-a-mole. The combination is what makes throughput safe, and throughput is the whole game once agents write most of the code.
Treat spec and guardrails as peers. They are the two halves of the same mechanism.
The Woven System
The layers above are necessary but not sufficient. The thing that turns them into a system is the weave.
The mental model is a graph. Spec is the root. Stack is the decomposition. Commits are the leaves. TRM is the view. In a well-woven system, every artifact carries its parent. A commit carries its stack ID. A stack carries its spec ID. TRM reads the links and shows the spec behind every piece of review work, one click away. The graph runs both ways. Open a spec and see every stack underneath it and every commit on every stack.
The cheap version is an ID convention and a link field in each tool. The real version refuses to let artifacts exist without parent links. A commit without a stack reference is rejected. A stack without a spec reference is rejected. The graph becomes load-bearing rather than decorative.
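What "rejected" means mechanically, sketched as a CI step that reads each commit message on stdin; the trailer name is an invented convention, not our actual one:

```rust
// Reject any commit that does not carry its parent link. Run in CI for
// every commit on the branch: the graph stays connected or nothing merges.
fn has_stack_link(message: &str) -> bool {
    message
        .lines()
        .any(|line| line.trim_start().starts_with("Stack-Id:"))
}

fn main() {
    let msg = std::io::read_to_string(std::io::stdin())
        .expect("commit message on stdin");
    if !has_stack_link(&msg) {
        eprintln!("rejected: no Stack-Id trailer; commits may not exist outside the graph");
        std::process::exit(1);
    }
}
```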
Traceability is a nice side effect. The real payoff is that agents can use the graph too. An agent picking up review work pulls the spec without being asked. An agent starting a new stack checks whether the spec has already been partially addressed elsewhere. Humans stop being the integration layer. That is the shift.
Method
Substrate, grammar, control, and weave are what. Method is how.
Sequencing matters because the layers depend on each other. Substrate first. Grammar second. Control third. The weave comes when the pieces are stable enough to link together. Trying to install control before grammar produces guardrails that argue with the language choices. Trying to weave before control produces a graph of artifacts nobody trusts. When the playbook stalls, the failure is usually one layer below where you are investing.
Throughput at the review layer is the hidden bottleneck. Agents generate PRs faster than humans can review them. If the review layer does not evolve, the whole transition stalls at the gate. Logical-change-scoped PRs, TRM, and well-shaped guardrails are all attacks on the same problem. The harder question is whether agents themselves should review. We don't trust them to. Human review is the last gate before production, and we have not pulled back on it. The work is making that gate easier for humans to walk through, with better diffs, better context, and better tooling. Not removing the human from it. Other shops will make different calls here. We are not ready to.
The system is iterative. The playbook above is a shape, not a blueprint. We have walked back rules that seemed obviously right (one-commit-per-PR was one of them) and added rules we did not know we needed. The lint config evolves. The spec format evolves. The weave gets tighter as we discover artifacts that should have carried parent links and didn't. Treat the system as something you tune, not something you install.
The tools change, too. We still use Jira for operational ticketing, but Jira is no longer where our project planning lives. Spec, Stack, and TRM are. The existing tools were built for a different world, and as you replace them, you get the chance to build the observability you actually need: review throughput, agent-PR rates, spec-conformance gaps, migration progress. You cannot tell whether the playbook is working without those instruments. Build them early enough to use them.
Bandwidth allocation is the related risk. A lot of your engineering capacity will flow back into engineering itself in the early stages: building TRM, the spec system, the linters, the CLIs, the migration. That is correct in the short term and wrong as a steady state. The whole point of the transition is leverage on customer outcomes, not a permanent renovation project. Once the substrate and grammar are stable, push the bandwidth back out toward revenue work.
Leading non-engineering leadership through this is its own skill, and it is part of the playbook, not adjacent to it. On the org chart, the transition is invisible. It looks like a normal engineering department doing normal engineering work. Without language that makes the invisible legible, executive sponsors get nervous in the middle, the grammar and control layers get second-guessed, and the weave never gets built. The CTO's job here is not just architectural; it is translation.
People
Everything above is built on top of something it cannot replace: the engineering organization itself. The most fragile assumption in the playbook is that the org will actually absorb the rules.
Every layer is contested ground. Choosing Rust tells a fraction of your engineers that the career capital they've built is depreciating. Mandating specs threatens engineers whose authority came from being the keeper of undocumented decisions; that authority evaporates when intent is written down, versioned, and addressable by agents. Custom lints are read by some as enforcement of one person's opinions. The weave is read as surveillance. None of these readings is wrong, exactly. The transition is a real shift in who has what kind of leverage, and people are correct to notice.
This is the work that does not show up in any architecture diagram, and it is where the playbook lives or dies. You cannot drive a transformation of this scope without the respect of your team. If you do not already have it, if you have not built the kind of trust where engineers extend you the benefit of the doubt on hard calls, none of the layers above will land. They will be undermined slowly, by people who have other ways of getting their work done and no reason to believe your way is better.
The CTO translation layer mentioned earlier is part of this, but it works in two directions. Outward, you are explaining the transition to non-engineering leadership in terms of business leverage. Inward, you are explaining the transition to your engineers in terms of their craft and their careers: why this is more interesting work, not less; why their judgment matters more, not less; what the path looks like for the person whose primary skill no longer fits the new grammar. Migration paths are not a soft-skills nicety. They are how you keep the people whose tacit knowledge of your systems is irreplaceable.
Some friction is the signal that the system is doing what it is supposed to do. Authority moving from individuals to artifacts is the explicit goal. But friction the leadership has not earned the right to impose is what kills these transitions. Earn it first, or expect the playbook to stall.
This is not a tool adoption story. It is a rebuilding of how code gets planned, written, reviewed, and shipped so that agents and humans operate inside the same system rather than talking past each other. The substrate gives them a shared floor. The grammar gives them a shared language. Control gives them a shared steering mechanism. The weave gives them a shared memory. Method is what gets it built in the right order without stalling the business.
And none of it is built once.