
The Flood Moves Upstream


Six months ago, I wrote that a flood was coming. The PR Tsunami argued that agentic development would turn engineering into a firehose of change that the rest of the system wasn't built to absorb. That piece was speculative.

This is the update. The short version: the original called it. Agentic tooling doesn't just accelerate work; it dissolves whatever human serialization point it's applied to. Engineering got faster, hit a new ceiling, and the new ceiling sits outside engineering, which is exactly what the original predicted. What the original couldn't know, and what's observable now, is how fast the pattern moves and what it looks like from the inside while it's happening. That's most of what I want to cover here.

The step function

When I took over as CTO of my current organization, we deployed about 20 times a month across 33 engineers. That's roughly one deployment per engineer every seven weeks. For a non-engineering-led company with a legacy stack and the accumulated habits of a team that had been through several reorgs, it wasn't surprising.

We spent the next several years on the usual work: trunk-based development, test discipline, CI/CD investment, small-batch delivery, and getting rid of the rituals and structural bottlenecks that made changes expensive. By the middle of last year, we'd settled into a stable rhythm of 250 to 300 deployments a month across a team we'd deliberately made smaller, around 21 engineers. Call it 13 deployments per engineer per month. That's a twentyfold improvement per capita over where we started, and it took a lot of scar tissue to get there.

We held that rate for six months. The system had reached equilibrium, given the assumptions we were operating under.

In January, we rolled out agentic tooling across the team in earnest. Claude Code, internal scaffolding, and a set of conventions we'd been sharpening. January was messy. Adoption curves were still climbing, and the numbers didn't mean much. February was the first clean month.

February: 720 deployments. March: 750.

None of this happened in a vacuum. We'd been anticipating the shape of these problems for a while and building tooling to handle them. Our stack tool manages the PR stacks the team is now producing at volume. Our review-management tool gives engineers a way to handle the incoming review load without drowning in it. We have a growing collection of internal conventions and scaffolding that make agentic work safe and legible. We iterate on all of it constantly. The throughput numbers don't come from dropping Claude Code on the team and walking away. They come from running agentic tooling on top of infrastructure purpose-built for this, and they require continuous investment in that infrastructure to sustain. Teams that skip the substrate will see an initial productivity bump, then watch it erode under the weight of the coordination problems the tooling creates.

A flat baseline for six months, one transition month, and a new plateau about 2.7x higher that has now held for two. The long arc, from 20 deployments a month across 33 engineers to 750 across 21, works out to around a 60x improvement in deployments per engineer. The recent leg of that arc happened over the course of one month.

Two caveats before that number does more work for me. First, deployment frequency is a proxy, and proxies drift. We count a deployment as a meaningful change shipped to production, and that definition hasn't moved across the arc. Second, deployment frequency without stability metrics is an incomplete picture. Our change failure rate and MTTR aren't worse than they were at the lower volume. That's what makes the headline number real.

The ceiling, and where it is

Two months at the new plateau isn't enough to call it a permanent ceiling. It is enough to say the system has re-equilibrated, and to ask where the new binding constraint is.

Our mean time from PR created to PR merged is about ninety minutes. That's fine. Better than fine. It's the kind of number that suggests review isn't the problem. The mean lies.

Our engineers work in stacks: small PRs layered on top of each other, each building on the last. When one PR in a five-deep stack takes four to six hours to merge, rather than the usual ninety minutes, everything stacked above it stalls. The engineer is either blocked or context-switching, and both hurt.

The mean hides this. Ninety minutes looks healthy. The tail is what actually bites. At 250 deployments a month, bad outliers were rare enough to be a nuisance and nothing more. At 750, we're drawing from the same distribution three times as often, and those outliers compound across every stack in flight. That's what sets the ceiling.
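
To make the compounding concrete, here's a back-of-the-envelope sketch. The per-PR tail probability and stack depth are assumptions for illustration, not our measured distribution:

```python
# Illustrative numbers only: an assumed tail probability, not measured data.
p_tail = 0.05      # assumed chance a given PR hits the multi-hour tail
stack_depth = 5    # a five-deep stack stalls if any layer below it does

p_stack_stall = 1 - (1 - p_tail) ** stack_depth
print(f"P(stack stalls): {p_stack_stall:.1%}")  # ~22.6%

# Tripling volume means drawing from the same distribution three times as often:
for monthly_prs in (250, 750):
    print(monthly_prs, "PRs/month ->", monthly_prs * p_tail, "expected tail events")
```

At a 5% per-PR tail rate, roughly one stack in four eats a multi-hour stall, and tripling volume triples the absolute number of stalls the team has to absorb.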

You'll be tempted to call this a reviewer-capacity problem. It isn't. Adding reviewers compresses the mean. It doesn't touch the tail. The tail comes from things headcount doesn't fix: a design disagreement that surfaces late in review, a flaky CI run, a merge conflict from something that landed first, a reviewer heads down elsewhere when your PR drops. This is a queueing theory problem, not a staffing problem.
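
That claim is easy to sanity-check with a toy simulation. A minimal sketch, assuming merge time decomposes into a reviewer wait that shrinks with headcount plus occasional structural delays that don't; the rates and magnitudes here are made up:

```python
# A toy queueing sketch, not our production data. Assumption: merge time is
# reviewer wait (scales down with headcount) plus occasional structural
# delays (late design disagreements, flaky CI, conflicts) that do not.
import random

def simulate(n_prs, reviewer_factor, seed=0):
    rng = random.Random(seed)
    times = []
    for _ in range(n_prs):
        wait = rng.expovariate(1 / 60) / reviewer_factor  # mean 60 min at 1x
        if rng.random() < 0.05:              # 5% of PRs hit a structural delay
            wait += rng.uniform(240, 360)    # 4-6 hours, headcount-proof
        times.append(wait)
    times.sort()
    return sum(times) / len(times), times[int(0.99 * len(times))]

for factor in (1, 2):
    mean, p99 = simulate(10_000, factor)
    print(f"{factor}x reviewers: mean {mean:4.0f} min, p99 {p99:4.0f} min")
```

Doubling reviewers roughly halves the mean; the p99 barely moves, because the worst cases were never waiting on a reviewer in the first place.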

The local optimization we're retiring is the one-commit-per-PR approach. It was a good rule at 250 deployments a month. Small blast radius, clean reverts, clean bisects. At 750, it multiplies the PR count, and therefore the tail exposure, without buying any more of what the rule was actually good for. Multiple commits per PR, with small commit boundaries, preserve the blast-radius properties.

The nervousness on the team is really about review quality, and they're not wrong to be nervous. A rubber-stamper rubber-stamps either way. A larger PR just gives them more surface area to not look at. The rule was a crutch for a review culture that needed to get stronger on its own merits: commit-level review, tighter static analysis, explicit rubrics for people who need them, and a gradual shift in what the review function is actually for in an agentic world. Keep the crutch, and the review culture never gets built.

The general principle, not the specific rule, is what I want you to take away. Local optimizations that were tuned for one arrival rate will break when the arrival rate changes by an order of magnitude. You have to be willing to retire them, even when they were the right answer a year ago.

The production side is about to move again

We're still going to be working in stacks. What's changing is who coordinates them. Right now, an engineer owns a stack end-to-end, making them the coordination point. They can run more than one stack at a time, but each additional stack adds context-switching overhead and more exposure to stalls. The shift we're making now is to have the arbiter coordinate the stack. That moves the coordination cost off the engineer, and running multiple stacks in parallel becomes the normal operating mode rather than a tax the engineer pays. The human function moves from shepherding one stack to arbitrating across several: judgment, prioritization, and exception handling. Some of the team are already working this way; others aren't.

Code production rate goes up again. Review, CI, integration, and deployment: every downstream station is tested at a higher arrival rate. The ceiling will move. We'll find out where by watching which tail gets fat first.

Every rule and every structural choice in an engineering organization was tuned for a specific arrival rate. Agentic development changes the arrival rate by an order of magnitude across multiple stages in months, not years. The rules don't fail gracefully. They fail at the station where the ratio breaks first. One-commit-per-PR failed at review. Something else will fail at whatever station breaks next. The discipline the moment calls for isn't any specific rule. It's the willingness to notice which rule is the current bottleneck and retire it on evidence.

Less prophetic than the original Tsunami. More operational. Six months in, operational is more useful.

The bottleneck leaves engineering

This is the finding the original Tsunami predicted, and it's arriving on schedule.

We're getting complaints from stakeholders that they can't produce enough work for us.

Product, design, and requirements definition are the upstream functions that generate the backlog engineering draws from, and they're now the binding constraint on throughput. This isn't a staffing problem you solve by hiring more PMs. It's structural. Product managers write specs one at a time. Designers own flows one at a time. Stakeholders formulate requests one at a time. Each of them is a personal-stack operator in the same sense that an engineer used to be.

No tool accelerates a product decision that a human is still formulating one at a time. Or more accurately, one can, but only if you apply the same pattern to those functions that you applied to engineering. That's a much larger re-plumbing than anyone signed up for.

This is where most agentic-development writing goes quiet, because the implications are uncomfortable. Engineering throughput used to be the thing the business worked around. If it's now the thing the business has to catch up to, then the operating model of the company has inverted, and almost no company is currently designed for that.

The framing that actually moves leadership on this is economic. "Stakeholders can't produce enough work" is the engineering version of the complaint. It gets nodded at in a staff meeting and forgotten by Friday. The P&L version lands differently: we're leaving dollars on the table because backlog formation rate is too low relative to execution capacity. Every month that gap holds, we miss experiments that never run, ship features a quarter later than we could have, and base decisions on evidence competitors have already moved past. None of that shows up on an engineering dashboard. All of it shows up on the P&L eventually, which is where the rest of the executive team is looking. If leadership doesn't feel the gap economically, it won't move on the gap. And if it doesn't move on the gap, the gap doesn't close.

Capacity has to flow somewhere

Much of our excess capacity right now is going back into engineering itself. We're using it on observability we never had, rails we never smoothed, and insight into what production is doing in a way we couldn't see before. It's also going into the tooling layer that makes the throughput sustainable in the first place: the next generation of our internal workflow tooling, more conventions codified into scaffolding, and the orchestration work that pushes us from human-coordinated stacks to arbiter-coordinated ones. That work was chronically underfunded when every engineer-hour carried an opportunity cost measured in features. Now it doesn't. The window to do that work is real, and it's finite.

The trap is that engineering-improving-engineering is a closed loop with no external forcing function. It feels productive because it is productive by engineering's own metrics. It can absorb unlimited capacity. If the rest of the organization doesn't catch up, engineering will keep reinvesting in itself, and the company ends up with an unbelievably powerful engine that isn't connected to the wheels. The failure mode is quiet. Nothing feels wrong from inside engineering while it's happening.

If non-engineering functions stay serialized while engineering deserializes, the company doesn't get faster. It gets misaligned. Engineering ships at a tempo the rest of the business can't consume. The gap shows up as frustration in both directions. The operating model quietly invalidates itself without anyone noticing the moment it happened. That's a board-level problem, not an engineering one.

The reinvestment window is a bridge, not a destination. Six months in, I don't think the rest of the organization is going to catch up on its own. Serialization points don't dissolve themselves. Every function currently rate-limited by a human formulating work one at a time will stay rate-limited until somebody does to that function what we did to engineering: build the tooling, change the operating model, and retrain the humans for the arbiter role.

That leaves two options, not three. Apply the same transformation outward, or accept the plateau. There's no third path where the gap closes quietly.

Exporting the pattern

We're already doing some of this, which is where the argument stops being theoretical.

A few years ago, our CRO (conversion rate optimization) practice ran about 50 tests a year. Through tooling that made it easier to scope, instrument, and launch tests, we 5x'd that. The constraint was cycle time per test. Our current target is to 10x that 5x, on the order of 2,500 tests a year, and we have the first pass on the tooling that gets us there. The shape is the same as the engineering story. A human used to be the serialization point for producing work-units. The tooling positions the human as an arbiter across many units in flight.

The second-order effects matter more than the first-order productivity gain. More tests in flight means more winners, which means more evidence, which means product and marketing formulate decisions with tighter feedback loops, which means the rate at which they produce work engineering can act on goes up. CRO tooling isn't only a CRO productivity story. It's one of the mechanisms by which the rest of the business catches up to the tempo engineering is now capable of.

Customer service and finance are next on the list for us. We haven't built the tooling yet, but both have the shape that makes the pattern work: high volume, repeatable work-units, clear correctness criteria, and judgment exercised against a stable rubric rather than invented fresh each time. The human role shifts to handling exceptions, tuning the rubric, and catching cases where agent confidence is miscalibrated. Same pattern, different work-units.
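
To make "same pattern, different work-units" concrete, here's a hypothetical sketch of the arbiter loop. Every name in it (WorkUnit, rubric_check, arbitrate) is illustrative, not something we've built:

```python
# A hypothetical sketch of the arbiter pattern, not code we run.
from dataclasses import dataclass

@dataclass
class WorkUnit:
    payload: str
    confidence: float  # agent's self-reported confidence in [0, 1]

def rubric_check(unit: WorkUnit) -> bool:
    # Stands in for the function's stable correctness criteria:
    # refund limits in finance, policy rules in customer service.
    return bool(unit.payload)

def arbitrate(units: list[WorkUnit], threshold: float = 0.9):
    """Route the bulk automatically; surface only exceptions to the human."""
    auto, exceptions = [], []
    for unit in units:
        if rubric_check(unit) and unit.confidence >= threshold:
            auto.append(unit)        # flows through without human touch
        else:
            exceptions.append(unit)  # judgment, rubric tuning, and
                                     # miscalibration catches happen here
    return auto, exceptions

auto, exceptions = arbitrate([
    WorkUnit("draft refund response", 0.97),
    WorkUnit("ambiguous chargeback case", 0.55),
])
print(len(auto), "auto-approved;", len(exceptions), "routed to the arbiter")
```

The human stops being the serialization point for every unit and becomes the exception handler for the few that fail the rubric or the confidence bar.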

So where does the arbiter pattern work, and where doesn't it? It works where the function produces instances of a known form. It works awkwardly or not at all where the function produces new forms: strategy, novel design, executive judgment, anything whose output is a frame rather than an instance within a frame. That second category doesn't get cheaper when you apply agentic tooling to it. It gets more valuable. Everything around it gets cheaper, and leverage concentrates on the work that still requires humans.

Here's a diagnostic to run on your own organization. Walk through your functions and ask whether each is producing instances of a known form, or producing new forms. The first is where your next round of throughput comes from. The second is where your senior talent should be concentrated.

The real fork

Every CTO reading this is about to face the same question, and there's no right answer that avoids it. Do you want to own the transformation outward, or not?

Extending the operating-model change into product, design, finance, customer service, and eventually the way decisions get made at the top of the house is where the leverage is. It's also territory that isn't clearly yours. It will create friction. It may not be welcomed by the people whose functions you're proposing to re-plumb. The mandate that was obvious inside engineering isn't obvious outside it, and the political cost of expanding it is real and will be charged to your account.

The alternative is to build the best engine you can inside the engineering box and let the plateau harden around you. Some people will make that choice. It's a defensible choice. It isn't the one I'm arguing for here.

I haven't fully resolved my own version of this question, and I suspect most CTOs reading this haven't either. What I'm certain of is that the third option, the one where the rest of the organization catches up on its own while engineering stays focused on engineering, isn't actually on the menu. Someone does the dissolving, or the plateau hardens. The only question is who.

What the update actually is

The original Tsunami warned that the flood was coming. It also predicted that the bottleneck would move upstream of engineering. Both are now true. What the original couldn't call, and what's clearest six months in, is the compression. The rate at which each successive ceiling arrives.

Agentic tooling dissolves the serialization point wherever it's applied, and the only stable end-state is one where the whole value stream, not just engineering, has been re-plumbed around that fact. Each ceiling yields to the next faster than most organizations can react. That's what feels different from every prior wave of engineering productivity. CI/CD took years to absorb. This is happening in months. The compression, not the volume, is the real tsunami.

Companies that apply agentic tooling only in engineering will plateau where we are right now, with stakeholders unable to feed the beast and engineering reinvesting in itself to stay busy. Companies that apply it end-to-end will unlock a different operating tempo, and the work of getting there will be mostly outside engineering. That's work most CTOs aren't set up to do.

Six months in, that's the situation. Not a prediction. A field report. The companies that come through the compression intact will treat every function as a potential serialization point and re-plumb. The ones that treat engineering as the only lever will find themselves in a corner where throughput keeps rising and outcomes don't. The next update will be about whichever joint bursts next.