Essay

Everyone has agents now. Almost nobody can operate them.

June 2026

In 2026, building an agent takes an afternoon. Operating one takes a discipline most teams haven't built yet — and the gap between those two facts is where agent projects are quietly dying right now.

A year and a half ago I wrote about the demo-to-production gap: what changes when an LLM has to be reliable at 3am instead of impressive in a meeting. The conversation has moved since then. The frameworks got good. Mastra, LangGraph, and a dozen others made the building part genuinely easy. The question is no longer "can you build an agent?" It's "can you run one?", and the honest answer, at most companies shipping agents today, is no. Not because the agents are bad. Because the operations around them don't exist.

Building got easy. That's the trap.

When building was hard, the difficulty was a filter. The teams that got an agent working had usually thought hard about what it should do. Now the filter is gone. You can wire a model to your ticketing system, your codebase, and your production logs before lunch — and so can the intern, and so can the team two floors up who didn't tell anyone.

Here's the test. Your agent has been running for three weeks. Answer these without opening a notebook: What's your task success rate this week versus last? What did it cost yesterday? What's the worst single action it could take right now if it reasoned badly? If you can't answer all three, you don't operate your agent. You host it. There's a difference, and production will eventually explain it to you at a bad time.

Picture the system you actually have to run:

[Figure: the operations stack, bottom up. Agent code, built in an afternoon, is the smallest layer, wrapped by increasingly larger layers of tracing, evals as regression tests, cost ceilings, and, largest of all, blast radius and permissions.]
The agent is the smallest layer of the system you run

Traces tell you what happened. Not whether you're drifting.

Most teams equate agent observability with tracing — every call, every tool invocation, every intermediate output, logged and queryable. Traces are necessary. I've argued for them since day one of building Transformers, the incident-response agent swarm at Hinge Health. But traces are a microscope. They answer "what happened in this run?" They cannot answer "is the system getting worse?"

Agents degrade without a deploy. A model version updates upstream. A tool's API starts returning a slightly different shape. The tickets your agent triages shift in character as your product changes. Every individual trace looks fine; the aggregate is sliding. So you need a second layer above the traces: success rate over time, tool-failure rate by tool, output length and cost distributions, escalation frequency. Boring metrics, on a dashboard, with alerts — exactly what you'd build for any other service. The fact that the component is probabilistic makes the trend lines more important, not less, because no single trace will ever show you the drift.
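A minimal sketch of that second layer, assuming each run has already been reduced from its trace to a small record. `RunRecord`, `weekly_drift`, and the 5-point `max_drop` threshold are illustrative names and values, not taken from any particular tracing tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RunRecord:
    """One completed agent run, summarised from its trace (hypothetical shape)."""
    finished_at: datetime
    succeeded: bool


def success_rate(runs):
    """Fraction of runs that succeeded, or None if there were no runs."""
    if not runs:
        return None
    return sum(r.succeeded for r in runs) / len(runs)


def weekly_drift(runs, now, max_drop=0.05):
    """Compare the last 7 days with the 7 days before that.

    Returns (this_week, last_week, alert). alert is True when the success
    rate fell by more than max_drop, an assumed threshold to tune per system.
    """
    week = timedelta(days=7)
    this_week = [r for r in runs if now - week <= r.finished_at < now]
    last_week = [r for r in runs if now - 2 * week <= r.finished_at < now - week]
    current, previous = success_rate(this_week), success_rate(last_week)
    if current is None or previous is None:
        # Not enough data to compare; stay quiet rather than guess.
        return current, previous, False
    return current, previous, (previous - current) > max_drop
```

The same windowed comparison would apply to tool-failure rate, cost per run, or escalation frequency. The point is that the alert fires on the aggregate, which no individual trace can show.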

Evals are regression tests. Treat a prompt change like a deploy.

Every team has a story like this one: someone tweaks a system prompt to fix one annoying behavior, ships it, and discovers a week later that the fix quietly broke three other behaviors nobody thought to check. In traditional software we solved this decades ago — it's called a regression suite, and you don't merge without it green.

Agents need the same thing, and in 2026 there's no excuse not to have it: a versioned set of real scenarios — including the weird ones that burned you before — run against every prompt change, every model upgrade, every tool modification, with scored outputs and a threshold that blocks the merge. Building this is genuinely tedious. Collecting good cases is tedious. Deciding what "correct" means for a fuzzy output is tedious. That tedium is the moat. Teams that have evals iterate fast because every change is checked against everything that ever went wrong. Teams that don't have evals iterate scared — every prompt edit is a bet, and eventually they stop making changes at all. That's how an agent becomes legacy software eight months after it shipped.
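As a rough sketch of what such a gate might look like, assuming the agent can be called as a plain function from prompt to text. `Scenario`, `score`, `run_suite`, the substring scorer, and the 0.9 threshold are all hypothetical; real suites usually need a richer definition of "correct" than substring matching:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    prompt: str
    must_contain: tuple  # substrings a correct answer is expected to include


def score(output: str, scenario: Scenario) -> float:
    """Fraction of required substrings present (a deliberately crude scorer)."""
    if not scenario.must_contain:
        return 1.0
    hits = sum(s.lower() in output.lower() for s in scenario.must_contain)
    return hits / len(scenario.must_contain)


def run_suite(agent: Callable[[str], str], scenarios, threshold: float = 0.9):
    """Run every scenario; return (passed, mean_score, per_scenario_scores)."""
    if not scenarios:
        raise ValueError("an empty eval suite cannot gate a merge")
    scores = {s.name: score(agent(s.prompt), s) for s in scenarios}
    mean = sum(scores.values()) / len(scores)
    return mean >= threshold, mean, scores


if __name__ == "__main__":
    suite = [Scenario("db-timeout", "Why did checkout fail?", ("db", "timeout"))]
    stub_agent = lambda prompt: "Checkout failed on a DB timeout."  # stand-in agent
    passed, mean, _ = run_suite(stub_agent, suite)
    # In CI, a non-zero exit code is what actually blocks the merge.
    raise SystemExit(0 if passed else 1)
```

The scenario list is the part worth versioning alongside the prompt: every incident that burned you becomes one more entry, and the per-scenario scores show which behavior a change broke.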

Cost needs a ceiling, not a dashboard

An agent's cost is not a function of traffic. It's a function of how confused the agent got. A run that resolves in four steps and a run that loops through forty tool calls retrying a failing query are the same request from the outside, yet one can cost a hundred times the other, because each extra step typically re-sends a context that keeps growing. Dashboards show you this damage after the fact. Ceilings prevent it: a budget per task in steps, tokens, and dollars, after which the agent stops and hands the partial result to a human, plus a circuit breaker on aggregate hourly spend.

The pushback is always "but then the agent fails the task." Yes. A bounded failure that says "I couldn't finish, here's where I got to" is a feature. An unbounded retry loop at 3am is an incident — and the irony of your incident-response tooling causing the incident is not one you want to explain in the postmortem.
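The ceiling and the breaker described above can be sketched in a few dozen lines. This assumes an agent loop that can be driven one step at a time. The `step` callable, the default limits, and the class names are hypothetical, and timestamps are passed in as plain seconds to keep the example deterministic:

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskBudget:
    max_steps: int = 15        # assumed defaults; tune per task type
    max_tokens: int = 200_000
    max_usd: float = 2.00
    steps: int = 0
    tokens: int = 0
    usd: float = 0.0

    def charge(self, tokens: int, usd: float) -> Optional[str]:
        """Record one step's usage; return the name of a reached ceiling, if any."""
        self.steps += 1
        self.tokens += tokens
        self.usd += usd
        for name, used, limit in (
            ("steps", self.steps, self.max_steps),
            ("tokens", self.tokens, self.max_tokens),
            ("usd", self.usd, self.max_usd),
        ):
            if used >= limit:
                return name
        return None


def run_task(step, budget: TaskBudget) -> dict:
    """Call step() until it reports done or a ceiling is reached.

    step() is a hypothetical callable returning (done, partial, tokens, usd).
    """
    while True:
        done, partial, tokens, usd = step()
        tripped = budget.charge(tokens, usd)
        if done:
            return {"status": "done", "result": partial}
        if tripped:
            # Bounded failure: stop and hand the partial result to a human.
            return {"status": "handed_off", "partial": partial, "tripped": tripped}


class HourlySpendBreaker:
    """Opens when aggregate spend in the trailing hour exceeds a ceiling."""

    def __init__(self, max_usd_per_hour: float):
        self.max_usd_per_hour = max_usd_per_hour
        self._events = deque()  # (timestamp_seconds, usd), oldest first

    def record(self, now: float, usd: float) -> None:
        self._events.append((now, usd))

    def is_open(self, now: float) -> bool:
        # Drop spend that has aged out of the one-hour window.
        while self._events and self._events[0][0] <= now - 3600:
            self._events.popleft()
        return sum(usd for _, usd in self._events) > self.max_usd_per_hour
```

A scheduler would check `is_open` before starting any new task and refuse while it returns True. The per-task budget bounds one confused run, and the breaker bounds many of them at once.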

The blast-radius question: what is the agent allowed to do?

Before any capability question, ask the permission question: if this agent reasons as badly as possible, what's the worst thing it can actually do? Not the worst it's prompted to do — the worst its credentials permit. "The prompt says be careful" is not an answer. Prompts are suggestions. IAM scopes, read-only credentials, and allowlisted actions are answers.

We drew this line deliberately with Transformers: the swarm reads logs, metrics, and deploy history, and writes summaries for the on-call engineer. It cannot restart services, roll back deploys, or modify infrastructure — not because the model couldn't be prompted to do it usefully, but because a wrong diagnosis with write access is a second incident layered onto the first. Write actions earn their way in one at a time: reversible first, human approval gates on anything that isn't, and an audit trail for all of it. The teams that skipped this conversation in 2025 are the cautionary tales of 2026.
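To make the shape of that policy concrete, here is a minimal sketch of a deny-by-default action gate with an approval step and an audit trail. The action names and the `Gate` class are hypothetical and are not the Transformers configuration. An in-process check like this complements scoped credentials rather than replacing them, since the credentials are what actually bound the blast radius:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Policy(Enum):
    ALLOW = "allow"                  # read-only or safely reversible
    NEEDS_APPROVAL = "needs_approval"  # a human must sign off first


# Hypothetical action names. Anything not listed is denied by default.
ALLOWLIST = {
    "read_logs": Policy.ALLOW,
    "read_metrics": Policy.ALLOW,
    "read_deploy_history": Policy.ALLOW,
    "post_summary": Policy.ALLOW,
    "page_secondary_oncall": Policy.NEEDS_APPROVAL,
}


@dataclass
class Gate:
    audit_log: list = field(default_factory=list)

    def authorize(self, action: str, approved_by: Optional[str] = None) -> bool:
        """Decide whether an action may run, and record the decision either way."""
        policy = ALLOWLIST.get(action)
        if policy is Policy.ALLOW:
            allowed = True
        elif policy is Policy.NEEDS_APPROVAL:
            allowed = approved_by is not None
        else:
            allowed = False  # unknown action: deny
        self.audit_log.append(
            {"action": action, "approved_by": approved_by, "allowed": allowed}
        )
        return allowed
```

Adding a write action then becomes an explicit, reviewable change to the allowlist. No prompt wording can reach an action that isn't listed.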

When a swarm beats a monolith — and when it's just distributed debugging pain

The argument for multiple agents is the argument for any decomposition: small, verifiable scope. In Transformers, each agent owns one slice — logs, metrics, recent deploys — which means each can be tested, eval'd, and fixed in isolation. When the incident summary is wrong, the trace shows which agent produced the bad slice. That's the operational payoff: a swarm of narrow agents is diagnosable.
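One way to picture that diagnosability, as a sketch with stand-in agents rather than the actual Transformers code: each narrow agent is a function over its own slice, and every finding carries the name of the agent that produced it. `Finding`, `run_swarm`, and `render_summary` are illustrative names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str  # which narrow agent produced this slice
    text: str


def run_swarm(agents: dict, incident_id: str) -> list:
    """Run each narrow agent independently and keep provenance on its output."""
    findings = []
    for name, agent in agents.items():
        try:
            findings.append(Finding(name, agent(incident_id)))
        except Exception as exc:
            # One failing slice should not hide the others.
            findings.append(Finding(name, f"[failed: {exc}]"))
    return findings


def render_summary(findings) -> str:
    """Label every line with its source so a bad slice is traceable."""
    return "\n".join(f"[{f.agent}] {f.text}" for f in findings)
```

Because each agent is just a callable over one slice, it can be swapped for a stub in tests and given its own eval suite without touching the others.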

But decomposition has a price, and in 2026 too many teams are paying it for nothing. Five agents passing context to each other is five places to lose information, five prompts to keep coherent, five eval suites to maintain — and if your agents are chatty with each other, you've reinvented the distributed monolith with probabilistic RPC. The test is the same one you'd apply to microservices: split when the pieces have genuinely independent responsibilities you'll test and evolve separately. If you're splitting because the architecture diagram looks more impressive, you're not buying capability. You're buying debugging pain on an installment plan.

The discipline is the product now

The 2024 question was whether the model could do the job. It mostly can. The 2026 question is whether your organization can run the system around it — see it drifting, test its changes, cap its costs, bound its blast radius. Those aren't AI skills. They're operations skills, the same ones that separate teams who run software from teams who merely write it.

Which is the reframe worth sitting with: the agent capability race is effectively over for most use cases — everyone's agents can do roughly the same things. Operating them is the competition now. And operations, unlike a framework, can't be installed. It has to be built, the slow way, by your team. Start before the incident does.