Essay

Making AI actually useful in production

February 2025

The demo always works. You assemble the right context, write the right prompt, stream the output to a terminal, and it looks like magic. That's where most AI projects live — and die.

Getting an LLM to do something impressive once is a solved problem. Getting it to do something useful reliably, at 3am, when your team is asleep and the stakes are real — that's a different problem entirely. I learned this building Transformers, the incident-response agent swarm at Hinge Health. The model was never the hard part.

The gap is everything around the model

The gap isn't model capability; the model is usually good enough. It's everything else: context that arrives incomplete, a downstream tool that times out, a confident answer that's wrong, an invocation bill ten times what your spreadsheet projected.

The demo-production gap: in production, the model is the smallest box in the system.

Production isn't a stress test of the happy path. It's a relentless exploration of the failure modes you didn't design for. A bug in a function is deterministic, so you can reproduce it and write a test for it. A hallucination isn't, so you build for observation instead.

What "useful" actually means

Before any AI feature, ask one question: what does useful mean here, specifically? Not impressive. Not clever. Useful — to a specific person, in a specific moment, in a way they'd notice if it broke.

For Transformers, that person is the on-call engineer. Three dimensions matter:

Reliability. The bar depends on stakes: a writing assistant that misfires costs a shrug. An incident summary that misfires at 3am costs trust you won't earn back.
Observability. You can see what the system did, why, and where it went wrong. Without this, debugging AI is archaeology.
Cost discipline. The unit economics work at the scale you actually run. A per-call cost that's fine at 100 invocations a day can be catastrophic at 10,000.
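To make the cost point concrete, here is a back-of-the-envelope sketch. The per-invocation cost and the 30-day month are made-up assumptions for illustration, not figures from Transformers; only the two daily volumes come from the text above.

```python
# Hypothetical per-invocation cost; substitute your own measured number.
COST_PER_INVOCATION_USD = 0.05  # assumption for illustration only


def monthly_cost(
    invocations_per_day: int,
    cost_per_invocation: float = COST_PER_INVOCATION_USD,
    days: int = 30,
) -> float:
    """Projected monthly spend for a given daily invocation volume."""
    return invocations_per_day * cost_per_invocation * days


for volume in (100, 10_000):
    print(f"{volume:>6} invocations/day -> ${monthly_cost(volume):,.2f}/month")
```

The per-call price never changes, but the bill scales linearly with volume, so a number nobody notices at pilot scale becomes a line item someone has to defend.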

Patterns that held up

Narrow, verifiable scope

The worst AI features try to do too much; the best do one verifiable thing. Each Transformers agent owns one slice of the stack (Datadog logs, metrics, recent deploys, related services), and only the orchestrator reasons across the whole picture. That means you can test each agent in isolation and fix it when it breaks.
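A minimal sketch of what narrow scope buys you, in plain Python rather than any agent framework. Every name here (`Finding`, `deploy_agent`, `orchestrate`, `fake_fetch`) is hypothetical, and the orchestrator is a placeholder that only concatenates findings; in a real swarm that step would be a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """One agent's report about its own slice of the stack."""
    source: str
    summary: str


# Hypothetical agent: it looks at recent deploys and nothing else.
def deploy_agent(fetch_deploys: Callable[[str], list[str]], service: str) -> Finding:
    deploys = fetch_deploys(service)
    if not deploys:
        return Finding("deploys", f"No recent deploys for {service}.")
    return Finding(
        "deploys",
        f"{len(deploys)} recent deploy(s) for {service}; latest: {deploys[-1]}.",
    )


# Placeholder orchestrator: the only place that sees every finding.
def orchestrate(findings: list[Finding]) -> str:
    return "\n".join(f"[{f.source}] {f.summary}" for f in findings)


# The data source is injected, so the agent can be tested with a stub
# and no network access.
def fake_fetch(service: str) -> list[str]:
    return ["v41", "v42"]


finding = deploy_agent(fake_fetch, "checkout")
assert finding.summary == "2 recent deploy(s) for checkout; latest: v42."
print(orchestrate([finding]))
```

Because each agent takes its data source as a parameter and returns one small, typed result, a failing agent can be reproduced with a stub instead of a live incident.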

Tracing from day one

We built the swarm on Mastra and ran LangSmith traces from the first week. Tracing was how we understood the system, not an afterthought for when something broke. Every agent call, every tool invocation, and every intermediate result was logged and queryable. When a summary came out wrong, the trace showed exactly which agent produced which output, so we could pinpoint the failure in minutes. Without traces, you're guessing.
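The idea is independent of any particular tracing product. As a rough illustration, here is a standard-library stand-in that records every call to a wrapped function. It is not the LangSmith or Mastra API; `TRACE`, `traced`, and `metrics_agent` are hypothetical names, and the in-memory list stands in for a real trace store.

```python
import functools
import time
from typing import Any, Callable

TRACE: list[dict[str, Any]] = []  # in-memory stand-in for a real trace store


def traced(step: str) -> Callable:
    """Record inputs, output or error, and duration of each call to the wrapped function."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record: dict[str, Any] = {"step": step, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                record["output"] = fn(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                record["seconds"] = time.perf_counter() - start
                TRACE.append(record)
        return wrapper
    return decorator


@traced("metrics_agent")  # hypothetical agent name
def metrics_agent(service: str) -> str:
    return f"p99 latency elevated on {service}"


metrics_agent("checkout")

# "Which agent produced which output?" becomes a query over the trace.
print([(r["step"], r["output"]) for r in TRACE if "output" in r])
```

The point of the sketch is the shape of the record: the step name, what went in, what came out, and how long it took, captured even when the call raises.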

Graceful failure over false confidence

The most dangerous AI system is one that fails silently. If a tool call times out, you want the agent to say "I couldn't get the log data for this service" — not stitch a plausible summary from partial data. We built explicit fallbacks at every tool call, so the on-call engineer gets an honest picture, not a confident hallucination.
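One way to sketch an explicit fallback, assuming tools signal failure by raising `TimeoutError` or `ConnectionError`. The names `ToolResult`, `call_tool`, `summarize`, and `slow_logs` are hypothetical, and the deploy message is invented sample data; this shows the pattern, not the Transformers implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: Optional[str] = None
    gap: Optional[str] = None  # human-readable statement of what is missing


def call_tool(name: str, fn: Callable[[], str]) -> ToolResult:
    """Run a tool; on failure, return an explicit gap instead of raising or guessing."""
    try:
        return ToolResult(ok=True, data=fn())
    except (TimeoutError, ConnectionError) as exc:
        return ToolResult(
            ok=False,
            gap=f"I couldn't get the {name} ({type(exc).__name__}).",
        )


def summarize(results: list[ToolResult]) -> str:
    """List what is known, then every gap, so partial data is never shown as complete."""
    known = [r.data for r in results if r.ok and r.data is not None]
    gaps = [f"MISSING: {r.gap}" for r in results if not r.ok]
    return "\n".join(known + gaps)


def slow_logs() -> str:  # hypothetical tool that always times out
    raise TimeoutError("log query exceeded deadline")


report = summarize([
    call_tool("deploy history", lambda: "Deploy v42 went out 20 minutes before the alert."),
    call_tool("log data for this service", slow_logs),
])
print(report)
```

The design choice is that a failed tool call produces a first-class value the summary has to account for, rather than an empty string the model can quietly write around.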

The lesson I keep relearning

AI in production isn't a machine learning problem. It's a systems design problem. The model is a component — a powerful, weird, probabilistic one — but still just a component. It needs what every critical component needs: defined failure modes, observability, graceful degradation, and a clear answer to "what does success look like and how would we know?"

Get that right, and the model will surprise you with what it can do. Get it wrong, and you'll spend your nights putting out fires the model lit.