Case study

AI incident response automation: an agent swarm that investigates before you wake

Hinge Health · 2025 · AI incident response

Every incident starts with the same expensive minutes

An alert fires at 3am. The on-call engineer wakes up, opens a laptop, and starts gathering: pull the logs, scan the dashboards, check what shipped recently, figure out which services are actually involved. None of that is a decision. It's the work you do before you're allowed to make one. And it looks the same in every incident, because the shape of an investigation barely changes — only the details do.

At Hinge Health, those minutes carry real weight. The platform serves millions of patients under HIPAA-grade constraints. When something breaks, the cost isn't an uptime stat — it's care delivery, and trust. There's a quieter cost too. Engineers who spend their nights on rote data-gathering get tired, and tired engineers start tuning alerts out. Alert fatigue isn't a comfort problem. It's how real incidents get missed.

So the problem wasn't "incidents take too long" in some general way. It was narrower and more fixable: the first phase of every incident — the investigation — was manual, repetitive, and identical in structure every time. That is exactly the profile of work a machine should be doing.

Why a swarm of narrow agents, not one clever one

There were cheaper-looking fixes. Better runbooks. More dashboards. Both decay: runbooks go stale the week after they're written, and a dashboard still needs a human awake and looking at it. The more tempting alternative was a single powerful agent — one model with access to everything, told to "investigate the incident." We deliberately didn't build that.

A monolithic agent is hard to trust and harder to fix. When its summary is wrong, you can't tell which part of its reasoning failed, so you can't repair it — you can only reprompt and hope. In a healthcare environment, that's not an acceptable failure mode. A confident wrong answer in an incident thread is worse than no answer at all, because it sends a half-awake responder down the wrong path while the clock runs.

So the bet was to split the investigation into narrow, verifiable jobs. One agent owns logs. One owns metrics. One owns recent deploys. One owns related services. No single agent reasons over the whole picture — an orchestrator merges their evidence into one summary. Each agent can be tested in isolation, which means when one misbehaves, you can find it and fix it. And the human boundary stays exactly where it should: the swarm does the gathering, the engineer does the deciding. Nothing about remediation is automated. In this environment, that restraint is the feature.
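The split described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the production code: the names AgentDomain, Alert, Evidence, Agent, investigate, and stubDeploysAgent are all assumptions made for the example, and nothing here uses the Mastra API. It covers only the happy path, since Promise.all rejects if any agent throws.

```typescript
// Hypothetical types for illustration; none of these names come from
// Mastra or from the system described in this case study.
type AgentDomain = "logs" | "metrics" | "deploys" | "related-services";

interface Alert {
  service: string;
  firedAt: string; // ISO timestamp
}

interface Evidence {
  domain: AgentDomain;
  findings: string[];
}

// A narrow agent owns exactly one domain and returns evidence only for it.
interface Agent {
  domain: AgentDomain;
  investigate(alert: Alert): Promise<Evidence>;
}

// The orchestrator fans out to every agent in parallel and collects the
// evidence. It does not reason over the incident itself.
async function investigate(alert: Alert, agents: Agent[]): Promise<Evidence[]> {
  return Promise.all(agents.map((agent) => agent.investigate(alert)));
}

// Because each agent is a small unit, it can be exercised alone with a stub.
const stubDeploysAgent: Agent = {
  domain: "deploys",
  investigate: async (alert) => ({
    domain: "deploys",
    findings: [`1 deploy to ${alert.service} in the last hour`],
  }),
};
```

Keeping the Agent interface this small is the point of the design: a misbehaving agent can be swapped for a stub or run against a recorded alert without touching the others.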

[Figure: incident flow. An alert fires in Datadog, the swarm dispatches three narrow agents in parallel (logs, metrics, changes), their evidence is synthesized, and a summary lands in the Slack incident thread.]
Alert to summarized investigation — before the engineer opens a laptop

What the on-call engineer actually sees

From the engineer's side, the experience is simple: the alert that woke you up already has a thread, and the thread already has a summary. Behind that:

  • The agents are orchestrated with Mastra, which gives the swarm a defined structure (which agents run, how they fan out in parallel, and how they converge on one result) instead of a pile of ad-hoc model calls.
  • They pull logs and metrics from Datadog, because that's where the operational truth already lives. The swarm reads the same sources an engineer would, not a parallel copy that can drift.
  • Every agent call is traced end-to-end with LangSmith — non-negotiable in a HIPAA-grade environment, and the thing that makes a bad summary debuggable in minutes instead of guessable in hours.
  • The findings land as a summarized post in the Slack incident thread — the place the on-call engineer was going anyway. No new tool to learn, no dashboard to remember at 3am.
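To make the tracing point concrete, here is a rough sketch of what "every agent call is traced" can mean at the code level. It deliberately does not use the LangSmith SDK: the Span shape, the in-memory spans array, and the traced wrapper are hypothetical stand-ins, assumed only for illustration of the idea that inputs, outputs, errors, and timings are recorded for each call.

```typescript
// Hypothetical in-memory tracer; a stand-in for a real tracing backend.
interface Span {
  agent: string;
  input: unknown;
  output?: unknown;
  error?: string;
  durationMs: number;
}

const spans: Span[] = [];

// Wraps any async agent call so that its input, its output or error, and
// its duration are recorded whether the call succeeds or fails.
async function traced<I, O>(
  agent: string,
  input: I,
  call: (input: I) => Promise<O>,
): Promise<O> {
  const start = Date.now();
  try {
    const output = await call(input);
    spans.push({ agent, input, output, durationMs: Date.now() - start });
    return output;
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    spans.push({ agent, input, error, durationMs: Date.now() - start });
    throw err; // tracing must not swallow the failure
  }
}
```

With a wrapper like this around each agent, a wrong summary can be traced back to the specific call whose input or output was off, rather than to the swarm as a whole.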

One design rule mattered more than any framework choice: agents fail honestly. If the log data for a service can't be retrieved, the summary says so. It does not stitch together a plausible story from whatever it did find. The on-call engineer gets an honest partial picture rather than a confident, complete-looking one, and that's what made the team willing to rely on it.
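One way to encode that rule is to make a failed retrieval a first-class value that the summary must render, rather than an exception that gets dropped. The sketch below assumes hypothetical names throughout (Outcome, Lookup, gather, renderSummary) and an invented summary format. It shows the pattern, not the actual implementation.

```typescript
// Hypothetical shapes for illustration only.
type Outcome =
  | { source: string; ok: true; findings: string[] }
  | { source: string; ok: false; reason: string };

interface Lookup {
  source: string; // e.g. "logs", "metrics"
  lookup: () => Promise<string[]>;
}

// Run one lookup and convert a failure into an explicit gap instead of
// letting it abort the investigation or disappear silently.
async function attempt({ source, lookup }: Lookup): Promise<Outcome> {
  try {
    return { source, ok: true, findings: await lookup() };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { source, ok: false, reason };
  }
}

// All lookups run in parallel; attempt() never rejects, so one failing
// source cannot hide the results of the others.
async function gather(lookups: Lookup[]): Promise<Outcome[]> {
  return Promise.all(lookups.map(attempt));
}

// Gaps are rendered as plainly as findings, so a partial picture reads
// as partial.
function renderSummary(outcomes: Outcome[]): string {
  return outcomes
    .map((o) =>
      o.ok
        ? `${o.source}: ${o.findings.join("; ") || "no findings"}`
        : `${o.source}: COULD NOT RETRIEVE (${o.reason})`,
    )
    .join("\n");
}
```

Because the failure case is part of the Outcome type, the renderer cannot compile without handling it, which keeps "say so when data is missing" from depending on anyone remembering to do it.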

Roughly 30% faster, and a calmer rotation

Incident response got roughly 30% faster. The gain is concentrated where you'd expect: the investigation phase — the gathering that used to consume the first and most expensive minutes — is largely done before a human arrives. The engineer starts at "here's what we know," not at a blank page.

The second outcome is harder to put a number on but easy to feel: reduced alert fatigue. When responding to an alert means reading a summary and making a call — instead of bracing for twenty minutes of dashboard archaeology — engineers stop dreading the pager. Production issues are easier to stay on top of because the boring first mile of every investigation is already paved.

If your on-call rotation pays the same tax

The specifics here are mine — healthcare, this stack, this team. But the shape of the problem is common, and most of what worked transfers:

  • Automate the gathering, keep the deciding. The investigation phase is mechanical and parallelizable. Judgment isn't. Drawing that line conservatively is what makes the system safe to run in a serious environment.
  • Narrow agents beat one clever one. Verifiability is worth more than capability. If you can't test a piece in isolation, you can't fix it in production.
  • Trace everything from day one. Observability is what turns "the AI was wrong" from an anecdote into a bug report.
  • Honest failure beats confident fabrication. Trust is the actual product. One fabricated summary costs more than fifty honest "I couldn't get this" admissions.
  • Deliver findings where responders already are. A new dashboard is a new thing to forget at 3am.

If your team is spending the first stretch of every incident reassembling the same context by hand, that cost is fixable — and the fix doesn't require betting your incident process on a black box. If you want to think through what this looks like on your stack, I'm happy to compare notes.