Your AI Agent Just Fixed Production at 3am. Now What?

The first companies deploying autonomous SRE agents are discovering that the hard part isn't the fix. It's everything that comes after.

James Roycroft-Davis · 4 min read · For operators

Port had an incident two weeks ago. Three different teams got paged. Each one investigated the same problem in parallel, burning nine engineer-hours before anyone realized they were duplicating work.

That's the before picture. The after picture: Port built an AI agent that would have diagnosed the issue in 12 minutes, identified the root cause, and prevented the triple-team confusion. But here's what they learned next. The agent fixing the problem was just the beginning.

The Morning After Problem

StackGen's Arul Jegadish Francis calls it the "3am paradox." Your AI SRE wakes up at 2:47am, diagnoses a misconfigured Horizontal Pod Autoscaler that's starving your service, fixes it, and goes back to sleep. You wake up at 7am to find production is fine. The incident is resolved. The metrics look good.

Now you have questions. What exactly did the agent change? Why did it pick that fix over others? Should you be worried it'll make the same fix again tonight?

"The agent returned HTTP 200. The LLM call succeeded. The tool invocation completed," notes Zylos Research in their trace-driven debugging paper. "And yet the output is wrong."

This is the wall everyone hits. Your agent works. It fixes things. But you need more than fixes. You need understanding, accountability, and confidence that tomorrow's fix won't break something else.

Building the Post-Fix Playbook

The most mature teams are developing what AgentMode AI calls the "inverted runbook." Traditional SRE runbooks assume you control the actor: a deployment, a database, a service. Agent runbooks assume the opposite. The actor is a system that took an action you didn't directly authorize, against production, possibly while you were asleep.

Here's what the leaders are implementing:

Immediate Trace Capture: Every decision point gets logged. Not just what the agent did, but what it considered and rejected. Antigravity Lab's framework captures the full decision tree: "detected anomaly X, considered fixes A/B/C, selected B because Y, rejected A because Z."
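To make that concrete, here is a minimal Python sketch of what a single decision record could look like. The class and field names (`DecisionRecord`, `RejectedOption`, and so on) are assumptions for illustration, not Antigravity Lab's actual schema, and the HPA details are invented to match the 3am scenario above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RejectedOption:
    fix: str
    reason: str


@dataclass
class DecisionRecord:
    """One decision point: what was seen, what was weighed, what was chosen."""
    anomaly: str
    considered: list[str]
    selected: str
    selected_reason: str
    rejected: list[RejectedOption] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() recurses into the nested RejectedOption entries.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical record for the misconfigured-HPA incident.
record = DecisionRecord(
    anomaly="HPA pinned at maxReplicas while CPU stays saturated",
    considered=["raise maxReplicas", "restart pods", "roll back deployment"],
    selected="raise maxReplicas",
    selected_reason="pods are healthy but capped at the replica ceiling",
    rejected=[
        RejectedOption("restart pods", "no crash loops or memory growth seen"),
        RejectedOption("roll back deployment", "no recent deploy to revert"),
    ],
)
print(record.to_json())
```

Writing the rejected options next to the selected one is the point: the 7am reader can see why the agent did not restart the pods without having to reconstruct it from raw logs.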

Automatic Rollback Triggers: The agent doesn't just fix and forget. It sets up monitoring on its own changes. If metrics degrade within a time window, it can reverse its own fix. More importantly, it alerts humans when it does.
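One way a rollback trigger could be wired up, sketched in Python under some assumptions: the metric arrives as a finite series of samples covering the watch window, "degraded" means several consecutive samples above a threshold, and `rollback` and `notify` are callables you supply. None of these names come from a specific tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class WatchResult:
    rolled_back: bool
    samples_seen: int


def watch_fix(
    samples: Iterable[float],
    threshold: float,
    rollback: Callable[[], None],
    notify: Callable[[str], None],
    max_breaches: int = 3,
) -> WatchResult:
    """Revert the fix if the metric stays above threshold for max_breaches samples in a row."""
    breaches = 0
    seen = 0
    for value in samples:  # the iterable is the watch window
        seen += 1
        breaches = breaches + 1 if value > threshold else 0
        if breaches >= max_breaches:
            rollback()
            # Always tell a human when the agent undoes its own change.
            notify(f"Rolled back: {breaches} consecutive samples above {threshold}")
            return WatchResult(rolled_back=True, samples_seen=seen)
    return WatchResult(rolled_back=False, samples_seen=seen)


# Hypothetical usage: p99 latency in seconds, sampled after the fix.
alerts: list[str] = []
reverted: list[str] = []
result = watch_fix(
    samples=[0.8, 2.5, 2.7, 3.1, 0.9],
    threshold=2.0,
    rollback=lambda: reverted.append("hpa-patch"),
    notify=alerts.append,
)
print(result, alerts)
```

Requiring consecutive breaches rather than a single bad sample is a design choice to avoid reverting on one noisy data point. A real deployment would tune that per metric.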

Morning Briefing Mode: Instead of waking up wondering what happened, you get a summary. What broke, what the agent tried, what worked, what side effects to watch for. One company calls it their "agent shift handoff."
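A briefing like that can be little more than a formatter over the night's incident records. The sketch below is one possible shape, assuming a hypothetical `IncidentSummary` structure with the four fields named above. The layout of the output is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentSummary:
    what_broke: str
    attempts: list[str]
    what_worked: str
    watch_for: list[str] = field(default_factory=list)


def render_briefing(incidents: list[IncidentSummary]) -> str:
    """Turn overnight incident records into a plain-text handoff note."""
    if not incidents:
        return "No agent interventions overnight."
    lines = [f"Agent shift handoff: {len(incidents)} intervention(s)"]
    for i, inc in enumerate(incidents, start=1):
        lines.append(f"{i}. What broke: {inc.what_broke}")
        lines.append(f"   Tried: {'; '.join(inc.attempts)}")
        lines.append(f"   Worked: {inc.what_worked}")
        for item in inc.watch_for:
            lines.append(f"   Watch: {item}")
    return "\n".join(lines)


# Hypothetical overnight record for the HPA scenario.
briefing = render_briefing([
    IncidentSummary(
        what_broke="HPA ceiling starved the service of replicas",
        attempts=["raise maxReplicas"],
        what_worked="raise maxReplicas",
        watch_for=["node pool headroom if load keeps climbing"],
    )
])
print(briefing)
```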

Regression Test Generation: Zylos Research's approach: every agent-resolved incident becomes a test case. The agent's fix gets validated. Its reasoning gets checked. The same failure scenario gets simulated to ensure the fix is repeatable.
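As a rough illustration of the incident-to-test idea (not Zylos Research's actual tooling), the Python sketch below freezes the state the agent observed and the action it took, then replays that state through the decision function. `IncidentCase`, `replay`, and `toy_decide` are all hypothetical names, and `toy_decide` is a stand-in for a real agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IncidentCase:
    name: str
    observed_state: dict
    expected_action: str


def replay(case: IncidentCase, decide: Callable[[dict], str]) -> bool:
    """Feed the recorded state back to the decision function and compare."""
    return decide(case.observed_state) == case.expected_action


def toy_decide(state: dict) -> str:
    # Stand-in for the real agent; this rule is invented for illustration.
    if state["replicas"] >= state["max_replicas"] and state["cpu_pct"] > 90:
        return "raise_max_replicas"
    return "no_action"


# Captured from the (hypothetical) HPA incident.
hpa_case = IncidentCase(
    name="hpa-ceiling-starves-service",
    observed_state={"replicas": 4, "max_replicas": 4, "cpu_pct": 97},
    expected_action="raise_max_replicas",
)


def test_hpa_incident_is_repeatable():
    # Replay several times: the fix should be repeatable, not a one-off.
    assert all(replay(hpa_case, toy_decide) for _ in range(5))
```

With a real LLM-backed agent the decision is not deterministic, so the repeated replay matters more than it does for this toy rule. A test runner such as pytest would pick up the `test_` function as written.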

The Trust Gradient

Not every fix gets the same treatment. Teams are developing what amounts to a trust gradient for their agents.

  • Level 1: Alert and recommend. The agent diagnoses but doesn't touch anything.
  • Level 2: Fix with approval. The agent proposes a fix and waits for human confirmation.
  • Level 3: Fix and inform. The agent acts, then immediately notifies.
  • Level 4: Fix and report. The agent acts and includes it in a daily summary.

Most teams start everything at Level 1. As specific scenarios prove reliable, they graduate up the trust ladder. A simple container restart might reach Level 4 quickly. A database migration stays at Level 2 forever.
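The ladder can be expressed as a small policy object. The Python sketch below is one possible encoding, with several assumptions that are not in any of the sources: each action type has a fixed ceiling, promotion happens after a streak of clean outcomes, and a single failure drops the action back to Level 1.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    ALERT_AND_RECOMMEND = 1
    FIX_WITH_APPROVAL = 2
    FIX_AND_INFORM = 3
    FIX_AND_REPORT = 4


# Hypothetical ceilings: how far each action type may ever climb.
CEILINGS = {
    "container_restart": TrustLevel.FIX_AND_REPORT,
    "database_migration": TrustLevel.FIX_WITH_APPROVAL,
}


class TrustPolicy:
    def __init__(self, ceilings: dict, promote_after: int = 5):
        self.ceilings = ceilings
        self.promote_after = promote_after
        self.levels: dict = {}
        self.streak: dict = {}

    def level(self, action: str) -> TrustLevel:
        # Everything starts at Level 1.
        return self.levels.get(action, TrustLevel.ALERT_AND_RECOMMEND)

    def record_outcome(self, action: str, success: bool) -> None:
        if not success:
            # Assumed demotion rule: one bad outcome resets trust.
            self.streak[action] = 0
            self.levels[action] = TrustLevel.ALERT_AND_RECOMMEND
            return
        self.streak[action] = self.streak.get(action, 0) + 1
        ceiling = self.ceilings.get(action, TrustLevel.ALERT_AND_RECOMMEND)
        if self.streak[action] >= self.promote_after and self.level(action) < ceiling:
            self.levels[action] = TrustLevel(self.level(action) + 1)
            self.streak[action] = 0

    def may_act_without_approval(self, action: str) -> bool:
        return self.level(action) >= TrustLevel.FIX_AND_INFORM


policy = TrustPolicy(CEILINGS, promote_after=2)
for _ in range(6):
    policy.record_outcome("container_restart", success=True)
    policy.record_outcome("database_migration", success=True)
print(policy.level("container_restart").name)
print(policy.level("database_migration").name)
```

In this run the container restart climbs to Level 4 while the database migration stops at Level 2 no matter how many clean outcomes it accumulates, which mirrors the split described above. Action types missing from `CEILINGS` never leave Level 1.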

What Breaks Next

The early adopters are finding new failure modes. An agent that fixes a memory leak by restarting pods might mask a deeper issue. An agent that rolls back a bad deployment might also roll back the fix that was bundled with it.

"The agent selected the wrong tool at step 3, its arguments were subtly malformed, and every subsequent step built on corrupted context," Zylos researchers found in one post-mortem.

The solution isn't to restrict the agents. It's to build better observability into their decision-making. When your agent makes 50 micro-decisions to resolve an incident, you need to be able to replay, understand, and verify each one.

The New SRE Workflow

The job isn't going away. It's shifting. Instead of fixing incidents at 3am, SREs are:

  • Reviewing agent decisions and improving their reasoning
  • Building better rollback mechanisms for agent actions
  • Creating test scenarios from real incidents
  • Setting trust levels for different types of interventions
  • Teaching agents about system dependencies they missed

One SRE lead put it simply: "I used to fix problems. Now I fix the thing that fixes problems."

The tools are emerging. The patterns are stabilizing. The early adopters have moved past asking "can an agent fix this?" to "what happens after it does?" That's where the real work begins.

Sources

  1. The Agent Incident Runbook — detect, contain, roll back, post-mortem
  2. Designing Production Incident Runbooks for Antigravity Agents: A Practical Framework from Detection to Recovery | Antigravity Lab
  3. How to Automate Incident RCA with AI SREs
  4. How An Incident Agent Would Handle A Port Incident
  5. Trace-Driven Debugging for AI Agent Failures: From Production Incident to Regression Test | Zylos Research
