We’re experimenting with AI-assisted DevOps incident recovery — would you trust this in production?

by•4mo ago

We’re building Unideploy, a DevOps automation platform that integrates directly with Claude / ChatGPT via MCP — no separate UI, no new dashboards.

The idea we’re exploring now is AI-assisted incident recovery:

Instead of jumping between CloudWatch, kubectl, CI/CD logs, and Slack during an incident, you ask the AI:

“Production API latency is high. What changed and what’s the safest way to fix it?”

Behind the scenes:

The AI gets real metrics, logs, and recent change history
It does not execute anything on its own
It proposes safe recovery options (rollback, scale, restart, config revert)
Each option includes risk, blast radius, and cost impact
A human explicitly approves before anything runs
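
To make the "propose, then approve" flow above concrete, here is a minimal sketch of how an option and its approval gate could be modelled. Every name in it (`RecoveryOption`, `Risk`, `execute`, `ApprovalRequired`) is hypothetical and chosen for illustration; it is not Unideploy's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class RecoveryOption:
    """One proposed fix, as the AI might present it (hypothetical shape)."""
    action: str        # e.g. "rollback", "scale", "restart", "config revert"
    risk: Risk
    blast_radius: str  # which services or users the action touches
    cost_impact: str   # rough cost note, e.g. "none" or "+2 instances"


class ApprovalRequired(Exception):
    """Raised when an option is run without human sign-off."""


def execute(option: RecoveryOption,
            approved_by: Optional[str],
            runner: Callable[[RecoveryOption], str]) -> str:
    # The AI only proposes; nothing runs until a named human approves.
    if not approved_by:
        raise ApprovalRequired(f"{option.action} needs explicit approval")
    return runner(option)
```

The point of the sketch is that the approval check sits in the execution path itself, so a proposal cannot run by accident just because it was generated.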

The goal is not “AI agents replacing DevOps”, but:
👉 Reducing decision stress during incidents
👉 Making production changes safer
👉 Capturing incident knowledge so fixes aren’t lost

Curious to hear from:

DevOps / SREs: what part of incident response hurts most?
Founders: would this increase confidence in on-call teams?
Skeptics: what would make you not trust this?

We’re early and validating — honest feedback welcome.

Replies

@suryansh_gupta766

Quick question on the implementation: I’ve spent some time in the observability space (specifically with OpenObserve), and I'm curious how you're handling the 'data mountain.' When a production API spikes, you're often searching through GBs of logs and metrics in seconds. How do you reduce that volume enough to feed the LLM without hitting token limits or latency issues?

2mo ago

@ashish_kolhe2 We don’t send GBs of logs to the LLM. That would blow up both latency and token limits.

What we do instead is narrow things down a lot before the LLM even gets involved.

When something like an API spike happens, we first use signals from AWS (metrics, alerts, traces) to pinpoint the time window and affected services. That already cuts the search space significantly.
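
As a rough illustration of that first narrowing step, the sketch below picks the time window in which a latency metric sits well above its own baseline. The function name `incident_window`, the median baseline, and the `factor` threshold are all assumptions made for the example, not a description of the real pipeline, which relies on AWS metrics, alerts, and traces.

```python
from statistics import median


def incident_window(samples, factor=2.0):
    """Return (start, end) timestamps spanning the samples above baseline.

    samples: list of (timestamp, latency_ms) tuples, sorted by time.
    The baseline is the median latency; a sample counts as "hot" when it
    exceeds factor * baseline. Returns None when nothing is hot.
    """
    if not samples:
        return None
    baseline = median(value for _, value in samples)
    hot = [ts for ts, value in samples if value > factor * baseline]
    if not hot:
        return None
    return hot[0], hot[-1]
```

Only logs and changes inside the returned window (and for the affected services) would then need to be queried.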

From there, we run targeted queries, for example:

only error logs
recent IAM or config changes

Then we process that data:

group similar errors together
remove duplicates
turn metrics into simple trends

By this point, what started as GBs of data becomes a small, structured context.

That’s what the LLM sees.
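
For anyone curious what that reduction can look like, here is a small sketch of the idea under simplifying assumptions: errors are grouped by a fingerprint that masks digit runs, and a metric series is summarised as "rising", "falling", or "flat" by comparing its two halves. The helper names (`fingerprint`, `group_errors`, `trend`) and the 10% tolerance are hypothetical, not the production logic.

```python
import re
from collections import Counter

_DIGITS = re.compile(r"\d+")


def fingerprint(message):
    # Mask digit runs so messages differing only in numbers collapse together.
    return _DIGITS.sub("<n>", message)


def group_errors(messages, top=5):
    # Count messages per fingerprint; this also removes exact duplicates.
    counts = Counter(fingerprint(m) for m in messages)
    return counts.most_common(top)


def trend(values, tolerance=0.1):
    # Compare the mean of the second half of the series to the first half.
    if len(values) < 2:
        return "flat"
    mid = len(values) // 2
    first = sum(values[:mid]) / mid
    second = sum(values[mid:]) / (len(values) - mid)
    delta = second - first
    if abs(delta) <= tolerance * max(abs(first), 1e-9):
        return "flat"
    return "rising" if delta > 0 else "falling"


def build_context(messages, latency_ms):
    # The small, structured payload that would be handed to the LLM.
    return {
        "top_errors": group_errors(messages),
        "latency_trend": trend(latency_ms),
    }
```

Thousands of raw log lines end up as a handful of (pattern, count) pairs plus one word per metric, which fits comfortably in a prompt.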

2mo ago