Metoro
AI SRE that detects, root causes & auto-fixes K8s incidents
305 followers
Metoro is an AI SRE for systems running in Kubernetes. Metoro autonomously monitors your environment, detecting incidents in real time. After detecting an incident, it root-causes the issue and opens a pull request to fix it. You just get pinged with the fix. Metoro brings its own telemetry with eBPF at the kernel level, so no code changes or configuration are required. A single helm install and you're up and running in less than 5 minutes.

Nice! I think we could use this at Asteroid. I'm interested to know how you've thought about keeping it secure when things go wrong
Metoro
@joe_hewett1 Thanks Joe!
Yep, there are a couple of levels:
In cluster components
All of our monitoring is done out of process via eBPF, which lives effectively isolated in the Linux kernel. So let's say we have a bug: it isn't possible for us to affect your services; worst case, we stop collecting telemetry.
Data Security For The Agent
All agents run without internet access and with tight RBAC, so they can only see specific subsets of data. This way the agent can't accidentally exfiltrate data.
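The data-scoping idea above can be sketched with a small read-only accessor that refuses anything outside its allowlist. This is a hypothetical illustration of the pattern, not Metoro's actual implementation; `ScopedReader` and its names are made up for the example.

```python
class ScopedReader:
    """Hypothetical read-only telemetry accessor restricted to an
    allowlist of namespaces, mirroring tight RBAC scoping."""

    def __init__(self, store: dict[str, list[str]], allowed: set[str]):
        self._store = store
        self._allowed = frozenset(allowed)  # immutable after construction

    def logs(self, namespace: str) -> list[str]:
        # Deny-by-default: anything outside the allowlist is an error.
        if namespace not in self._allowed:
            raise PermissionError(f"namespace {namespace!r} outside agent scope")
        return list(self._store.get(namespace, []))
```

The point of the pattern is that even a buggy or confused agent physically cannot read data it was never granted, because the check sits in the access layer rather than in the agent's prompt or logic.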
Congrats, looking forward to trying it.
Is it just kubernetes or does it also work on apps too?
Metoro
@saturnin_pugnet So if those apps are running in Kubernetes, then we work on those too. We hook into the source code via a GitHub integration, so we can debug application-level issues as well.
The classic use case is that a bad deploy ships: we can see exactly which parts of the code changed and deeply investigate the endpoints that were changed to understand whether there's an actual regression, based on our telemetry.
@chrisbattarbee Generating telemetry at the kernel level with eBPF to remove the instrumentation overhead is a strong approach. That part makes a lot of sense, especially given how inconsistent telemetry can be across services and teams.
The part that feels much harder is the auto-fix layer. In real systems, issues are rarely isolated. You often have partial signals, cascading failures, or symptoms that look like root causes. In those cases, even getting the diagnosis right is non-trivial, let alone generating a fix that is safe to apply.
How do you validate that a generated PR is actually safe in production and not just technically correct in isolation? For example, avoiding cases where the fix resolves one symptom but introduces regressions elsewhere or conflicts with existing infra assumptions.
I’ve been working in a similar space on the code side with Codoki.ai (AI code review and automated fixes), and even at that level, ensuring suggestions are reliable and not contextually wrong is a constant challenge, especially as systems get larger and more complex. So pushing this into infra-level auto-remediation is a big step.
Would be interesting to understand how you’re handling validation, rollback strategies, or confidence scoring before applying fixes.
Congrats on the launch.
@chrisbattarbee @moh_codokiai Thanks Muhammad!
Really good question and I agree, code fixes are one of the hardest parts. To be clear, we don’t do blind auto-remediation.
Metoro is human-in-the-loop: it investigates the issue, identifies the likely initiating failure, and then suggests a code fix (which you can open as a PR) for an engineer to review or for a coding agent to take further. We do this very intentionally, so that fixes go through the same safety mechanisms a normal release would.
To reduce the risk, the suggestion is grounded in eBPF telemetry, topology, infra context, recent deploys/config changes, and the actual code path, so we're not just reacting to one noisy symptom. It's cross-checked against telemetry, infra, and code.
Then once the change is deployed, we verify the rollout against production telemetry to see whether it actually resolved the issue or caused regressions (the AI deployment verification feature).
It’s not a fully autonomous remediation system yet, but it is designed to get teams 80%+ of the way to resolution.
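A minimal sketch of what a rollout verification check like this could look like, assuming you can aggregate request counts, errors, and p99 latency per window. All names and thresholds here are hypothetical, not Metoro's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Window:
    """Aggregated telemetry for a service over one time window."""
    requests: int
    errors: int
    p99_latency_ms: float

def error_rate(w: Window) -> float:
    return w.errors / w.requests if w.requests else 0.0

def verify_rollout(before: Window, after: Window,
                   max_error_delta: float = 0.01,
                   max_latency_ratio: float = 1.2) -> bool:
    """Pass only if the new version regressed neither error rate nor p99 latency
    beyond the configured thresholds."""
    if error_rate(after) - error_rate(before) > max_error_delta:
        return False
    if before.p99_latency_ms and after.p99_latency_ms / before.p99_latency_ms > max_latency_ratio:
        return False
    return True
```

For example, `verify_rollout(Window(10000, 20, 120.0), Window(10000, 25, 130.0))` passes (tiny error delta, latency within 1.2x), while a jump to 300 errors in the after-window fails the check.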
Collective.work
AI SRE using eBPF to collect telemetry definitely seems like the way to go, I was dreaming of such a solution. Could you onboard me @chrisbattarbee? Looks amazing, would love to have a chat and test it!
Metoro
@paul_vidal Thanks Paul!
For sure, onboarding works however you like! Either install it yourself (after logging in you'll be given the single helm command), or book a meeting here https://cal.com/team/metoro/engineer and I'll be sure to pick it up and run you through it :)
@chrisbattarbee Interesting direction. Most tools stop at alerts and dashboards, going into auto-fix is a big step. How do you handle edge cases where the issue isn’t clearly defined or spans multiple services?
@chrisbattarbee @josh_bennett1
Great question! If the issue spans multiple services, multiple investigation agents are spawned across the affected paths instead of assuming one service is the problem.
They follow the dependency graph from eBPF-generated traces and investigate each branch using traces, logs, metrics, k8s state, deploy/config diffs, and memory (what it already knows about that service's behaviour). That lets us separate the first real failure from the downstream effects.
If there is a clear initiating fault, we identify it. If there isn’t, we surface the causal chain and candidate failure points with evidence instead of pretending there is one neat root cause.
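The core of this "separate the initiating failure from downstream fallout" step can be sketched as a small graph heuristic: an unhealthy service whose own dependencies are all healthy is a candidate root cause, while unhealthy services with unhealthy dependencies are likely just fallout. A minimal sketch under that assumption (names hypothetical, not Metoro's actual code):

```python
def candidate_root_causes(deps: dict[str, list[str]],
                          unhealthy: set[str]) -> set[str]:
    """deps maps each service to the services it calls (edges follow
    the request path, e.g. from eBPF-generated traces).

    An unhealthy service whose dependencies are all healthy is a
    candidate initiating failure; every other unhealthy service is
    treated as downstream fallout."""
    return {
        svc for svc in unhealthy
        if not any(dep in unhealthy for dep in deps.get(svc, []))
    }
```

For instance, if `frontend -> checkout -> payments -> db` and frontend, checkout, and payments are all firing alerts, only `payments` comes back as a candidate, because its dependency (`db`) is healthy. When several candidates survive, that matches the "surface the causal chain with evidence" behaviour described above rather than forcing one neat root cause.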
Nectar
@chrisbattarbee and @ece_kayan Good stuff! In my experience, you’re spot on about how heterogeneous and inconsistent observability is in practice. I’m going to try it out and might ping you for a chat.
Metoro
@ece_kayan @shrir Amazing, thanks Shrirang!
Ogoron
This is a very compelling direction, moving from observability to actual autonomous remediation is a huge step for SRE workflows.
Love the idea of going from detection → root cause → PR with a fix, especially without requiring code changes. The eBPF + zero-config setup makes it even more impressive.
We also launched on Product Hunt today — building Ogoron, an AI system that automatically generates and maintains test coverage as products evolve. Different part of the lifecycle, but very aligned in spirit: reducing the manual overhead of keeping complex systems reliable :)
Good luck with the launch!
Metoro
@yanakazantseva1 Thanks Yana, best of luck with your launch too! Ogoron seems pretty cool :)