PandaProbe
Open-source agent engineering platform
PandaProbe is an open-source agent engineering platform that gives you deep observability into AI agent applications. Use it to trace, evaluate, monitor and debug your AI agents in development and production.
PandaProbe
👋 Hey Product Hunt!
I’m Sina, founder of PandaProbe.
Building AI agents is getting easier, but understanding and trusting them in production is still hard.
Once agents start calling LLMs, tools, APIs, MCPs, and sub-agents, logs aren’t enough anymore. You need to see what happened, why it failed, whether quality regressed, and how reliable the system is across full sessions.
PandaProbe is my attempt to solve this: an open-source agent engineering platform for tracing, evaluation, monitoring, and debugging AI agent applications.
The goal is simple: help developers move from “it works on my laptop” to “I understand production behavior, can measure quality, and continuously improve it.”
What PandaProbe provides
🔎 Trace — capture full agent executions as sessions, traces, and spans across LLMs, tools, agents, and custom logic (see the sketch after this list).
📊 Evaluate — score traces and sessions using mission-critical, agent-specific metrics.
⏱️ Monitor — schedule recurring evaluations to automatically validate new traces and sessions in production.
📈 Analytics — track performance, cost, latency, errors, and quality trends over time.
🛠️ Open source + cloud — use the open-source core on GitHub or run PandaProbe in the cloud.
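To make the tracing model concrete, here's a minimal sketch of instrumenting an agent. Every name in it (the pandaprobe import, PandaProbe, observe, session) is an illustrative placeholder rather than the exact SDK surface; the docs linked below have the real quickstart.

```python
# Illustrative sketch only: the pandaprobe import, PandaProbe client,
# @observe decorator, and session() are hypothetical stand-ins for the
# real SDK surface. See the docs for the actual API.
from pandaprobe import PandaProbe, observe  # hypothetical import

client = PandaProbe(api_key="...")  # or point at a self-hosted instance

@observe(name="search_tool")    # each decorated call becomes a span
def search(query: str) -> str:
    return f"results for {query!r}"

@observe(name="support_agent")  # nested calls show up as child spans
def run_agent(user_msg: str) -> str:
    docs = search(user_msg)
    return f"answer grounded in {docs}"

with client.session(user_id="u-42"):  # group related traces into a session
    run_agent("How do I reset my password?")
```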
Who it’s for
🧑‍💻 AI engineers — debug agent behavior across LLMs, tools, and workflows.
🏗️ Platform teams — monitor quality, regressions, and reliability in production.
🔬 Builders experimenting with agents — understand failures and iterate faster.
🚀 Startups — add observability and evaluation before things become unmanageable and hard to reason about.
Quick links
GitHub: https://github.com/chirpz-ai/pandaprobe
Docs: https://docs.pandaprobe.com
Cloud: https://www.pandaprobe.com/
I’ll be here all day answering questions and collecting feedback.
If you’re building agents today, what’s the hardest part to debug or evaluate?
Thanks for checking it out 🙏
— Sina
Okan
Handling state and debugging for long-running autonomous agents is usually a nightmare, so having an open-source platform to standardize that workflow is huge. I can definitely see myself using PandaProbe to self-host my agent evaluation pipeline to keep sensitive client data entirely local. I am really curious to hear if you currently support custom tracing for raw API calls instead of just the standard frameworks.
PandaProbe
@y_taka Really appreciate that — and yes, democratizing observability and enabling teams to keep sensitive workflows fully self-hosted were big motivations behind making PandaProbe open source from day one.
And absolutely: we support custom tracing beyond standard frameworks. Alongside native integrations, PandaProbe also provides manual instrumentation APIs and decorators, so you can trace raw API calls, internal services, custom orchestration layers, or essentially any part of your agent workflow you want visibility into.
A lot of teams end up with hybrid architectures, so supporting low-level custom instrumentation was important for us early on.
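For raw API calls specifically, manual instrumentation looks roughly like this. Again a sketch: the trace/span method names are stand-ins I'm using for illustration, not the exact API, so check the docs for the real surface.

```python
# Sketch with made-up method names (trace(), span(), set_attributes());
# the real PandaProbe manual-instrumentation API may differ.
import requests
from pandaprobe import PandaProbe  # hypothetical import

client = PandaProbe(api_key="...")

with client.trace(name="enrich-lead") as trace:
    # Wrap any raw HTTP call in a span: no framework integration needed.
    with trace.span(name="crm-lookup", kind="http") as span:
        resp = requests.get("https://api.example.com/leads/123")
        span.set_attributes(
            status=resp.status_code,
            latency_ms=resp.elapsed.total_seconds() * 1000,
        )
```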
Evaluation is the hardest part of this whole space and most platforms hand-wave it. The failure mode that actually bites in production isn't crashes or schema errors. It's slow drift in subjective quality (voice, classification accuracy, output style) that only shows up when a human reads 50 outputs in a row. How does PandaProbe handle that in practice? LLM-as-judge with custom rubrics, human-in-loop on a held-out set, embedding-distance from a golden corpus, or something else? And how do you stop eval cost from outpacing inference cost when you're re-judging every trace?
PandaProbe
@vincentf This is actually one of the core motivations behind PandaProbe.
A lot of evaluation systems today focus on isolated outputs, but in production we kept seeing failures emerge as trajectory-level drift: looping, degraded tool grounding, coordination breakdowns, subtle quality regression, output-style drift, etc. The model can still sound confident locally while the overall session quality quietly collapses.
That led directly to a research paper I recently published called TRACER (Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning). The core idea is evaluating uncertainty and failure at the trajectory/session level rather than the individual response level.
That research became a major foundation for PandaProbe’s evaluation system and heavily shaped how we think about observability and longitudinal agent evaluation.
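To give a flavor of what trajectory-level scoring means, here's a deliberately simplified toy (emphatically not TRACER's actual formulation): a session can look fine on a per-response average while one stretch of steps quietly collapses, so you score windows of the trajectory rather than individual outputs.

```python
# Toy illustration of trajectory-level aggregation. NOT the TRACER
# algorithm, just the general idea that session risk depends on the
# whole step sequence, not any single response.
def session_risk(step_risks: list[float], window: int = 3) -> float:
    """Aggregate per-step risk scores (0..1) into one session score.

    Scoring the worst rolling window surfaces sustained risky runs
    (loops, degraded grounding) that a global mean washes out.
    """
    if not step_risks:
        return 0.0
    return max(
        sum(step_risks[i:i + window]) / min(window, len(step_risks) - i)
        for i in range(len(step_risks))
    )

# The global mean here is ~0.48, but the worst 3-step stretch is 0.85:
print(session_risk([0.1, 0.1, 0.8, 0.9, 0.85, 0.1]))  # -> 0.85
```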
On the cost side, we’re also very conscious about evals becoming more expensive than inference. PandaProbe supports async and sampled evaluations, composable metrics, and lightweight structural trajectory signals so teams don’t have to run expensive judge models on every trace.
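Concretely, that usually means a cheap structural filter plus sampling in front of the judge model. The field names and function below are made up to show the shape, not our actual config:

```python
# Sketch of cost-aware evaluation: cheap structural checks run on every
# trace, the expensive LLM-as-judge only on flagged or sampled ones.
# All names here are illustrative, not the PandaProbe SDK.
import random

def should_judge(trace: dict, sample_rate: float = 0.05) -> bool:
    # Always judge traces that cheap structural signals flag...
    if trace["error_count"] > 0 or trace["step_count"] > 30:
        return True
    # ...otherwise judge a small random sample to catch silent drift.
    return random.random() < sample_rate

for trace in ({"error_count": 0, "step_count": 12},
              {"error_count": 1, "step_count": 4}):
    if should_judge(trace):
        print("queue for LLM-as-judge:", trace)
```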
FuseBase
PandaProbe
@kate_ramakaieva Thanks for the support, Kate! Great question.
If you’re using one of our supported integrations for frameworks like LangGraph, CrewAI, and others, MCP tool calls are automatically captured and traced out of the box.
For custom agent architectures or internal tooling, we also provide lightweight manual instrumentation via decorators, so you can trace virtually any function, tool call, or workflow step in your agent logic.
Really nice work. The gap between "it ran" and "I understand what happened" is enormous for agents and nobody's solved it cleanly yet. Rooting for you!
PandaProbe
@igorsorokinua Really appreciate that and I completely agree. That gap becomes painfully obvious once agents start interacting with tools, memory, APIs, and other agents in production.
A big part of PandaProbe’s vision is making agent behavior actually inspectable (like traditional software engineering) and understandable instead of feeling like a black box.
MyLens AI
Great pain to tackle, Sina. Good luck.
PandaProbe
mcp-use
Congrats on the launch and thanks for using mcp-use :)
PandaProbe
@pederzh thanks for the support Luigi. I'm a big fan of mcp-use :)