Launching today

Benchspan

Run agent benchmarks in minutes, not hours

57 followers

AI Metrics and Evaluation

BenchSpan is a benchmarking platform for AI agents. Running benchmarks is slow, expensive, and fragile. We fix that. Onboard your agent once (we onboarded Claude Code in 37 lines), run any benchmark in parallel in the cloud, and get every result in one place your whole team can see. When runs fail halfway, rerun just what broke. Compare runs side by side to see exactly where your agent is improving. Stop fighting your benchmarks and start shipping your agent.

Launch tags: API • Developer Tools • Artificial Intelligence

Benchspan

Maker

Hey PH 👋, Ritesh from Benchspan here.

We were building AI agents and needed to know if they were getting better. Sounds simple. It wasn't. Every benchmark assumed a different interface, which meant days of glue code just to get running. Full suites took 14 hours on a laptop. A single failure at 72% burned $600 in tokens and we'd start from scratch. Nobody on the team trusted anyone else's numbers because nobody ran the same config. And results? Scattered across CSVs, messages, and spreadsheets nobody could find.

We realized we were spending more time fighting our benchmarks than improving our agent. So we built the tool we wished existed.

How it works

1. Onboard your agent. Write a small bash script that passes standard inputs to your agent.
2. Pick a benchmark and run it.
3. Results flow in automatically. Scores, trajectories, errors, timing. Everything is captured and tagged with your agent's commit hash so you can compare runs side by side.

What you get

- Any agent that runs via bash. No framework lock-in. No interface conformance. One-time setup.
- Massively parallel execution. Every instance runs in its own Docker container. 500 instances that took 14 hours take a fraction of the time.
- Rerun only what failed. Network error on 37 instances? Rerun those 37. Join the results. Stop paying twice.
- Identical environments, every time. Same Docker image, same config, tagged with the exact commit hash. No more "works on my machine."
- One source of truth. Every run, every result, every trajectory: tagged, searchable, comparable. The whole team sees the same thing.
- Smoke tests. Run 5 instances to validate your setup before kicking off a 500-instance run. Catch bugs cheap.

If you're benchmarking agents and have feedback, I'm in the comments 👇
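To make the "rerun only what failed, then join the results" idea concrete, here is a minimal Python sketch. It is not Benchspan's implementation or API: the result record shape (`status`, `score`), the helper names `failed_ids` and `rerun_failed`, and the `run_instance` callback are all hypothetical, chosen only to illustrate the technique.

```python
from typing import Callable, Dict, List, Optional, TypedDict


class Result(TypedDict):
    """Hypothetical per-instance record; a real platform would store more."""
    status: str              # "ok" or "error"
    score: Optional[float]   # None when the instance did not finish


def failed_ids(results: Dict[str, Result]) -> List[str]:
    """Return the ids of instances whose last attempt did not finish cleanly."""
    return sorted(i for i, r in results.items() if r["status"] != "ok")


def rerun_failed(
    results: Dict[str, Result],
    run_instance: Callable[[str], Result],
) -> Dict[str, Result]:
    """Rerun only the failed instances and join them back into one result set.

    Successful instances are copied over untouched, so they are never paid
    for twice. `run_instance` is a stand-in for whatever executes one
    benchmark instance.
    """
    merged = dict(results)  # shallow copy: the original run stays unchanged
    for instance_id in failed_ids(results):
        merged[instance_id] = run_instance(instance_id)
    return merged


if __name__ == "__main__":
    # Simulated first run: 500 instances, the first 37 hit a network error.
    first_run: Dict[str, Result] = {
        f"task-{n}": (
            {"status": "error", "score": None}
            if n < 37
            else {"status": "ok", "score": 1.0}
        )
        for n in range(500)
    }

    def fake_runner(instance_id: str) -> Result:
        # Assumption: the retry succeeds; a real runner would call the agent.
        return {"status": "ok", "score": 0.5}

    merged = rerun_failed(first_run, fake_runner)
    print(len(failed_ids(first_run)), "failed before the rerun")
    print(len(failed_ids(merged)), "failed after the rerun")
```

Keying results by instance id is what makes the join trivial in this sketch: a rerun overwrites only the entries that failed, and everything else carries over from the original run.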

1d ago

@ritesh_malpani Curious about the rerun-only-failures part. If I'm running something like SWE-bench on a custom agent and 40 out of 500 instances fail due to network issues, does the rerun stitch those results back into the original run automatically, or do I end up with two separate result sets I need to merge?

5h ago
