I've tried running OpenClaw myself and it's kind of a nightmare. You get it working, feel great about it, then wake up the next morning and it's just... dead. KiloClaw fixes the actual annoying part. Click a button, agent is running in under a minute, and it stays running. The fact that it's built on the same infrastructure powering 1.5M+ Kilo Code users means it's not some fly-by-night hosting wrapper. 500+ models, zero markup on tokens, and if you already use Kilo Code your account and credits just carry over. Genuinely impressed.
Nice benchmarks at the end of the use cases! I'd like to see more benchmarks allocated across different levels of tasks (non-coding).
@clearloop thanks for the support, Tianyi!
the benchmark currently includes 23 tasks across different categories and the @KiloClaw team is planning to improve it, targeting 100 tasks on a wider range of use cases.
any specific tasks in mind? adding @realolearycrew in the loop
@realolearycrew @fmerian
Oh sorry, I meant levels of difficulty of the tasks. Enlarging the number of tasks could just repeat the LLM-testing work that lots of projects have already done, which could be endless and yield no meaningful results. The interesting thing is actually the use cases, where things could be different. For example:
L1 task (we can have 100 of these): send an email with specified content to Alice. Steps:
get the content of the email (I believe no LLM will fail on this): don't lose any content or mix the content with the user's instructions
search the MCPs or skills for this (variable: whether the MCPs or skills are already in the LLM's context); would a cheap LLM fail here?
send the email (an LLM might pass wrong arguments or fail to find the right interface to do so)
done, and the result could be that, for this task, `qwen3.5:0.8B` has the same performance as opus-4.5 while saving 99% of token costs
L2...LN: maybe for L3 tasks, `qwen3.5:0.8B` can't handle them at all
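To make the idea concrete, the L1 example above could be expressed as a task spec with per-step pass/fail checks (a rough Python sketch with made-up names, not PinchBench's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Hypothetical leveled benchmark task: a difficulty level, a prompt,
    and per-step checks that each return True/False on a run transcript."""
    level: int                                   # L1 = trivial, higher = harder
    prompt: str
    checks: list = field(default_factory=list)   # callables: transcript -> bool

def score(task, transcript):
    """Fraction of step checks the model's transcript passes."""
    if not task.checks:
        return 0.0
    passed = sum(1 for check in task.checks if check(transcript))
    return passed / len(task.checks)

# L1 example: "send email with specified content to Alice"
send_email = Task(
    level=1,
    prompt="Send an email to Alice with the exact content below.",
    checks=[
        lambda t: t.get("body_preserved", False),  # content not lost or mixed with instructions
        lambda t: t.get("tool_found", False),      # right MCP/skill located
        lambda t: t.get("args_correct", False),    # correct send arguments
    ],
)

# A run that preserved content and found the tool but botched the arguments
result = score(send_email, {"body_preserved": True, "tool_found": True, "args_correct": False})
```

With enough L1/L2/L3 tasks scored this way, you could directly compare a tiny model's per-level scores against a flagship's instead of one aggregate number.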
I previously had a post about this, see also https://www.crabtalk.ai/blog/agent-capability-benchmarks
Kilo Code
@clearloop we are actively seeking feedback on what other types of tasks we should add!
Can you add your thoughts here: https://github.com/pinchbench/skill/issues/52
We also want it to be broader and cover lots of non-coding functions
When setting up your @OpenClaw, you might wonder what the best AI model for your agent is. PinchBench just lets you know.
TL;DR: It's @OpenAI's GPT-5.4... for now!
S/O to @realolearycrew for building it 👏👏 - Give it a star on GitHub
@fmerian There should be a spoiler warning here 😅
oops 🙈
ClawSecure
@realolearycrew @fmerian Benchmarking across success rate, speed, AND cost in one system is exactly what's been missing. Most model comparisons focus on one dimension, usually just quality, and ignore the tradeoffs that actually matter when you're running agents in production.
We operate multiple AI models across different workflows internally and the biggest decision isn't "which model is best" but "which model is best for THIS specific task at THIS cost threshold." A model that's 90% as good at 7% of the cost is the right choice for routine tasks. A model that catches edge cases other models miss is worth the premium for security-critical work. Having standardized benchmarks across real-world OpenClaw coding tasks gives developers the data to make that routing decision instead of guessing.
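That routing decision becomes mechanical once you have per-model success-rate and cost numbers; here's a minimal sketch of the idea, with illustrative model names and numbers (not real benchmark results):

```python
def pick_model(models, min_success=0.85):
    """Pick the cheapest model whose benchmark success rate clears the
    task's quality bar. `models` maps name -> (success_rate, cost_per_task_usd)."""
    eligible = [(cost, name) for name, (rate, cost) in models.items()
                if rate >= min_success]
    if not eligible:
        # Nothing clears the bar: fall back to the single best-scoring model
        return max(models, key=lambda name: models[name][0])
    return min(eligible)[1]

# Illustrative numbers only, not actual leaderboard data
MODELS = {
    "big-flagship": (0.92, 0.40),
    "mid-tier":     (0.90, 0.12),
    "tiny-cheap":   (0.81, 0.01),
}

pick_model(MODELS, min_success=0.85)  # routine task -> "mid-tier"
pick_model(MODELS, min_success=0.95)  # strict bar -> falls back to "big-flagship"
```

The interesting part is setting `min_success` per task category, which is exactly what a benchmark broken out by success rate, speed, and cost enables.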
The fact that this runs against real-world tasks and not synthetic benchmarks is key. We see the same thing in security scanning: synthetic test cases tell you how a tool performs in ideal conditions. Real-world data tells you how it performs on the messy, unpredictable code that developers actually ship. Real-world benchmarks are always more valuable.
The OpenClaw ecosystem needed this. As the agent framework grows and more models compete for developer adoption, having an independent, standardized way to evaluate performance helps the entire community make better decisions. Congrats to the Kilo Code team on the launch!
@realolearycrew @jdsalbego Thanks for the kind words, JD!
What models are you using when building @ClawSecure? (and how do they stack up??)
Product Hunt
Kilo Code
@curiouskitty I think SWE-bench is a great benchmark for software engineering tasks. The whole point of PinchBench is that we think OpenClaw goes far beyond development work to all knowledge work and even personal-assistant-type tasks. So my goal is for PinchBench to reflect that, not just software engineering
Great question - they do run benchmarks continuously as new models are released. For the record, the latest leaderboard update was on March 21st (5 days ago), and here are the current best scores:
@OpenAI's GPT-5.4: 90.5%
@Qwen 3.5-27B: 90.0%
@Qwen 3.5-397B-A17B: 89.1%
How does your model stack up? 😸
Kilo Code
@anusuya_bhuyan typically we have new models up within a few hours. Although we also have partnerships with inference providers that can make that even faster.
For example we had a “stealth” version of Nemotron 3 Super before it even launched 😃
@realolearycrew any on-going "stealth" models to play with? 👀
Okay, this is genuinely useful. I've been picking models for coding tasks based on whatever benchmark thread showed up in my feed that week, which is a terrible way to make that decision.
The cost dimension is what gets me. Success rate matters, but if a model takes 3x longer and costs 4x more to get there, that changes the math completely, depending on what you're building. Glad someone's actually measuring all three together.
Curious how you're defining task success — is it automated test output or is there a human eval component? That part always feels like the hardest thing to get right in coding benchmarks.
Congrats on shipping. The 🦀 was not lost on me.
Kilo Code
@ryszard_wisniewski Thank you for your support!
The best part is that you get to shape it because the benchmark is open source, and you can submit your own tests. More on this here: https://blog.kilo.ai/p/pinchbench-v2-call-for-contributors
This is exactly what I was looking for. However, tasks should be scoped and agents should be ranked depending on task category.
Imho the most important model to pick is for the main agent, the orchestrator, the one you talk to. But then, you will eventually want different subagents specialized in different tasks (and ideally not as expensive, depending on the task at hand). For those, the "best" model (in terms of value for money) could be something else (e.g., for a simple but broad internet search, Gemini Flash is often more than enough).
Kilo Code
@wtfzambo1 Totally agree! Have you tried the Auto Balanced model in KiloClaw? That's exactly the idea behind it: smarter, more expensive models for architecting and orchestrating, cheaper ones for execution
@wtfzambo1 Give it a spin and let us know what you think in a review! producthunt.com/products/kiloclaw/reviews/new
How do you make sure the results from PinchBench reflect real-world use, especially when different projects have different complexity, tools, and edge cases?