OpenClaw is the most popular open source AI agent on the planet. Running it yourself? That's the hard part. KiloClaw is a fully managed, hosted version of OpenClaw. We handle the infrastructure, security, updates, and monitoring so you can focus on what your agent actually does - not keeping it alive.
This is the 2nd launch from KiloClaw.

PinchBench
Launched this week
PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. We run the same set of real-world tasks across different models and measure success rate, speed, and cost to help developers choose the right model for their use case.
PinchBench is made with 🦀 by Kilo Code, the makers of KiloClaw.


Launch Team
How do you make sure the results from PinchBench reflect real-world use, especially when different projects have different complexity, tools, and edge cases?
You're spot on: most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters:
Tool usage: Can the model call the right tools with the right parameters?
Multi-step reasoning: Can it chain together actions to complete complex tasks?
Real-world messiness: Can it handle ambiguous instructions and incomplete information?
Practical outcomes: Did it actually create the file, send the email, or schedule the meeting?
The benchmark currently includes 23 tasks across different categories, and the team is looking for contributors to add more (target: 100).
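The scoring described above (success rate, speed, and cost per model across a fixed task set) can be sketched in a few lines. This is a minimal illustration, not PinchBench's actual code; the `TaskResult` and `ModelReport` names and fields are hypothetical and assume each task run records a pass/fail outcome, a duration, and a token spend.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Outcome of one benchmark task run (hypothetical schema)."""
    success: bool    # did the agent actually create the file / send the email?
    seconds: float   # wall-clock time for the run
    cost_usd: float  # token spend for the run

@dataclass
class ModelReport:
    """Aggregates task results for one model."""
    results: list = field(default_factory=list)

    def add(self, result: TaskResult) -> None:
        self.results.append(result)

    def summary(self) -> dict:
        n = len(self.results)
        return {
            "success_rate": sum(r.success for r in self.results) / n,
            "avg_seconds": sum(r.seconds for r in self.results) / n,
            "avg_cost_usd": sum(r.cost_usd for r in self.results) / n,
        }

# Two illustrative runs for one model
report = ModelReport()
report.add(TaskResult(success=True, seconds=12.0, cost_usd=0.03))
report.add(TaskResult(success=False, seconds=40.0, cost_usd=0.10))
print(report.summary())
```

Running the same task set through a report like this per model is what makes the success-rate/speed/cost comparison across models possible.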
Let's build the best benchmark for @OpenClaw 🦞
With PinchBench testing real world tasks instead of synthetic benchmarks, how do you decide which tasks go into the benchmark suite and how often do you rotate them to avoid overfitting? Congrats on the launch!
Good question: the objective here is to test what actually matters.
PinchBench currently includes 23 tasks across real-world categories (productivity, research, coding...), and the team is looking for contributors to reach 100 tasks that reflect how @OpenClaw is actually used in practice.
See the public repository on GitHub for more details.
CrabTalk
Nice benchmarks at the end of the use cases! I would like to see more benchmarks covering different levels of tasks (not just coding).
@clearloop thanks for the support, Tianyi!
the benchmark currently includes 23 tasks across different categories and the @KiloClaw team is planning to improve it, targeting 100 tasks on a wider range of use cases.
any specific tasks in mind? adding @realolearycrew in the loop
CrabTalk
@realolearycrew @fmerian
Oh sorry, I meant levels of difficulty for the tasks. Simply enlarging the number of tasks could turn into the kind of repetitive LLM testing that lots of projects have already done, which could be endless and yield no meaningful results. The interesting thing is actually the use cases, where things could be different. For example:
An L1 task (we can have 100 of these): send an email with specified content to Alice. Steps:
1. Get the content of the email (I believe no LLM will fail on this); do not lose any content or mix the content with user instructions.
2. Search the MCPs or skills for this (variable: whether the MCPs or skills are already in the LLM's context); would a cheap LLM fail here?
3. Send the email (an LLM could fail by passing wrong arguments or failing to find the right interface).
Done. The result could be that, for this task, `qwen3.5:0.8B` has the same performance as opus-4.5 while saving 99% of token costs.
Then L2...LN: maybe `qwen3.5:0.8B` cannot handle L3 tasks at all.
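The leveled-task idea above (a difficulty level plus per-step pass/fail checks) can be sketched like this. This is a hypothetical illustration of the commenter's proposal, not PinchBench code; `LeveledTask`, `Step`, and the transcript fields are invented names, and the checks assume the harness can see which tool the agent called and with what arguments.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One checkpoint in a task; check() inspects the agent's run transcript."""
    name: str
    check: Callable[[dict], bool]

@dataclass
class LeveledTask:
    level: int   # 1 = simple single-tool task; higher levels are harder
    name: str
    steps: list

    def score(self, transcript: dict) -> bool:
        # A task passes only if every step check passes
        return all(step.check(transcript) for step in self.steps)

# The "send email to Alice" L1 task, with one check per step:
send_email = LeveledTask(
    level=1,
    name="send-email-to-alice",
    steps=[
        Step("content preserved", lambda t: t["body"] == t["requested_body"]),
        Step("right tool found", lambda t: t["tool"] == "email.send"),
        Step("right recipient", lambda t: t["to"] == "alice@example.com"),
    ],
)

transcript = {"body": "hi Alice", "requested_body": "hi Alice",
              "tool": "email.send", "to": "alice@example.com"}
print(send_email.score(transcript))
```

With tasks tagged by level, the result table could then show where a small model stops keeping up with a frontier model, rather than a single aggregate score.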
I previously had a post about this, see also https://www.crabtalk.ai/blog/agent-capability-benchmarks
Kilo Code
@clearloop we are actively seeking feedback on the other types of tasks we should add!
Can you add your thoughts here: https://github.com/pinchbench/skill/issues/52
We also want the suite to be broader and cover lots of non-coding use cases.
Features.Vote
the "focus on what your agent actually does, not keeping it alive" framing hits different when you've actually tried to self-host something like this. the infrastructure part isn't just tedious. it becomes the thing that distracts you from the whole reason you set it up
the pinchbench benchmarking layer is the underrated part here. most people pick a model based on vibes or generic leaderboards that aren't specific to their workflows. having real-world task data for openclaw use cases specifically changes what "best model" even means
Kilo Code
Not just Jensen - y'all gotta know which model's best for your claws!
And y'all can contribute to it, because it's open source 🫶
Great job @realolearycrew !!
@realolearycrew is the 🐐