We couldn't find an open benchmark for AI-generated API tests, so we built one
Every API testing eval we found either required source code access, relied on rich documentation, or measured output format rather than whether a test would catch a real failure.
So we built APIEval-20. Twenty scenarios across e-commerce, payments, auth, scheduling, and user management. Each scenario gives a model exactly two things: a JSON schema and a sample payload. No implementation details, no docs, no further context. The model has to generate a test suite from that alone.
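For concreteness, a scenario input along these lines is what the model sees. This is a hypothetical example I made up to illustrate the shape of the input (field names and structure are illustrative, not the actual dataset format):

```python
# Hypothetical APIEval-20-style scenario input: one JSON schema plus one
# sample payload, and nothing else. All names here are illustrative.
schema = {
    "endpoint": "POST /orders",
    "request": {
        "type": "object",
        "required": ["items", "currency"],
        "properties": {
            "items": {"type": "array", "minItems": 1},
            "currency": {"type": "string", "enum": ["USD", "EUR"]},
        },
    },
    "response": {
        "type": "object",
        "required": ["order_id", "total"],
        "properties": {
            "order_id": {"type": "string"},
            "total": {"type": "number", "minimum": 0},
        },
    },
}

sample_payload = {"items": [{"sku": "ABC-1", "qty": 2}], "currency": "USD"}
```

From just this, a strong test suite would need to infer boundary cases (empty `items`, unsupported `currency`, negative totals in the response) without ever seeing docs or source.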
The bugs are planted in live reference implementations. A bug counts as caught only if a generated test, when run against the buggy implementation, elicits a response that deviates from correct behavior.
Submit through the hosted eval harness and get a score back.
Scoring weights bug detection at 70%, API surface coverage at 20%, and test efficiency at 10%.
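As a sketch of how those weights combine, assuming each component is normalized to [0, 1] (the harness's exact normalization isn't specified here):

```python
def apieval_score(bug_detection: float, coverage: float, efficiency: float) -> float:
    """Weighted APIEval-20 score: bug detection 70%, API surface
    coverage 20%, test efficiency 10%. Inputs assumed in [0, 1];
    the normalization of each component is an assumption, not the
    harness's documented behavior."""
    for v in (bug_detection, coverage, efficiency):
        if not 0.0 <= v <= 1.0:
            raise ValueError("components must be in [0, 1]")
    return 0.7 * bug_detection + 0.2 * coverage + 0.1 * efficiency

# Example: catches 15/20 bugs, covers 80% of the surface, moderate efficiency.
print(round(apieval_score(15 / 20, 0.8, 0.5), 3))  # 0.735
```

The heavy bug-detection weight is the point: a suite that exercises every endpoint but catches nothing scores at most 0.3.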
We built this because we kept asking ourselves: how well can an AI agent actually think like a QA engineer? Most benchmarks measure whether a model produces syntactically correct output. This measures whether it does useful work.
It's free to use. Submit your model's output and see where it lands. Methodology and further details here. Dataset available here.
Curious whether others have tried evaluating AI agents on actual bug-finding rather than code generation quality.

