Where SLMs beat GPT-5
We’ve been seeing a consistent pattern across agent systems:
GPT-5 works well as a judge on average cases,
but breaks down on edge cases and policy boundaries.
That’s exactly where reliability matters.
In our recent work, we took a different approach (rough code sketch below):
Generate adversarial edge cases from the spec
Resolve ambiguity via multi-agent debate
Train a task-specific small model (SLM) on that data
Paper: https://huggingface.co/papers/2604.25203
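
A minimal sketch of those three steps, assuming a generic `llm()` chat wrapper. All names here (`generate_edge_cases`, `debate_label`, `build_dataset`) are illustrative, not the paper's actual interface:

```python
from dataclasses import dataclass

@dataclass
class LabeledCase:
    text: str
    label: str  # "PASS" or "FAIL" against the policy


def llm(prompt: str, model: str = "gpt-5") -> str:
    """Stand-in for whatever chat-completion client you use."""
    raise NotImplementedError


def generate_edge_cases(spec: str, n: int = 200) -> list[str]:
    # Step 1: ask a strong model for inputs that sit on the policy boundary.
    prompt = (
        f"Policy spec:\n{spec}\n\n"
        f"Write {n} inputs that are ambiguous or adversarial with respect "
        "to this policy, one per line."
    )
    return [line for line in llm(prompt).splitlines() if line.strip()]


def debate_label(case: str, spec: str, rounds: int = 2) -> str:
    # Step 2: two agents argue opposite verdicts, then a judge adjudicates.
    # The debate surfaces spec ambiguity before the case enters training data.
    transcript = ""
    for _ in range(rounds):
        transcript += "\nPRO: " + llm(
            f"Argue that this input PASSES the policy.\n{spec}\n{case}\n{transcript}"
        )
        transcript += "\nCON: " + llm(
            f"Argue that this input FAILS the policy.\n{spec}\n{case}\n{transcript}"
        )
    return llm(f"Given this debate, answer PASS or FAIL only.\n{transcript}")


def build_dataset(spec: str) -> list[LabeledCase]:
    # Step 3 input: labeled boundary cases the SLM is fine-tuned on.
    # The fine-tuning itself is standard SFT on a small model, omitted here.
    return [LabeledCase(c, debate_label(c, spec)) for c in generate_edge_cases(spec)]
```

The debate step matters because boundary cases are exactly where a single judge's label is least trustworthy.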
What we’re seeing:
SLMs outperform GPT-5 on boundary decisions
More consistent (less flip-flopping on similar inputs)
Fast enough for real-time, per-interaction evaluation
This leads to a different stack (request path sketched below):
GPT-5 → generation
SLMs → evaluation + guardrails
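
At runtime, the split looks roughly like this. `generate_with_gpt5`, `slm_judge`, and `fallback` are placeholders, not real APIs:

```python
def generate_with_gpt5(request: str) -> str: ...        # placeholder generator
def slm_judge(request: str, draft: str) -> str: ...     # placeholder SLM verdict: "PASS"/"FAIL"
def fallback(request: str) -> str: ...                  # placeholder safe response

def handle(user_request: str) -> str:
    draft = generate_with_gpt5(user_request)    # big model: generation
    verdict = slm_judge(user_request, draft)    # small model: cheap enough to run on every interaction
    return draft if verdict == "PASS" else fallback(user_request)  # block, regenerate, or escalate
```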
Curious if others are seeing similar behavior in production.
(If relevant, we also turned this into a product: https://www.producthunt.com/products/plurai)


Replies
insightful read. what struck me most: SLMs could produce 43% fewer failures, at an 8x lower cost, and in less than 100 ms.
@ilankad23 curious if you could get similar results with other models?