Where SLMs beat GPT-5

by Ilan Kadar
We’ve been seeing a consistent pattern across agent systems:

GPT-5 works well as a judge on average cases,
but breaks down on edge cases and policy boundaries.

That’s exactly where reliability matters.

In our recent work, we took a different approach:

  • Generate adversarial edge cases from the spec

  • Resolve ambiguity via multi-agent debate

  • Train a task-specific small model (SLM) on that data
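The three steps above can be sketched as a toy pipeline. Everything here is an illustrative assumption (the spec format, the judge agents, the voting rule), not the paper's actual implementation:

```python
# Hypothetical sketch of: spec -> adversarial edge cases -> debate -> SLM training data.
# All names and the spec format are illustrative assumptions.

def generate_edge_cases(spec):
    """Produce inputs just below, at, and just above each policy boundary."""
    cases = []
    for rule, threshold in spec.items():
        for delta in (-1, 0, 1):
            cases.append({"rule": rule, "value": threshold + delta})
    return cases

def debate(case, judges):
    """Resolve an ambiguous label by majority vote across judge agents."""
    votes = [judge(case) for judge in judges]
    return max(set(votes), key=votes.count)

# Toy spec: refunds allowed only under 30 days, discounts capped at 15%.
spec = {"max_refund_days": 30, "max_discount_pct": 15}

# Simple stand-in "agents" that disagree exactly at the boundary.
strict   = lambda c: c["value"] < spec[c["rule"]]   # boundary value -> reject
lenient  = lambda c: c["value"] <= spec[c["rule"]]  # boundary value -> accept
tiebreak = lambda c: c["value"] < spec[c["rule"]]

# Labeled boundary cases, ready to train a small task-specific model on.
train_data = [(c, debate(c, [strict, lenient, tiebreak]))
              for c in generate_edge_cases(spec)]
```

The point of the sketch: the training set is concentrated exactly where a general-purpose judge is least reliable, on inputs at or adjacent to a policy boundary.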

Paper: https://huggingface.co/papers/2604.25203

What we’re seeing:

  • SLMs outperform GPT-5 on boundary decisions

  • More consistent (less flip-flopping on similar inputs)

  • Fast enough for real-time, per-interaction evaluation

This leads to a different stack:

  • GPT-5 → generation

  • SLMs → evaluation + guardrails
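In code, that split might look like the sketch below. `call_llm` and `slm_guardrail` are placeholder stubs standing in for a large generator and a small trained evaluator; the banned-phrase check is just a stand-in for the SLM's decision:

```python
# Illustrative sketch of the generation/evaluation split, not a real API.

def call_llm(prompt):
    """Stand-in for a large generator model (e.g. GPT-5 behind an API)."""
    return f"draft answer for: {prompt}"

def slm_guardrail(text):
    """Stand-in for a small task-specific evaluator.
    Cheap and fast enough to run on every single interaction."""
    banned = ("refund guaranteed", "legal advice")
    return not any(phrase in text.lower() for phrase in banned)

def handle(prompt, fallback="Sorry, I can't help with that."):
    draft = call_llm(prompt)      # big model generates
    if slm_guardrail(draft):      # small model evaluates + gates
        return draft
    return fallback
```

The design choice is that the expensive model runs once per request, while the cheap evaluator can run on every draft without adding meaningful latency.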

Curious if others are seeing similar behavior in production.

(If relevant, we also turned this into a product: https://www.producthunt.com/products/plurai)


Replies

fmerian

insightful read. what struck me the most: SLMs could produce 43% fewer failures, at an 8x lower cost and in under 100 ms.

@ilankad23 curious if you could get similar results with other models?