Launched this week
Vibe training for AI agent reliability. Describe what your agent should and should not do — Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

TabAI
The multi-turn simulation piece is interesting.
Single prompt evals are easy, but most real failures happen across a sequence of interactions.
If this actually captures that well, that’s a meaningful step up from most eval tooling I’ve seen.
Plurai
@igor_martinyuk Exactly. That's one of the challenges we've been facing, and a main differentiator.
Plurai
@igor_martinyuk Exactly! Most real failures aren’t single turns, they’re stateful across interactions.
That’s why we simulate multi-turn flows and generate edge cases across the sequence, not just isolated prompts, including those “looks fine at each step, breaks at the end” scenarios.
Curious to hear: what kinds of multi-turn failures have you seen most often?
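(A minimal sketch of the multi-turn idea described above, in Python. The agent, scenario, and final-state check are hypothetical stand-ins, not Plurai's API; the point is just that the judgement runs over the whole conversation rather than any single turn.)

```python
# Toy multi-turn evaluation loop: every turn may look fine in isolation,
# but the check at the end is made against the full conversation history.
# `agent` and `check_final_state` are hypothetical placeholders.
def run_multi_turn_eval(agent, scenario_turns, check_final_state):
    history = []
    for user_msg in scenario_turns:
        reply = agent(user_msg, history)          # agent sees all prior turns
        history.append((user_msg, reply))
    # A per-turn judge would miss failures that only emerge here, e.g. a refund
    # promised in turn 2 that the agent contradicts in the final turn.
    return check_final_state(history)
```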
Plurai
@igor_martinyuk We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@igor_martinyuk Glad you love it!
Vibe training is such a good framing, finally something that matches how teams actually think about agent behavior. cheers team 🙌
BTW, what happens when two guardrails conflict with each other at runtime?
Plurai
@boyuan_deng1 thank you :) we're also obsessed with the framing 🤩
Each guardrail returns its classification and reasoning, and your “state machine” can decide how to arbitrate between the two with the full context.
Plurai
@reut_v_plurai great answer, and "state machine" is exactly the right mental model here 🎯
Plurai
@boyuan_deng1 means a lot, we're obsessed with that framing too 🙌 @reut_v_plurai nailed the answer below — each guardrail returns its classification + reasoning, so your logic layer has full context to resolve conflicts. Not just verdicts, actual signal.
Plurai
@boyuan_deng1 Did you get a chance to try the product? Curious what you think!
Plurai
@boyuan_deng1 Each guardrail returns its classification, confidence, and reasoning, so you have the full context.
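(A minimal sketch of that resolution pattern, assuming each guardrail returns a classification, confidence, and reasoning as described above. The GuardrailResult shape and the resolve policy are illustrative only, not Plurai's actual API.)

```python
from dataclasses import dataclass

# Hypothetical guardrail result shape, based on the "classification + confidence + reasoning"
# description above; field names are assumptions, not Plurai's actual API.
@dataclass
class GuardrailResult:
    name: str
    classification: str   # e.g. "allow" or "block"
    confidence: float     # 0.0 to 1.0
    reasoning: str

def resolve(results):
    """Toy conflict-resolution 'state machine': block wins, ties broken by confidence."""
    blocks = [r for r in results if r.classification == "block"]
    if blocks:
        return max(blocks, key=lambda r: r.confidence)
    return max(results, key=lambda r: r.confidence)

decision = resolve([
    GuardrailResult("pii_filter", "block", 0.92, "Response echoes the user's card number."),
    GuardrailResult("helpfulness", "allow", 0.71, "Answer addresses the refund question."),
])
print(decision.name, decision.classification, decision.reasoning)
```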
Oh, this looks really cool, esp the idea of running evals on every interaction (not just samples). Just curious how it performs on more subjective tasks though))) And congrats on the launch, btw :)
Plurai
@natalie_ermishina Great question, Natalie! We use an 'intent calibration' process that fine-tunes evals and guardrails to match your subjective expectations. We generate a custom training set to demonstrate the classification, then let you iterate via an agentic experience until the results are exactly where you want them.
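(A rough sketch of what that kind of calibration loop can look like in Python. The function names are hypothetical placeholders, not Plurai's interface; the idea is generate, review, fold feedback back in, repeat.)

```python
# Toy "intent calibration" loop: generate candidate labeled examples, let the user
# correct them, fold the corrections back into the spec, and repeat until it matches intent.
# `generate_examples` and `get_user_feedback` are hypothetical stand-ins.
def calibrate_intent(spec, generate_examples, get_user_feedback, max_rounds=5):
    dataset = generate_examples(spec)
    for _ in range(max_rounds):
        feedback = get_user_feedback(dataset)      # user corrects labels / adds guidance
        if not feedback:                           # nothing to fix: labels match intent
            return dataset
        spec = spec + "\n" + "\n".join(feedback)   # fold corrections into the spec
        dataset = generate_examples(spec)
    return dataset
```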
@reut_v_plurai Thanks )) It makes sense) The iteration part and 'intent calibration' sound esp valuable for subjective cases! ))
Plurai
@natalie_ermishina Thanks a lot, really appreciate it!
Great question on subjective tasks — that’s actually where this approach becomes even more interesting. Instead of relying on a generic judge, we define subjectivity explicitly (via the spec / examples), and then generate diverse boundary cases around that intent. The key is that labels aren’t coming from a single model; they’re validated through multi-agent debate, which helps reduce ambiguity and noise in more nuanced cases.
In practice, we’ve seen that once the SLM is trained on this kind of task-specific, high-fidelity data, it handles subjective criteria (tone, style, compliance, etc.) much more consistently than LLM-as-a-judge setups.
We go deeper into this (and share benchmarks) in the paper:
https://huggingface.co/papers/2604.25203
Would love to hear what kind of subjective evals you’re thinking about, that’s exactly where things get interesting 🙂
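(A rough sketch of what debate-style label validation can look like in general. The judges are stand-ins for independent model calls and the consensus rule is illustrative; this is not the exact pipeline from the paper.)

```python
from collections import Counter

# Debate-style label validation, in the abstract: several judges label the same example,
# each seeing the other judges' earlier reasoning, and the label is only kept on consensus.
# `judges` are stand-ins for independent model calls; everything here is illustrative.
def debate_validate(example, judges, rounds=2, min_agreement=1.0):
    transcript, labels = [], []
    for _ in range(rounds):
        labels = []
        for judge in judges:
            label, reasoning = judge(example, transcript)  # returns (label, rationale)
            transcript.append(reasoning)
            labels.append(label)
    top_label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return top_label   # consensus: keep this label in the training set
    return None            # no consensus: drop or escalate the example
```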
Plurai
@natalie_ermishina We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@natalie_ermishina Thanks a lot — really appreciate it!
On subjective tasks, we make the criteria explicit (spec + examples), generate boundary cases, and validate them with multi-agent debate — that’s what makes it consistent in practice
We shared more details here: https://huggingface.co/papers/2604.25203
Curious — what kind of subjective evals are you dealing with today?
@ilankad23 Thank you for the reply, gonna have a look at the links first :)
Plurai
@natalie_ermishina glad you liked it, thank you!
You mentioned 43% fewer failures; was that averaged across different types of tasks, or does the industry have specific benchmarks for that?
Plurai
@michael_vavilov Great question!
The 43% fewer-failures figure comes from our research benchmarks across multiple tasks (conversational policies, agent workflows, compliance), not a single narrow use case. In the paper, we evaluate across different domains and datasets, and consistently see that task-specific models trained with our method outperform LLM-as-a-judge baselines and generic guardrails.
If you want the full breakdown (datasets, tasks, and comparisons), we shared it here:
https://huggingface.co/papers/2604.25203
Curious what kind of failures you’re measuring today?
Plurai
@michael_vavilov We're here if you have any more questions! Let us know what you think once you try it out!
NovaVoice
If this actually reduces hallucinations or cost + policy violations at scale, that's huge!
That's where most of the pain is for me
Plurai
@redzumi Totally hear you, that’s exactly the pain we built this for.
What we’re seeing in practice is that once you move from generic LLM-as-a-judge to a task-specific SLM trained on synthetic + debate-validated data, you get:
Fewer hallucinations / policy misses (because the model actually learns your failure modes, not generic ones)
Much lower cost + latency (small model, real-time)
And the ability to enforce on every interaction, not just sampled evals
It’s not magic, the key is the data. The paper shows that without proper validation, label noise kills performance, but with debate-based verification you get much cleaner signals and significantly better accuracy.
If you’re feeling this pain in production, you’re exactly the ICP we’re building for. Curious what kinds of failures are hurting you most today?
Plurai
@redzumi That really validates what we've been hearing; it's exactly the pain we want to prevent! Let me know if we managed to do it for you!
Plurai
@redzumi We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@redzumi proof is in the pudding. Try it yourself! plurai.ai/launch
Plurai
@redzumi Indeed, in our research paper we demonstrate how our approach significantly reduces failures, hallucinations, and cost.
Plurai
@redzumi @ilankad23 cool!
BlogBowl
Congrats on the launch, does it work with all LLMs that provide fine-tuning capabilities?
Plurai
@danshipit Thank you! Looping in @ilan_kadar to answer your question.
Plurai
@danshipit On the LLM optimization path we're fully model agnostic. On the SLM path we train the model ourselves on your policies — so either way, no fine-tuning on your end.
Plurai
@danshipit let us know what you thought!
Plurai
@danshipit Yes!
The multi-agent debate validation is the part I want to understand better. How do you keep the debate from converging on the same model's biases? Different model families per agent, or the same base with different role prompts? Asking because validation-by-consensus often inherits failure modes from the underlying judge, and avoiding that is the actual hard problem.
Plurai
@fredcallagan Thank you for your comment
Plurai
@fredcallagan looping in @ilankad23 and @reut_v_plurai to answer your question