Launched this week
Vibe training for AI agent reliability. Describe what your agent should and should not do — Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

TabAI
The multi-turn simulation piece is interesting.
Single prompt evals are easy, but most real failures happen across a sequence of interactions.
If this actually captures that well, that’s a meaningful step up from most eval tooling I’ve seen.
Plurai
@igor_martinyuk Exactly. That's one of the challenges we've been facing, and a main differentiator.
Plurai
@igor_martinyuk Exactly! Most real failures aren’t single turns, they’re stateful across interactions.
That’s why we simulate multi-turn flows and generate edge cases across the sequence, not just isolated prompts, including those “looks fine at each step, breaks at the end” scenarios.
Curious to hear: what kinds of multi-turn failures have you seen most often?
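(A minimal sketch of the multi-turn idea described above, in Python. The agent, scenario, and final-state check are hypothetical stand-ins, not Plurai's API; the point is just that the judgement runs over the whole conversation rather than any single turn.)

```python
# Toy multi-turn evaluation loop: every turn may look fine in isolation,
# but the check at the end is made against the full conversation history.
# `agent` and `check_final_state` are hypothetical placeholders.
def run_multi_turn_eval(agent, scenario_turns, check_final_state):
    history = []
    for user_msg in scenario_turns:
        reply = agent(user_msg, history)          # agent sees all prior turns
        history.append((user_msg, reply))
    # A per-turn judge would miss failures that only emerge here, e.g. a refund
    # promised in turn 2 that the agent contradicts in the final turn.
    return check_final_state(history)
```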
Plurai
@igor_martinyuk We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@igor_martinyuk Glad you love it!
Vibe training is such a good framing, finally something that matches how teams actually think about agent behavior. cheers team 🙌
BTW, what happens when two guardrails conflict with each other at runtime?
Plurai
@boyuan_deng1 thank you :) we're also obsessed with the framing 🤩
Each guardrail returns its classification and reasoning, and your “state machine” can decide how to arbitrate between the two with the full context.
Plurai
@reut_v_plurai great answer, and "state machine" is exactly the right mental model here 🎯
Plurai
@boyuan_deng1 means a lot, we're obsessed with that framing too 🙌 @reut_v_plurai nailed the answer below — each guardrail returns its classification + reasoning, so your logic layer has full context to resolve conflicts. Not just verdicts, actual signal.
Plurai
@boyuan_deng1 Did you get a chance to try the product? Curious what you think!
Plurai
@boyuan_deng1 Each guardrail returns its classification, confidence, and reasoning, so you have the full context.
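(A minimal sketch of that resolution pattern, assuming each guardrail returns a classification, confidence, and reasoning as described above. The GuardrailResult shape and the resolve policy are illustrative only, not Plurai's actual API.)

```python
from dataclasses import dataclass

# Hypothetical guardrail result shape, based on the "classification + confidence + reasoning"
# description above; field names are assumptions, not Plurai's actual API.
@dataclass
class GuardrailResult:
    name: str
    classification: str   # e.g. "allow" or "block"
    confidence: float     # 0.0 to 1.0
    reasoning: str

def resolve(results):
    """Toy conflict-resolution 'state machine': block wins, ties broken by confidence."""
    blocks = [r for r in results if r.classification == "block"]
    if blocks:
        return max(blocks, key=lambda r: r.confidence)
    return max(results, key=lambda r: r.confidence)

decision = resolve([
    GuardrailResult("pii_filter", "block", 0.92, "Response echoes the user's card number."),
    GuardrailResult("helpfulness", "allow", 0.71, "Answer addresses the refund question."),
])
print(decision.name, decision.classification, decision.reasoning)
```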
Oh, this looks really cool, esp the idea of running evals on every interaction (not just samples). Just curious how it performs on more subjective tasks though))) And congrats on the launch, btw :)
Plurai
@natalie_ermishina Great question, Natalie! We use an 'intent calibration' process that fine-tunes evals and guardrails to match your subjective expectations. We generate a custom training set to demonstrate the classification, then let you iterate via an agentic experience until the results are exactly where you want them.
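(A rough sketch of what that kind of calibration loop can look like in Python. The function names are hypothetical placeholders, not Plurai's interface; the idea is generate, review, fold feedback back in, repeat.)

```python
# Toy "intent calibration" loop: generate candidate labeled examples, let the user
# correct them, fold the corrections back into the spec, and repeat until it matches intent.
# `generate_examples` and `get_user_feedback` are hypothetical stand-ins.
def calibrate_intent(spec, generate_examples, get_user_feedback, max_rounds=5):
    dataset = generate_examples(spec)
    for _ in range(max_rounds):
        feedback = get_user_feedback(dataset)      # user corrects labels / adds guidance
        if not feedback:                           # nothing to fix: labels match intent
            return dataset
        spec = spec + "\n" + "\n".join(feedback)   # fold corrections into the spec
        dataset = generate_examples(spec)
    return dataset
```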
@reut_v_plurai Thanks )) It makes sense) The iteration part and 'intent calibration' sound esp valuable for subjective cases! ))
Plurai
@natalie_ermishina Thanks a lot, really appreciate it!
Great question on subjective tasks — that’s actually where this approach becomes even more interesting. Instead of relying on a generic judge, we define subjectivity explicitly (via the spec / examples), and then generate diverse boundary cases around that intent. The key is that labels aren’t coming from a single model; they’re validated through multi-agent debate, which helps reduce ambiguity and noise in more nuanced cases.
In practice, we’ve seen that once the SLM is trained on this kind of task-specific, high-fidelity data, it handles subjective criteria (tone, style, compliance, etc.) much more consistently than LLM-as-a-judge setups.
We go deeper into this (and share benchmarks) in the paper:
https://huggingface.co/papers/2604.25203
Would love to hear what kind of subjective evals you’re thinking about, that’s exactly where things get interesting 🙂
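(A rough sketch of what debate-style label validation can look like in general. The judges are stand-ins for independent model calls and the consensus rule is illustrative; this is not the exact pipeline from the paper.)

```python
from collections import Counter

# Debate-style label validation, in the abstract: several judges label the same example,
# each seeing the other judges' earlier reasoning, and the label is only kept on consensus.
# `judges` are stand-ins for independent model calls; everything here is illustrative.
def debate_validate(example, judges, rounds=2, min_agreement=1.0):
    transcript, labels = [], []
    for _ in range(rounds):
        labels = []
        for judge in judges:
            label, reasoning = judge(example, transcript)  # returns (label, rationale)
            transcript.append(reasoning)
            labels.append(label)
    top_label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return top_label   # consensus: keep this label in the training set
    return None            # no consensus: drop or escalate the example
```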
Plurai
@natalie_ermishina We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@natalie_ermishina Thanks a lot — really appreciate it!
On subjective tasks, we make the criteria explicit (spec + examples), generate boundary cases, and validate them with multi-agent debate — that’s what makes it consistent in practice
We shared more details here: https://huggingface.co/papers/2604.25203
Curious — what kind of subjective evals are you dealing with today?
@ilankad23 Thank you for the reply, gonna have a look at the links first :)
Plurai
@natalie_ermishina glad you liked it, thank you!
You mentioned 43% fewer failures; was that averaged across different types of tasks, or does the industry have specific benchmarks for that?
Plurai
@michael_vavilov Great question!
The 43% fewer-failures figure comes from our research benchmarks across multiple tasks (conversational policies, agent workflows, compliance), not a single narrow use case. In the paper, we evaluate across different domains and datasets, and consistently see that task-specific models trained with our method outperform LLM-as-a-judge baselines and generic guardrails.
If you want the full breakdown (datasets, tasks, and comparisons), we shared it here:
https://huggingface.co/papers/2604.25203
Curious what kind of failures you’re measuring today?
Plurai
@michael_vavilov We're here if you have any more questions! Let us know what you think once you try it out!
NovaVoice
If this actually reduces hallucinations or cost + policy violations at scale, that's huge!
That's where most of the pain is for me
Plurai
@redzumi Totally hear you, that’s exactly the pain we built this for.
What we’re seeing in practice is that once you move from generic LLM-as-a-judge to a task-specific SLM trained on synthetic + debate-validated data, you get:
Fewer hallucinations / policy misses (because the model actually learns your failure modes, not generic ones)
Much lower cost + latency (small model, real-time)
And the ability to enforce on every interaction, not just sampled evals
It’s not magic, the key is the data. The paper shows that without proper validation, label noise kills performance, but with debate-based verification you get much cleaner signals and significantly better accuracy.
If you’re feeling this pain in production, you’re exactly the ICP we’re building for. Curious what kinds of failures are hurting you most today?
Plurai
@redzumi That really validates what we've been hearing; it's exactly the pain we want to prevent! Let me know if we managed to do it for you!
Plurai
@redzumi We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@redzumi proof is in the pudding. Try it yourself! plurai.ai/launch
Plurai
@redzumi Indeed, in our research paper we demonstrate how our approach significantly reduces failures, hallucinations, and cost.
Plurai
@redzumi @ilankad23 cool!
BlogBowl
Congrats on the launch, does it work with all LLMs that provide fine-tuning capabilities?
Plurai
@danshipit Thank you! Looping in @ilan_kadar to answer your question.
Plurai
@danshipit On the LLM optimization path we're fully model agnostic. On the SLM path we train the model ourselves on your policies — so either way, no fine-tuning on your end.
Plurai
@danshipit let us know what you thought!
Plurai
@danshipit Yes!
The multi-agent debate validation is the part I want to understand better. How do you keep the debate from converging on the same model's biases? Different model families per agent, or the same base with different role prompts? Asking because validation-by-consensus often inherits failure modes from the underlying judge, and avoiding that is the actual hard problem.
Plurai
@fredcallagan Thank you for your comment
Plurai
@fredcallagan looping in @ilankad23 and @reut_v_plurai to answer your question