Cekura

Automated QA for Voice AI and Chat AI agents

1.8K followers

AI Metrics and Evaluation

Cekura enables Conversational AI teams to automate QA across the entire agent lifecycle—from pre-production simulation and evaluation to monitoring of production calls. We also support seamless integration into CI/CD pipelines, ensuring consistent quality and reliability at every stage of development and deployment.

This is the 3rd launch from Cekura.

Cekura

Launched this week

Observe and analyze your voice and chat AI agents

Out-of-the-box 30+ predefined metrics for analysis on CX, accuracy, conversation and voice quality. Compile perfect LLM judges by annotating just ~20 conversations and auto-improve in Cekura labs. Real-time, segmented dashboards to identify trends in Conversational AI. Smart statistical alerts so that you get notified only when metrics shift from historical baselines. Automated system pings to catch silent production failures.
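One way to picture an alert that fires only when a metric shifts from its historical baseline is a simple z-score check. The sketch below is illustrative only: the function name, the threshold of 3 standard deviations, and the sample latency values are assumptions, not Cekura's actual alerting logic.

```python
from statistics import mean, stdev


def shifted_from_baseline(history, current, z_threshold=3.0):
    """Return True when `current` lies more than `z_threshold` standard
    deviations away from the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        return False  # too little history to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all counts as a shift.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Made-up daily average response latency in seconds.
baseline = [1.10, 1.05, 1.20, 1.15, 1.08, 1.12, 1.18]

print(shifted_from_baseline(baseline, 1.14))  # inside the usual range
print(shifted_from_baseline(baseline, 2.40))  # far above the usual range
```

Comparing against the metric's own history, rather than a fixed limit, is what keeps such an alert quiet during normal day-to-day variation.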

Free Options

Launch tags: SaaS • Developer Tools • Audio

The silent production failure detection is what catches my eye. When you're running AI agents in prod, the scariest failures are the ones where nothing errors out - it just gives bad output for days without anyone noticing. Curious how Cekura handles the baseline drift problem - do you need a human to label 'good' vs 'bad' outputs, or does it pick that up automatically?

6d ago

Cekura

Maker

@mykola_kondratiuk Human labelling is recommended for any metric you define - you label only 20 calls in our optimizer to ensure the LLM-as-a-judge covers all the edge cases

6d ago

20 calls to bootstrap the judge is surprisingly low - that's actually pretty approachable for most teams. The LLM-as-judge approach makes sense for scale once you've got those calibration samples.

6d ago

Cekura

Maker

@mykola_kondratiuk Human labelling helps fine-tune the metric and makes it highly accurate at good/bad identification. At scale, the metric then goes on to evaluate thousands of calls with very high accuracy.

6d ago

Right - the labelling bootstraps the judge, then the judge scales. Makes sense as a two-phase approach.

6d ago
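The two-phase idea in this exchange (label about 20 calls, then let the judge scale) can be sketched as a check of how well a judge's verdicts agree with the human labels. Everything here is hypothetical: the function name, the "good"/"bad" label values, and the sample verdicts are made up for illustration, and the thread does not describe how Cekura's optimizer works internally.

```python
def judge_agreement(human_labels, judge_labels):
    """Compare a judge's good/bad verdicts against human labels.

    A "positive" here means the judge flagged a call as bad, so a false
    positive is a call the human marked good but the judge marked bad.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("at least one labelled call is required")
    pairs = list(zip(human_labels, judge_labels))
    agree = sum(h == j for h, j in pairs)
    false_pos = sum(h == "good" and j == "bad" for h, j in pairs)
    false_neg = sum(h == "bad" and j == "good" for h, j in pairs)
    return {
        "accuracy": agree / len(pairs),
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


# 20 made-up calibration calls: 14 labelled good, 6 labelled bad.
human = ["good"] * 14 + ["bad"] * 6
judge = ["good"] * 13 + ["bad"] * 1 + ["bad"] * 5 + ["good"] * 1

print(judge_agreement(human, judge))
```

Disagreements surfaced this way are the calls worth inspecting before trusting the judge on thousands of unlabelled conversations.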

Cekura

Maker

@mykola_kondratiuk Exactly!

6d ago

glad it landed well. good luck with the launch!

6d ago

Congrats on #2, @Cekura

Just flagged a UX loop on mobile signup: it's showing 'User Not Found' and forcing a logout for new users. It looks like a system crash rather than a filter.

I've got the fix details ready to help you keep your conversion high today. Where can I send the report?

6d ago

Cekura

Maker

@sergioding Oh, can you share a report at support@cekura.ai? It will be really helpful.

6d ago

@kabra_sidhant Thanks, just sent the fix report and the UX optimization steps to your support email.

6d ago

Cekura

Maker

@sergioding This is likely caused by unsupported email domains: Gmail, iCloud, and other public providers aren’t allowed, which triggers the ‘User Not Found’ message. We recommend signing up with a work email (e.g., @cekura.ai).

6d ago

@dddharamveeer Exactly, it’s the Gmail/iCloud filter triggering a 'User Not Found' state. On mobile, that feels like a system crash to a new user. I've mapped out the fix to keep your enterprise funnel clean while you're at #3. Let's keep the momentum going!

6d ago

@kabra_sidhant Congrats on the launch! Great to see how Cekura shifts the focus from "is the AI up?" to "is the AI behaving correctly?" for voice and chat agents. It was a missing layer for teams shipping real-world conversational AI at scale. How do you handle wildly different voice and chat agent use cases? Is there a general approach?

6d ago

Cekura

Maker

@kabra_sidhant @randhir_kumar7 We find that all conversational agents (chat or voice) need similar metrics to evaluate the content of the conversation, such as relevancy, hallucination, and customer satisfaction.
Voice agents add complexity, so we also have metrics for interruption, latency, pronunciation, and voice quality.
For use-case-specific evaluation (did the agent book the appointment? collect insurance info?), teams can write custom LLM Judge metrics in plain English.

6d ago

When Cekura flags an issue in production, what does fixing it actually look like in practice? Do teams usually retrain models, tweak prompts, or handle it more on a case‑by‑case basis?

6d ago

Cekura

Maker

@jared_salois There are 3 types of issues:

prompt level - you tweak the prompt
model level - you A/B test and measure tradeoffs
config level - it is case by case. For example, an abrupt silence during a certain tool call can happen because the connection was not set up correctly

6d ago

Nas.io

How do you handle false positives in sentiment or hallucination detection?

6d ago

Cekura

Maker

@nuseir_yassin1 That's where our metric optimizer comes in. You can use it not only for your custom metrics but also to give feedback on our predefined metrics when they produce false positives, so they auto-improve.

6d ago

Congratulations on the launch!!

Do you guys also support on-prem deployment to ensure privacy?

7d ago

Cekura

Maker

@nikunjagarwal321 We support VPC deployments on the customer's own instance. Additionally:

We sign BAAs and DPAs with customers
We redact PII on our side from both audio and transcripts

7d ago

Cekura

Maker

@nikunjagarwal321 yes we do

6d ago

Can we use Cekura to benchmark STT / TTS separately as well, or is it only used for Voice AI agents?

7d ago

Cekura

Maker

@yash_jain49 Yes. We have TTS-specific metrics like Pronunciation Issues and Voice Quality, and we measure Transcription Accuracy to compare STT providers.

Simulations run on Voice AI agents, but you can run the same set of test cases with the same config on your main agent, changing only the STT or TTS provider.

7d ago
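To make the STT comparison concrete: a common way to score transcription accuracy is word error rate (WER), the word-level edit distance between a reference transcript and the STT output, divided by the reference length. The reply does not say how Cekura computes its Transcription Accuracy metric, so the sketch below uses plain WER as a stand-in, with made-up transcripts and provider names.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Same test utterance run through two hypothetical STT providers.
reference = "please book an appointment for tuesday at noon"
provider_a = "please book an appointment for tuesday at noon"
provider_b = "please book appointment for thursday at noon"

print(word_error_rate(reference, provider_a))  # 0.0
print(word_error_rate(reference, provider_b))  # 0.25
```

Holding the test cases and agent config fixed while swapping only the provider, as the reply suggests, is what makes the two scores comparable.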

Cekura

Maker

@yash_jain49 I don't fully understand the question. What do you mean by "separately" here?

6d ago

Previous Cekura Launches

Cekura: Launch reliable voice & chat AI agents 10x faster

Launched on June 24th, 2025

Vocera: Launch voice agents faster with simulation & monitoring

Launched on November 13th, 2024

Forum Threads

p/vocera • 7d ago

LLM-as-a-judge based monitoring is not enough for Voice AI

Most teams scaling Voice AI think they can monitor quality with a simple LLM prompt. They are wrong.

An LLM can't hear a "crunchy" voice line, it can't accurately measure a 500ms "barge-in," and it struggles with the nuances of true conversational flow.

When we built Cekura Monitoring, we realized we had to go beyond the LLM. We combined Heuristic and Statistical models with our Metric Optimizer to solve the "Scaling Wall."
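As an illustration of why timing needs a heuristic rather than a transcript-reading LLM, here is a minimal sketch that measures barge-ins from speech-segment timestamps. It assumes a definition the post does not spell out: a barge-in is a user turn that starts while the agent is still speaking, and its length is how long the two voices overlap. The function name and the sample timings are hypothetical, and this is not Cekura's implementation.

```python
def barge_in_overlaps(agent_segments, user_segments):
    """Find user turns that begin while the agent is still speaking.

    Segments are (start_ms, end_ms) tuples. Returns a list of
    (user_start_ms, overlap_ms) pairs, where overlap_ms is how long both
    parties were talking at once.
    """
    overlaps = []
    for user_start, user_end in user_segments:
        for agent_start, agent_end in agent_segments:
            if agent_start < user_start < agent_end:
                overlap_ms = min(agent_end, user_end) - user_start
                overlaps.append((user_start, overlap_ms))
                break  # a user turn interrupts at most one agent segment
    return overlaps


# Made-up call timeline in milliseconds.
agent = [(0, 4000), (6000, 9000)]
user = [(3500, 5500), (9500, 11000)]

print(barge_in_overlaps(agent, user))  # [(3500, 500)]
```

The first user turn starts 500 ms before the agent stops talking, which is exactly the kind of detail that never appears in a text transcript.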