Cekura - Observe and analyze your voice and chat AI agents

30+ out-of-the-box predefined metrics for analysis across CX, accuracy, conversation, and voice quality. Calibrate accurate LLM judges by annotating just ~20 conversations and auto-improve them in Cekura Labs. Real-time, segmented dashboards to identify trends in conversational AI. Smart statistical alerts so that you get notified only when metrics shift from historical baselines. Automated system pings to catch silent production failures.
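
A back-of-the-envelope illustration of the baseline-alert idea (a minimal Python sketch; the z-score threshold and metric values are assumptions for illustration, not Cekura's actual implementation):

```python
from statistics import mean, stdev

def should_alert(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag a metric only when it shifts meaningfully from its baseline.

    history: past values of a metric (e.g. daily CSAT or latency scores).
    A plain z-score test; a production system would also handle seasonality,
    minimum sample sizes, and near-zero-variance metrics.
    """
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu  # any move off a perfectly flat baseline is notable
    return abs(current - mu) / sigma > z_threshold

# A week of daily "voice quality" scores, then two candidate readings.
baseline = [0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.92]
print(should_alert(baseline, 0.90))  # False -- within normal variation
print(should_alert(baseline, 0.70))  # True  -- shifted from the historical baseline
```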

Pratyush Saini

What aspects of voice does it capture? I wanted to test the tonality and personality of my voice agent; is that achievable?

Sidhant Kabra

@pratyush1505 We have voice clarity and gibberish detection as metrics to capture the voice aspect of the agent

Satvik Dixit

@pratyush1505 For testing the personality of the agent, you can also check out the Customer Satisfaction (CSAT) and Sentiment metrics

Shashij Gupta

@pratyush1505 You can also use the voice clarity metric, which checks how clear the voice is

Yash Jain

Can we use Cekura to benchmark STT / TTS separately as well, or is it only used for Voice AI agents?

Sidhant Kabra

@yash_jain49 Yes, we have TTS-specific metrics like Pronunciation Issues and Voice Quality, and we measure transcription accuracy to compare STT providers.

While simulations run on full Voice AI agents, you can run simulations with the same set of test cases and the same config on your main agent, changing only the STT or TTS provider
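
Roughly, the controlled comparison could be expressed like this (a hypothetical config sketch; the field names and provider labels are made up for illustration, not Cekura's actual API):

```python
# Hold the agent, test cases, and metrics constant; vary exactly one component
# per run so any metric delta is attributable to the swapped provider.
base_config = {
    "agent": "support-agent-v3",
    "test_cases": "regression-suite-42",  # same scenarios for every run
    "metrics": ["transcription_accuracy", "pronunciation_issues", "voice_quality"],
}

runs = [
    {**base_config, "stt": "provider_a", "tts": "provider_x"},  # baseline
    {**base_config, "stt": "provider_b", "tts": "provider_x"},  # only STT swapped
    {**base_config, "stt": "provider_a", "tts": "provider_y"},  # only TTS swapped
]
```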

Shashij Gupta

@yash_jain49 I'm not able to understand you completely. What do you mean by "separately" here?

Dhruv Jaglan

Are these predefined metrics all audio-based or text-based?

Sidhant Kabra

@dhruvjaglan It's a mix. All the voice-specific metrics (silence, latency, interruptions, pronunciation issues, etc.) need audio. Accuracy metrics (relevancy, hallucination, response consistency, etc.) are text-based
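
A rough sketch of the difference (illustrative Python only; the silence threshold, and the keyword check standing in for an LLM judge, are assumptions rather than Cekura's implementation):

```python
import numpy as np

def longest_silence_seconds(samples: np.ndarray, sample_rate: int,
                            threshold: float = 0.01) -> float:
    """Audio-based metric: duration of the longest below-threshold stretch.
    samples is mono audio normalized to [-1, 1]; a production detector would
    use windowed energy or a VAD, but either way it needs the waveform."""
    quiet = np.abs(samples) < threshold
    longest = run = 0
    for q in quiet:
        run = run + 1 if q else 0
        longest = max(longest, run)
    return longest / sample_rate

def keyword_coverage(transcript: str, required: set[str]) -> float:
    """Text-based metric: operates on the transcript alone. (A stand-in for an
    LLM-judged relevancy or hallucination check, which is likewise text-only.)"""
    words = set(transcript.lower().split())
    return len(required & words) / len(required)

# One second of pure silence embedded in three seconds of noise at 16 kHz.
audio = np.random.uniform(-1, 1, 16000 * 3)
audio[16000:32000] = 0.0
print(longest_silence_seconds(audio, 16000))  # ~1.0
print(keyword_coverage("your refund was processed", {"refund", "processed"}))  # 1.0
```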

Shashij Gupta

@dhruvjaglan Some are text-based and some are voice-based

Dileep

Excited to see this go live! 🚀

Working on our voice simulations and agent stack taught me that reliability is all about the nuances. We built Cekura to give developers the specific visibility needed to master those details and move past the guesswork.

Can't wait to see everyone dive into the labs and start leveling up their agents!

Mykola Kondratiuk

The silent production failure detection is what catches my eye. When you're running AI agents in prod, the scariest failures are the ones where nothing errors out - it just gives bad output for days without anyone noticing. Curious how Cekura handles the baseline drift problem - do you need a human to label 'good' vs 'bad' outputs, or does it pick that up automatically?

Sidhant Kabra

@mykola_kondratiuk Human labelling is recommended for any metric you define - you label only 20 calls in our optimizer to ensure the LLM-as-a-judge covers all the edge cases

Mykola Kondratiuk

20 calls to bootstrap the judge is surprisingly low - that's actually pretty approachable for most teams. The LLM-as-judge approach makes sense for scale once you've got those calibration samples.

Shashij Gupta

@mykola_kondratiuk Human labelling helps fine-tune the metric and makes it highly accurate at good/bad identification. At scale, the metric then goes on to evaluate thousands of calls with very high accuracy
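
In sketch form, the two-phase flow might look like this (a toy keyword heuristic stands in for the real LLM judge, and all names and data here are hypothetical, not Cekura's API):

```python
from typing import Callable

def toy_judge(rubric: set[str]) -> Callable[[str], bool]:
    # Stand-in for an LLM judge: in practice this would be an LLM call with a
    # rubric prompt that gets refined until it matches the human labels.
    return lambda transcript: any(k in transcript.lower() for k in rubric)

def agreement(judge: Callable[[str], bool],
              labeled: list[tuple[str, bool]]) -> float:
    """Phase 1: fraction of the ~20 human-labeled calls the judge matches."""
    return sum(judge(t) == label for t, label in labeled) / len(labeled)

# The ~20 human-labeled calls would go here; two shown for brevity.
labeled_calls = [
    ("I'm sorry, let me transfer you to a human.", True),    # good escalation
    ("As I said before, as I said before, as I...", False),  # looping failure
]

judge = toy_judge({"transfer", "sorry"})
assert agreement(judge, labeled_calls) == 1.0  # calibrated against the labels

# Phase 2: the calibrated judge scores production calls it has never seen.
production = ["Let me transfer you now.", "as I said before, as I said..."]
print([judge(t) for t in production])  # [True, False]
```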

Mykola Kondratiuk

Right - the labelling bootstraps the judge, then the judge scales. Makes sense as a two-phase approach.

Mykola Kondratiuk

Glad it landed well. Good luck with the launch!

Tarush Agarwal

Huge congrats to the team! 🚀 Such a solid group of builders. This solves a lot of different use cases: instant alerting, human-in-the-loop reviews, A/B testing, and more, without feeling cluttered.

Sergio

Congrats on #2, @Cekura

Just flagged a UX loop on mobile signup: it's showing 'User Not Found' and forcing a logout for new users. It looks like a system crash rather than a filter.

I've got the fix details ready to help you keep your conversion high today. Where can I send the report?

Sidhant Kabra

@sergioding Oh, can you share the report at support@cekura.ai? That would be really helpful

Sergio

@kabra_sidhant Thanks, just sent the fix report and the UX optimization steps to your support email.

Dharamveer singh

@sergioding Likely caused by unsupported email domains: Gmail, iCloud, and other public providers aren't allowed, which triggers the 'User Not Found' error. Recommend using a work email (e.g., @cekura.ai).
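
A minimal sketch of the suggested fix, assuming a simple public-domain check (the domain list and message wording are hypothetical, not Cekura's actual signup code):

```python
# Return a specific, actionable validation message for public email domains
# instead of a generic "User Not Found" that reads like a crash.
PUBLIC_DOMAINS = {"gmail.com", "icloud.com", "yahoo.com", "outlook.com"}

def signup_error(email: str) -> str | None:
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in PUBLIC_DOMAINS:
        return "Please sign up with a work email; personal domains aren't supported."
    return None  # no error -> continue the signup flow

print(signup_error("alice@gmail.com"))    # explicit message instead of a dead end
print(signup_error("alice@company.com"))  # None -> proceed
```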

Sergio

@dddharamveeer Exactly: it's the Gmail/iCloud filter triggering a 'User Not Found' state. On mobile, that feels like a system crash to a new user. I've mapped out the fix to keep your enterprise funnel clean while you're at #3. Let's keep the momentum going!

Himank Jain

Congrats on the launch, team!

What challenges come up when teams try to build this internally?

Sidhant Kabra

@himank_jain1 Building and optimizing each metric over a dataset takes months of engineering effort and fine-tuning. A lot of these metrics are not even LLM-based but use heuristics and statistical models. Having said that, a team can build a basic analytics dashboard if voice metrics or smart alerts aren't that important and they only need to analyze a few specific workflow metrics

Rishabh Sanjay

@himank_jain1 Another challenge arises when a new LLM enters the market. If we want to switch, whether because the new model is better or because the old one is being deprecated, we have to re-optimize all our prompt metrics against the eval set, which is a huge undertaking. This makes the eval set the most important factor; it stays constant, while the prompts change regularly to adapt to new LLMs.

Satvik Dixit

Super excited to see this out!

Got to work closely on the metrics side of things. Seeing it come together into something teams can actually rely on in production is incredibly satisfying.

Huge shoutout to the team for pushing this across the finish line.

Roshan Rajak | byteio.in

Big congrats to the @Cekura team on the launch! 🚀