Zac Zuo

Cohere Transcribe - New state-of-the-art in open source speech recognition

Cohere Transcribe is a state-of-the-art, 2B-parameter open-weights speech recognition model. Optimized for enterprise workloads, it delivers high throughput and a leading 5.42% average word error rate (WER) across 14 languages, making it well suited to private, local, or desktop deployment.

Zac Zuo

Hi everyone!

Cohere just open-sourced Transcribe, and the core metrics here, especially the throughput and the 5.42% average WER, are genuinely impressive.
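
For anyone new to the metric: WER is the word-level edit distance (substitutions, deletions, insertions) between the model's output and a reference transcript, divided by the number of reference words. Here is a minimal sketch of that calculation. The function name and the sample sentences are my own, and real benchmarks usually normalize casing and punctuation first, which this skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one deletion (jumps) over 5 reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))  # 0.4
```

So a 5.42% average means roughly 5 to 6 word errors per 100 reference words, averaged over the evaluated languages.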

From an engineering point of view, this looks like a fantastic model for Mac/PC local apps or private enterprise servers. At 2B parameters, though, it still feels a bit heavy for raw on-device mobile deployment.
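
For a rough sense of why 2B parameters feels heavy on a phone, here is a back-of-the-envelope estimate of weight memory at a few precisions. It assumes exactly 2e9 parameters and counts weights only (no activations, audio buffers, or runtime overhead). The precision list is my own illustration, and I am not claiming that quantized variants are officially available.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 2e9  # assumption: the "2B" figure taken as exactly 2 billion parameters

# Hypothetical precisions, for illustration only.
for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name:>9}: {weight_memory_gib(PARAMS, nbytes):.2f} GiB")
```

At 16-bit precision that is already about 3.7 GiB for weights alone, which is comfortable on a laptop but a large share of the memory on most phones.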

It is also worth noting that this is a highly optimized transcription engine, not a fully packaged meeting intelligence stack. Out of the box, you will still want to add your own layer for things like word-level timestamps and speaker diarization.
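
As a sketch of what that extra layer could look like for speaker labels: assume you chunk the audio yourself, so every transcribed chunk has a known start and end time, and that a separate diarization tool gives you speaker turns. Each chunk then gets the speaker whose turn overlaps it the most. `Segment`, `SpeakerTurn`, and `label_segments` are hypothetical names of mine, not part of any Cohere API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed audio chunk with known boundaries (seconds)."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerTurn:
    """One speaker turn from a separate diarization step (seconds)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Attach to each segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


segments = [Segment(0.0, 4.0, "Hi everyone."), Segment(4.0, 9.0, "Thanks for joining.")]
turns = [SpeakerTurn(0.0, 4.5, "A"), SpeakerTurn(4.5, 10.0, "B")]
for speaker, text in label_segments(segments, turns):
    print(f"[{speaker}] {text}")
```

This only labels whole chunks. Word-level timestamps would need a separate alignment step on top.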

It also seems to perform best when you specify the language and avoid heavy code-switching.

But if you handle those pre- and post-processing steps and keep the audio mostly in a single language, this open-weight model looks extremely strong for privacy-first, local speech workflows.

swati paliwal

@zaczuo Have you tested quantization or distillation tweaks to slim it down for mobile edge cases, like real-time podcast transcription on iOS?

S. Ferit Arslan

@zaczuo This is awesome! I used Cohere's Rerank API while building Octopus (an open-source AI code reviewer), and it worked great for improving search relevance across codebases. Love seeing Cohere push more into open source. Transcribe looks really promising, especially for privacy-first use cases.

Germán Merlo

Wow Zac! Was looking for something like this. What about the pricing? Is there a benchmark to compare with current competitors?

Alexia Li

Cohere Transcribe looks like a game-changer for privacy-focused teams needing fast, accurate transcription across multiple languages. For non-technical users, will there be an “out-of-the-box” version with speaker labeling and timestamps?

Mohammed Abdul Saboor

How does it handle noisy environments and accented speech? That's where most models still drop off. I'd love to try this out in our office, where people speak with diverse accents and sit in close quarters. Any reviews from users?