A family of SOTA speech models (0.6B & 1.7B) supporting 10 languages. Features prompt-based Voice Design, 3s zero-shot cloning, and extreme low-latency streaming.
Replies
Flowtica Scribe
Hi everyone!
The Qwen team just dropped what might be the most comprehensive open-source TTS release we have seen. Qwen3-TTS combines three things that are usually mutually exclusive: SOTA quality, extreme speed, and creative control.
The "Voice Design" feature is really robust—just describing the persona (e.g., "sad old man") works surprisingly well.
Technically, the efficiency is wild. They use a 12Hz tokenizer to compress speech without losing detail, bringing the latency down to just 97ms 🤯
Open source TTS just raised the bar again. If you are building anything with voice, you might wanna check this out.
Demo Here.
This is seriously impressive. Hitting sub-100ms latency and keeping quality + creative control is rare, especially in open source.
The voice design angle is what excites me most — being able to describe a persona instead of tweaking endless params feels like the right abstraction. This could unlock way more natural voice UX for real products, not just demos.
Big props to the Qwen team 👏
Camocopy
Okay, but which languages? Why not show the 10 languages more obviously?