Qwen3-TTS - Voice design, cloning & 97ms streaming

Flowtica Scribe

•2mo ago

A family of SOTA speech models (0.6B & 1.7B) supporting 10 languages. Features prompt-based Voice Design, 3s zero-shot cloning, and extremely low-latency streaming.

Replies

Flowtica Scribe

Hunter

📌

Hi everyone!

The Qwen team just dropped what might be the most comprehensive open-source TTS release we have seen. Qwen3-TTS combines three things that are usually mutually exclusive: SOTA quality, extreme speed, and creative control.

The "Voice Design" feature is really robust—just describing the persona (e.g., "sad old man") works surprisingly well.

Technically, the efficiency is wild. They use a 12Hz tokenizer to compress speech without losing detail, bringing the latency down to just 97ms 🤯
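To put the 12Hz figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes the tokenizer emits exactly 12 frames per second, as stated above, and the helper names are made up for illustration; nothing here calls the actual Qwen3-TTS code.

```python
import math

# Assumption: 12 token frames per second, taken from the post above.
TOKEN_RATE_HZ = 12.0


def frame_duration_ms(token_rate_hz: float) -> float:
    """Milliseconds of audio covered by a single token frame."""
    return 1000.0 / token_rate_hz


def frames_for(seconds: float, token_rate_hz: float) -> int:
    """Whole token frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * token_rate_hz)


frame_ms = frame_duration_ms(TOKEN_RATE_HZ)
clip_frames = frames_for(3.0, TOKEN_RATE_HZ)  # a 3s reference clip

print(f"one frame covers about {frame_ms:.1f} ms of audio")
print(f"a 3s clip is about {clip_frames} frames")
```

So one frame covers roughly 83 ms of audio and a 3s clip is only 36 frames, which gives a feel for how short the token sequences are. The quoted 97ms is in the same range as a single frame, but how the model actually reaches that latency is not something this arithmetic shows.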

Open source TTS just raised the bar again. If you are building anything with voice, you might wanna check this out.

Demo Here.

2mo ago

This is seriously impressive. Hitting sub-100ms latency and keeping quality + creative control is rare, especially in open source.

The voice design angle is what excites me most — being able to describe a persona instead of tweaking endless params feels like the right abstraction. This could unlock way more natural voice UX for real products, not just demos.

Big props to the Qwen team 👏

2mo ago

Camocopy

Okay, but which languages? Why not show the 10 supported languages more clearly?

2mo ago