VoxCPM - Tokenizer-free TTS for true-to-life voice

VoxCPM is a new open-source, tokenizer-free TTS model. By modeling speech in a continuous space, it overcomes the limitations of discrete tokens to deliver highly expressive, context-aware speech generation and incredibly realistic zero-shot voice cloning.

Hi everyone!

The next big challenge for TTS isn't just clarity, but expressiveness. Many models sound clear, but still feel a bit robotic because they break speech down into discrete tokens, losing the natural flow of the human voice.

VoxCPM from the OpenBMB and ModelBest teams takes a different path. It's a "tokenizer-free" model, and you can really hear the difference in the final output.

Two things really stand out to me. First, its context-aware generation: it can read a piece of text and automatically know whether to sound like a storyteller or a weather reporter. Second, the zero-shot voice cloning is incredibly realistic, capturing not just the timbre but also the unique accent and emotional tone of the speaker.

It's an open-source model and runs efficiently on consumer GPUs.
