Zac Zuo

VoxCPM2 - Open-source 48kHz TTS with voice design and cloning

VoxCPM2 is a 2B open-source TTS model with 30-language support, 48kHz output, voice design from text alone, controllable voice cloning, and real-time streaming fast enough for production voice workflows.

Zac Zuo

Hi everyone!

VoxCPM2 is the next-generation open-source audio model from the @MiniCPM family, and it perfectly continues their signature trait of incredible "capability density": packing all of these features into a model that is only 2B parameters!

Despite its highly compact size, the feature set it brings to the table is quite rare for an open-source release:

  • Voice Design: Instead of hunting for the perfect reference audio to clone, you can just prompt the model directly, e.g. "(A young woman, gentle and sweet voice) Hello world." It generates a completely novel voice on the fly.

  • Native 48kHz Output: It has a built-in super-resolution VAE, meaning no external upsamplers are needed to get studio-quality audio.

  • Controllable Voice Cloning: You can clone a voice from a short clip, but still steer the emotion, pacing, and style using text prompts.

  • Production-Ready: It hits an RTF of ~0.13 for real-time streaming and is fully open-source under the Apache-2.0 license.
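To make two of the points above concrete, here is a rough Python sketch. It does not call VoxCPM2 at all. The first helper only assembles a voice-design prompt in the shape of the example above (a parenthesized description, then the text to speak), and whether the model expects exactly that syntax is an assumption. The second applies the usual definition of real-time factor (synthesis time divided by audio duration) to the quoted ~0.13 figure. Both function names are made up for illustration.

```python
def voice_design_prompt(description: str, text: str) -> str:
    """Prefix the text with a parenthesized voice description.

    Hypothetical helper: the "(description) text" layout is copied from the
    example in the post, not from any documented API.
    """
    # Drop surrounding whitespace and any parentheses the caller already added.
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("description must not be empty")
    return f"({description}) {text.strip()}"


def synthesis_seconds(audio_seconds: float, rtf: float = 0.13) -> float:
    """Estimate synthesis time from a real-time factor (RTF).

    RTF = synthesis time / audio duration, so time = RTF * duration.
    The default of 0.13 is the approximate figure quoted in the post; the
    real value depends on hardware and settings.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


if __name__ == "__main__":
    print(voice_design_prompt("A young woman, gentle and sweet voice", "Hello world."))
    # Estimated seconds needed to synthesize a 10-second clip at RTF 0.13.
    print(round(synthesis_seconds(10.0), 2))
```

At an RTF of about 0.13, a 10-second clip would take roughly 1.3 seconds to generate, which is why any value below 1.0 leaves headroom for streaming.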

It is incredibly refreshing to see this level of controllable, high-fidelity audio hit the open-source ecosystem in such a lightweight package.

Try it out here!

swati paliwal

@zaczuo Have you seen folks using it yet for quick custom podcast intros or branded voiceovers in marketing?

Dmytro Klymentiev

Voice design from text prompts instead of hunting for a reference clip is the thing I didn't know I needed. "A tired middle-aged man reading terms of service" and it just... makes that? 2B parameters for this is wild. Will try it locally today.

Kiyoshi Nagahama

2B params delivering 48kHz + voice design + cloning is impressive capability density. As someone building an audio/video editing tool that relies on audio analysis for precise segment boundaries, I appreciate how much source quality matters.

Curious: how does VoxCPM2 handle multilingual switching within a single utterance, e.g. Japanese with embedded English terms?