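X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
1️⃣ One-Sentence Summary
X-Voice is a multilingual voice cloning model with only 0.4B (400 million) parameters. Using two-stage training and the International Phonetic Alphabet (IPA) as a unified representation, it can imitate an arbitrary speaker's voice without a transcript of the audio prompt and make that voice speak 30 different languages, with performance comparable to billion-parameter models.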
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of audio prompts. Architecturally, we extend F5-TTS by implementing a dual-level injection of language identifiers and decoupling and scheduling of Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching based multilingual systems like LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
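The abstract names "standard conditional flow-matching training" as the basis of Stage 1 but does not spell out the objective. The sketch below illustrates the commonly used linear-path form of that objective on a toy vector; it is not X-Voice's actual code. The function names (`cfm_training_pair`, `mse`, `euler_sample`), the toy feature vector, and the choice of a plain Euler sampler are assumptions made for illustration only. Conditioning inputs such as IPA text, language identifiers, and the audio prompt are omitted.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build one training example for the linear flow-matching path.

    x1  : target features (here a toy list of floats standing in for a mel frame)
    t   : time in [0, 1]
    rng : random.Random instance used to draw the noise sample x0

    Returns (x0, x_t, target_velocity), where
      x_t = (1 - t) * x0 + t * x1   and   target_velocity = x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x0, x_t, target_velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]  # hypothetical target features
    x0, x_t, v = cfm_training_pair(x1, 0.3, rng)
    # A perfect model would predict v exactly, giving zero loss.
    print("loss for perfect prediction:", mse(v, v))
    # Sampling with the oracle velocity carries the noise x0 to the target x1.
    print("sample:", euler_sample(lambda x, t: v, x0, steps=8))
```

In a real system the velocity would come from a neural network conditioned on text and the audio prompt, and the loss would be averaged over random `t` and many utterances; the toy oracle here only shows how the training target and the sampler fit together.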
Source: arXiv: 2605.05611