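ZONOS2 Technical Report
1️⃣ One-sentence summary
This report introduces ZONOS2 8B, a new-generation text-to-speech model. By adopting a mixture-of-experts architecture, scaling the training data to over 6 million hours, and streamlining the training recipe, it reaches state-of-the-art speech naturalness, prosody, and voice cloning fidelity while retaining low-latency streaming, and its model weights and inference code are open-sourced.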
We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture-of-experts (MoE) backbone, improving inference latency and throughput. We expand our training corpus from 200K to over 6M hours using a new data processing pipeline, and we simplify our post-training and conditioning recipes to improve naturalness and voice cloning fidelity. We evaluate ZONOS2 8B on quality, speaker similarity, WER, and ZTTS1-Eval, our novel TTS benchmark, where it performs competitively with state-of-the-art systems while maintaining low streaming latency. We release our model weights and example inference code under an Apache 2.0 license on GitHub and Hugging Face.
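The abstract lists WER (word error rate) among the evaluation metrics but does not describe how it is computed. As a rough illustration only, WER for TTS is commonly obtained by transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text. The sketch below covers just the comparison step, under the assumptions that both strings are already available and that lowercasing plus whitespace splitting is adequate normalization. The function name `word_error_rate` is hypothetical and is not taken from the ZONOS2 release.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercase + whitespace tokenization; the report's actual
    text normalization is not stated in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the transcript "the cat sit on mat" involves one substitution and one deletion over six reference words. Because insertions count as errors, the value can exceed 1.0 when the transcript is much longer than the reference.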
Source: arXiv: 2606.24320