arXiv submission date: 2026-06-23

📄 Abstract - ZONOS2 Technical Report

We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture-of-experts (MoE) backbone, improving inference latency and throughput. We expand our training corpus from 200K to over 6M hours using a new data processing pipeline, and we simplify our post-training and conditioning recipes to improve naturalness and voice cloning fidelity. We evaluate ZONOS2 8B on quality, speaker similarity, WER, and ZTTS1-Eval, our novel TTS benchmark, where it performs competitively with state-of-the-art systems while maintaining good streaming latency. We release our model weights and example inference code under an Apache 2.0 license on GitHub and Hugging Face.

Top-level tags: audio, model training, model evaluation

ZONOS2 Technical Report

1️⃣ One-sentence summary

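This paper introduces ZONOS2 8B, a new-generation text-to-speech model. By adopting a mixture-of-experts architecture, scaling the training data to over 6 million hours, and streamlining the training recipe, it reaches state-of-the-art naturalness, prosody, and voice cloning fidelity while keeping streaming latency low. The model weights and inference code are open-sourced.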

Source: arXiv: 2606.24320
