arXiv submission date: 2026-01-16
📄 Abstract - FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at this https URL and this https URL.
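
The two quantitative ideas in the abstract, the 1:2 interleaved text-audio schedule and the Real-Time Factor, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `interleave` and `real_time_factor`, the string tokens, and the choice to flush leftover audio tokens once text runs out are all assumptions made for illustration. RTF is taken in its usual sense, generation time divided by the duration of the audio produced, so a value below 1 means faster than real time.

```python
from itertools import islice


def interleave(text_tokens, audio_tokens, n_text=1, n_audio=2):
    """Merge two token streams in repeating groups of n_text text tokens
    followed by n_audio audio tokens (1:2 by default).

    Hypothetical helper: how the real model orders leftover tokens is not
    stated in the abstract, so leftovers are simply emitted in order.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    merged = []
    while True:
        text_group = list(islice(text_it, n_text))
        audio_group = list(islice(audio_it, n_audio))
        if not text_group and not audio_group:
            break  # both streams are exhausted
        merged.extend(("text", tok) for tok in text_group)
        merged.extend(("audio", tok) for tok in audio_group)
    return merged


def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    for kind, tok in interleave(["t0", "t1"], ["a0", "a1", "a2", "a3"]):
        print(kind, tok)
    # Illustrative numbers only: 4.3 s of compute for 10 s of audio.
    print(round(real_time_factor(4.3, 10.0), 2))
```

Because each text token is followed immediately by its share of audio tokens, a streaming decoder can begin emitting audio after the first group instead of waiting for the full text response, which is the property the abstract associates with low latency.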

Top-level tags: llm, audio, multi-modal
Detailed tags: spoken dialogue systems, voice cloning, real-time generation, speech tokenization, personalization

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning


1️⃣ One-sentence summary

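This paper presents Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model. It not only achieves sub-second, low-latency dialogue but also clones the user's personalized voice with high quality and preserves it across consecutive multi-turn conversations, making voice assistants sound more like a real person.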

Source: arXiv: 2601.11141