arXiv submission date: 2026-03-06
📄 Abstract - Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input

Streaming TTS that accepts streaming text input is essential for interactive systems, yet it faces two major challenges: unnatural prosody caused by missing lookahead, and long-form collapse caused by unbounded context. We propose a prosodic-boundary-aware post-training strategy that adapts a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model learns to stop early at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show that our method outperforms a CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis in particular, it reduces word error rate by 66.2 percentage points absolute (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relative, respectively, offering a robust solution for streaming TTS with incremental text.
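
To make the sliding-window inference idea concrete, here is a minimal sketch of how a bounded prompt could be carried forward between chunks. It is not the authors' implementation: the class `SlidingWindowTTS`, the `synth` callback signature, the parameters `max_prompt_chunks` and `lookahead_words`, and the assumption that incoming text is already split into chunks at prosodic boundaries are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

# Hypothetical synthesizer signature (an assumption, not a real API):
# (prompt_text, prompt_speech, chunk_text, lookahead_text) -> speech tokens for chunk_text only.
SynthFn = Callable[[List[str], List[int], List[str], List[str]], List[int]]


@dataclass
class SlidingWindowTTS:
    synth: SynthFn
    max_prompt_chunks: int = 2   # assumed window size, in chunks
    lookahead_words: int = 2     # assumed amount of future text shown to the model
    history: Deque[Tuple[List[str], List[int]]] = field(default_factory=deque)

    def step(self, chunk: List[str], lookahead: List[str]) -> List[int]:
        # Build the prompt from the retained (text, speech) pairs only,
        # so the context length stays bounded however long the stream is.
        prompt_text = [w for text, _ in self.history for w in text]
        prompt_speech = [t for _, speech in self.history for t in speech]
        speech = self.synth(
            prompt_text, prompt_speech, chunk, lookahead[: self.lookahead_words]
        )
        # Carry the new pair forward and evict the oldest pairs beyond the window.
        self.history.append((chunk, speech))
        while len(self.history) > self.max_prompt_chunks:
            self.history.popleft()
        return speech


def synthesize_stream(chunks: List[List[str]], tts: SlidingWindowTTS) -> List[int]:
    """Concatenate per-chunk speech tokens; the next chunk serves as lookahead."""
    out: List[int] = []
    for i, chunk in enumerate(chunks):
        lookahead = chunks[i + 1] if i + 1 < len(chunks) else []
        out.extend(tts.step(chunk, lookahead))
    return out


def toy_synth(prompt_text, prompt_speech, chunk, lookahead):
    # Stand-in for a real model: one fake "speech token" per word.
    return [len(word) for word in chunk]
```

In a real system, `synth` would wrap the post-trained TTS model, which according to the abstract learns to stop at the specified content boundary by itself; the toy function above only stands in for that behavior so the windowing logic can be exercised.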

Top-level tags: llm, audio, systems
Detailed tags: text-to-speech, streaming generation, prosody modeling, incremental text, speech synthesis

Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input


1️⃣ One-sentence summary

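This paper proposes a new training strategy that lets an LLM-based speech synthesis system, while receiving continuously arriving text, predict suitable prosodic boundaries and stop at them. This addresses two major problems, unnatural intonation caused by not seeing the upcoming text and collapse on long-text synthesis, and significantly improves the quality and stability of streaming speech synthesis.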

Source: arXiv: 2603.06444