RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
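1️⃣ One-Sentence Summary
This paper proposes RelayS2S, a hybrid architecture that runs a fast path (an end-to-end speech model) and a slow path (a cascaded speech recognition and large language model pipeline) in parallel, balancing low latency against high response quality in real-time spoken dialogue so that responses are both fast and good.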
Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path -- a duplex S2S model -- speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path -- a cascaded ASR -> LLM pipeline -- generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the prefix when appropriate or falling back gracefully to the slow path alone. Experiments show that RelayS2S achieves P90 onset latency comparable to the S2S model while retaining 99% cascaded response quality in average score, with benefits growing as the slow-path model scales. Because the prefix handoff requires no architectural modification to either component, RelayS2S serves as a lightweight, drop-in addition to existing cascaded pipelines. Our code and data are publicly available at: this https URL
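The handoff described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `relay_turn`, the callables passed to it, the scalar verifier score, and the `threshold` value are all hypothetical names and assumptions introduced here for illustration. The sketch only shows the control flow: both paths start at turn detection, the verifier decides whether the drafted prefix is committed, and the slow path either continues from that prefix or answers alone.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List


@dataclass
class RelayResult:
    text: str               # full response text sent downstream
    prefix_committed: bool  # True if the fast-path draft was used


async def relay_turn(
    audio: bytes,
    draft_prefix: Callable[[bytes], Awaitable[str]],      # fast path stub (assumed S2S draft)
    transcribe: Callable[[bytes], Awaitable[str]],        # slow path stub, ASR stage
    verify: Callable[[str], float],                       # assumed verifier: score in [0, 1]
    continue_from: Callable[[str, str], Awaitable[str]],  # LLM stub: (transcript, prefix) -> text
    emit: Callable[[str], None],                          # sink standing in for streaming to TTS
    threshold: float = 0.5,                               # assumed gate threshold
) -> RelayResult:
    chunks: List[str] = []

    def send(piece: str) -> None:
        chunks.append(piece)
        emit(piece)

    # Start the slow path's ASR stage in the background so that it
    # overlaps with the fast path's drafting.
    asr_task = asyncio.create_task(transcribe(audio))
    prefix = await draft_prefix(audio)

    # The verifier gates the handoff: commit the prefix, or discard it
    # and fall back to the slow path alone.
    committed = verify(prefix) >= threshold
    if committed:
        send(prefix)  # early audio onset: the prefix goes out before the LLM finishes
    else:
        prefix = ""

    transcript = await asr_task
    # The slow path conditions on the committed prefix (empty on fallback).
    send(await continue_from(transcript, prefix))
    return RelayResult(text=" ".join(chunks), prefix_committed=committed)
```

In this sketch the prefix is emitted as soon as the verifier accepts it, so onset latency tracks the fast path, while everything after the prefix comes from the slow path. Monitoring of live audio events (interruption, backchanneling) is omitted.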
Source: arXiv:2603.23346