Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
1️⃣ One-sentence summary
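This paper proposes a reward model named SDiaReward and a companion benchmark, ESDR-Bench, designed to evaluate spoken dialogue systems on paralinguistic features such as intonation and emotion, as well as on natural colloquial expression, helping systems generate speech that sounds more like real human conversation.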
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. SDiaReward operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are publicly available.
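To make "pairwise preference supervision" and "pairwise preference accuracy" concrete, here is a minimal sketch in Python. It assumes a Bradley-Terry style objective, which is a common choice for reward models; the abstract does not state the exact loss SDiaReward uses. The function names and the scalar scores are hypothetical stand-ins for the rewards a model would assign to the preferred and dispreferred speech episodes of each pair.

```python
import math
from typing import Sequence


def pairwise_preference_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)).

    Assumption: this is a generic reward-model objective, not necessarily
    the one used to train SDiaReward.
    """
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen, rejected):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) = log(1 + exp(-m)), computed in an overflow-safe form
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen)


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs in which the preferred episode gets the strictly higher score."""
    if len(chosen) != len(rejected) or len(chosen) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    correct = sum(1 for r_c, r_r in zip(chosen, rejected) if r_c > r_r)
    return correct / len(chosen)


if __name__ == "__main__":
    # Hypothetical scores for three preference pairs (illustrative only).
    preferred = [2.0, 0.0, 1.0]
    dispreferred = [1.0, 1.0, 0.0]
    print("loss:", pairwise_preference_loss(preferred, dispreferred))
    print("accuracy:", pairwise_accuracy(preferred, dispreferred))
```

Under this objective the loss shrinks as the preferred episode's score exceeds the dispreferred one by a wider margin, while pairwise accuracy only checks which of the two scores is higher, which is why it suits comparing evaluators whose score scales differ.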
Source: arXiv: 2603.14889