arXiv submission date: 2026-01-09
📄 Abstract - On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
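For concreteness, the "global token perplexity" criticized here is, per the abstract, the standard text perplexity formulation applied directly to discrete speech tokens. The sketch below illustrates that computation; the function name and log-probability values are hypothetical and not taken from the paper.

```python
# A minimal sketch (not from the paper) of "global token perplexity":
# the standard text perplexity formulation applied directly to a
# sequence of discrete speech tokens. All values are illustrative.
import math

def global_token_perplexity(token_logprobs: list[float]) -> float:
    """Exp of the mean negative log-likelihood over every token in the sequence."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probabilities a spoken LM might assign
# to a tokenized speech continuation.
logprobs = [-2.1, -0.7, -1.5, -3.0, -0.9]
print(global_token_perplexity(logprobs))  # ≈ 5.16
```

Because every speech token contributes equally to this single average, including tokens that mostly encode acoustic or prosodic detail rather than content, the resulting score can diverge from perceived generation quality, which is the mismatch the paper targets with its alternative likelihood- and generation-based evaluations.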

Top-level tags: audio, model evaluation, natural language processing
Detailed tags: spoken language models, perplexity, evaluation metrics, speech generation, human evaluation

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation


1️⃣ One-Sentence Summary

This paper argues that directly applying the text-model evaluation metric of global token perplexity to spoken language generation models is misleading. It proposes a set of new evaluation methods that better reflect the true quality of the generated speech and, under these metrics, substantially narrow the measured performance gap between the best model and the human topline.

Source: arXiv:2601.06329