Abstract - All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
1️⃣ One-Sentence Summary
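The paper argues that the high scores current large audio-language models achieve on many benchmarks often do not come from genuinely understanding the audio signal: the answers can frequently be guessed from text or common knowledge alone, and only a very small share of questions truly requires the complete audio. Existing benchmarks are therefore unreliable, and the authors propose a more rigorous evaluation approach.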
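
To make the "60-72% of full audio scores" figure from the abstract concrete, here is a minimal Python sketch of how such a retention ratio could be computed. It assumes retention means no-audio accuracy divided by full-audio accuracy. The abstract does not spell out the exact formula, so this is an illustrative reading, not the authors' implementation. The function names and the toy predictions are hypothetical.

```python
def accuracy(predictions, answers):
    """Fraction of items whose prediction matches the reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def score_retention(full_audio_acc, no_audio_acc):
    """Share of the full-audio score that survives when audio is withheld.

    Assumption: retention = no-audio accuracy / full-audio accuracy.
    """
    if full_audio_acc <= 0:
        raise ValueError("full-audio accuracy must be positive")
    return no_audio_acc / full_audio_acc


# Toy multiple-choice data (hypothetical, not taken from the paper).
answers = ["A", "B", "C", "D", "A"]
with_audio = ["A", "B", "C", "D", "B"]     # model outputs given the audio
without_audio = ["A", "B", "C", "A", "B"]  # same model, audio withheld

full_acc = accuracy(with_audio, answers)
text_only_acc = accuracy(without_audio, answers)
retention = score_retention(full_acc, text_only_acc)
print(f"full={full_acc:.2f} text-only={text_only_acc:.2f} retention={retention:.2f}")
```

A high retention value on a benchmark suggests that many of its items can be answered from the question text and general knowledge alone, which is the text-prior concern the abstract raises.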