arXiv submission date: 2026-03-02
📄 Abstract - Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
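The abstract's "cross-layer CKA analysis" refers to Centered Kernel Alignment, a standard similarity measure between two sets of layer representations. As a hedged illustration (not the paper's actual pipeline), the linear form of CKA can be sketched like this, assuming two row-aligned sample matrices of per-token representations:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representation matrices X (n, d1) and Y (n, d2),
    where rows are aligned samples (e.g. matched speech/text tokens)."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(numerator / denominator)
```

A cross-layer analysis would evaluate `linear_cka` for every pair of (speech layer, text layer) and inspect the resulting similarity matrix; a "broad alignment band" means many speech layers score highly against the same text layer. CKA is invariant to orthogonal transforms and isotropic scaling, which is why it is preferred over raw cosine similarity for comparing layers of different width.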

Top tags: llm natural language processing audio
Detailed tags: modality gap speech representation cross-layer analysis speech-language models representation alignment

Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs


1️⃣ One-sentence summary

This paper finds that the key reason speech LLMs underperform text-only inference is not a simple difference in feature distributions, but rather the model's difficulty in efficiently condensing the redundant, temporally dispersed semantic information in speech into stable late-layer decisions.

Source: arXiv 2603.01502