arXiv submission date: 2026-05-04
📄 Abstract - When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
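The summary itself contains no code, so the following is a minimal sketch of the matched-comparison prompting setup the abstract describes: the same utterances are transcribed with and without clinical context, and transcripts are scored with word error rate (here via the real `jiwer` library). The model interface (`transcribe`), the prompt wording, and the data fields are illustrative assumptions, not the paper's actual protocol.

```python
# Sketch of a matched prompting comparison for dysarthric-speech ASR.
# `model.transcribe` is a hypothetical audio-language-model API;
# jiwer is a real library for computing word error rate.
import jiwer

def build_prompt(context=None):
    """Compose the text prompt, optionally prepending clinical context."""
    base = "Transcribe the following speech exactly as spoken."
    if context is None:
        return base
    # Diagnosis label plus clinician-derived severity, mirroring the
    # benchmark's progressively richer context conditions (wording invented).
    details = (f"The speaker has {context['diagnosis']} "
               f"with {context['severity']} dysarthria.")
    return f"{details} {base}"

def evaluate(model, samples, context_key=None):
    """Return corpus-level WER for one prompting condition."""
    refs, hyps = [], []
    for sample in samples:
        ctx = sample.get(context_key) if context_key else None
        hyps.append(model.transcribe(sample["audio"], prompt=build_prompt(ctx)))
        refs.append(sample["reference"])
    return jiwer.wer(refs, hyps)

# Matched comparison: identical audio, prompts differ only in context.
# wer_plain = evaluate(model, samples)
# wer_clinical = evaluate(model, samples, context_key="clinical_context")
# relative_change = (wer_plain - wer_clinical) / wer_plain
```

For scale, the reported fine-tuned WER of 0.066 at a 52% relative reduction implies a frozen-baseline WER of roughly 0.066 / (1 - 0.52) ≈ 0.14.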

Top-level tags: audio llm multi-modal
Detailed tags: dysarthric speech benchmark clinical context automatic speech recognition lora adaptation

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition


1️⃣ One-Sentence Summary

This paper finds that current state-of-the-art audio-language models cannot effectively use additional multimodal context, such as diagnosis labels and clinician-derived speech ratings, to improve recognition accuracy on dysarthric speech; however, LoRA fine-tuning reduces word error rate by 52% relative, with especially significant gains for speakers with Down syndrome and mild-severity impairment.
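The fine-tuning recipe is only named here (LoRA over a mixture of clinical prompt formats), so the sketch below shows a generic LoRA setup using Hugging Face's `peft` library; the rank, target modules, and base model are placeholder assumptions, not the paper's configuration.

```python
# Generic LoRA adaptation sketch using the Hugging Face peft library.
# Hyperparameters and target modules are illustrative, not the paper's.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # low-rank update dimension (assumed)
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# base_model: a frozen audio-language model loaded elsewhere. Training would
# interleave prompts with and without clinical context (the "mixture of
# clinical prompt formats"), which is what lets the adapted model keep its
# performance when context is unavailable at inference time.
# peft_model = get_peft_model(base_model, lora_config)
# peft_model.print_trainable_parameters()
```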

Source: arXiv:2605.02782