Abstract - KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs
Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.
KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs
1️⃣ One-sentence summary
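To address the heavy English focus of current SpeechLM evaluation, this paper proposes two frameworks for building high-quality Korean benchmarks, one transferring source-language SpokenQA benchmarks into Korean and the other converting Korean ASR corpora into audio understanding benchmarks, and uses them to create three Korean evaluation sets (KVoiceBench, KOpenAudioBench, and KMMAU) totaling 12,345 samples; experiments show that English-Korean performance gaps vary widely across models and that SpokenQA and audio understanding rankings diverge, exposing complementary weaknesses that English-only evaluation cannot reveal.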