ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

📄 Abstract - ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPairAudioBench, a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions: Style, Rate, Emphasis, Age, and Gender. Our experiments show that current LALM judges still lag behind human judgments by 32 percentage points on average and exhibit severe calibration failures, particularly in Tie cases where the correct decision is to abstain. To further analyze lexical versus acoustic reliance, the benchmark includes both same-transcript and cross-transcript conditions. ParaPairAudioBench enables multi-dimensional, calibration-aware assessment of the reliability of LALM-as-a-Judge for paralinguistic speech evaluation.


1️⃣ One-Sentence Summary

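This paper introduces ParaPairAudioBench, a benchmark of 5,175 audio pairs designed to test how well Large Audio-Language Models (LALMs) judge five paralinguistic attributes: speaking style, speech rate, emphasis, age, and gender. The results show that current LALM judges trail human judgments by 32 percentage points on average, and that they frequently pick a winner in Tie cases where the correct response is to abstain.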
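As a rough illustration of what calibration-aware scoring of a pairwise judge could look like, the sketch below computes per-dimension accuracy and reports Tie and non-Tie pairs separately. The record layout (`dimension`, `gold`, `pred`), the label set `A`/`B`/`Tie`, and the function name `score_judge` are assumptions made for this example; they are not the benchmark's actual schema or the authors' metric.

```python
from collections import defaultdict

# Assumed label set for a pairwise comparison with an abstain option.
LABELS = {"A", "B", "Tie"}


def _ratio(cell):
    """Return correct/total, or None when the bucket is empty."""
    correct, total = cell
    return correct / total if total else None


def score_judge(records):
    """Per-dimension accuracy, split by whether the gold label is Tie.

    Each record is a dict with hypothetical keys:
    'dimension' (e.g. "Rate"), 'gold' and 'pred' (each in LABELS).
    """
    # counts[dimension][bucket] = [num_correct, num_total]
    counts = defaultdict(lambda: {"tie": [0, 0], "non_tie": [0, 0]})
    for r in records:
        if r["gold"] not in LABELS or r["pred"] not in LABELS:
            raise ValueError(f"unexpected label in {r}")
        bucket = "tie" if r["gold"] == "Tie" else "non_tie"
        cell = counts[r["dimension"]][bucket]
        cell[0] += int(r["pred"] == r["gold"])
        cell[1] += 1

    report = {}
    for dim, buckets in counts.items():
        correct = buckets["tie"][0] + buckets["non_tie"][0]
        total = buckets["tie"][1] + buckets["non_tie"][1]
        report[dim] = {
            "accuracy": correct / total,
            "tie_accuracy": _ratio(buckets["tie"]),
            "non_tie_accuracy": _ratio(buckets["non_tie"]),
        }
    return report


# Toy data, invented purely to exercise the function.
records = [
    {"dimension": "Rate", "gold": "A", "pred": "A"},
    {"dimension": "Rate", "gold": "Tie", "pred": "B"},
    {"dimension": "Rate", "gold": "B", "pred": "B"},
    {"dimension": "Emphasis", "gold": "Tie", "pred": "Tie"},
]
report = score_judge(records)
```

Reporting Tie accuracy on its own makes the failure mode from the abstract visible: a judge that never abstains can still score well on non-Tie pairs while getting every Tie pair wrong.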
