arXiv submission date: 2026-03-24
📄 Abstract - Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we 1) use a simple prompt ensemble and 2) propose a novel technique called prompt amplification, which repeats audio and text queries to uncover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

Top-level tags: audio, multi-modal, model evaluation
Detailed tags: speech emotion recognition, zero-shot learning, late fusion, prompt engineering, audio-language models

Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition


1️⃣ One-Sentence Summary

This paper proposes a new method called ZS-Fuse, which performs late fusion of zero-shot emotion predictions from a general-purpose audio-language model with the outputs of specialist speech models. Combined with prompt ensembling and a novel prompt amplification technique, it improves speech emotion recognition accuracy and surpasses existing state-of-the-art models on multiple datasets.
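The core fusion idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the equal-weight prompt ensemble, and the `alpha`-weighted average are all assumptions, since the abstract does not specify the fusion rule.

```python
import numpy as np

def prompt_ensemble(per_prompt_probs):
    """Average the ALM's zero-shot emotion probabilities across several
    prompt phrasings (a simple equal-weight ensemble; an assumption)."""
    return np.mean(np.asarray(per_prompt_probs), axis=0)

def zs_fuse(alm_probs, fm_probs, alpha=0.5):
    """Late fusion: weighted average of the ALM's zero-shot probabilities
    and a specialist FM's probabilities, renormalized to sum to 1.
    `alpha` is a hypothetical mixing weight, not from the paper."""
    fused = alpha * np.asarray(alm_probs) + (1 - alpha) * np.asarray(fm_probs)
    return fused / fused.sum()

# Example over three emotion classes (toy numbers):
alm = prompt_ensemble([[0.7, 0.1, 0.2],   # "this person sounds ..."
                       [0.5, 0.3, 0.2]])  # "the emotion in this clip is ..."
fm = [0.2, 0.6, 0.2]                      # specialist FM (e.g. WavLM-based)
pred = zs_fuse(alm, fm, alpha=0.5)
```

Late fusion at the probability level keeps the two models fully decoupled, so any dual-encoder ALM can be paired with any specialist FM without retraining either.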

Source: arXiv:2603.23057