arXiv submission date: 2026-03-24
📄 Abstract - Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we 1) use a simple prompt ensemble and 2) propose a novel technique called prompt amplification, which repeats audio and text queries to uncover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three speech emotion recognition datasets.

Top-level tags: audio, multi-modal, model evaluation
Detailed tags: speech emotion recognition, zero-shot learning, late fusion, prompt engineering, audio-language models

Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition


1️⃣ One-Sentence Summary

This paper proposes a new method called ZS-Fuse, which performs late fusion of zero-shot emotion predictions from a general-purpose audio-language model with the outputs of specialist speech models. Combined with prompt ensembling and a novel prompt amplification technique, it improves speech emotion recognition accuracy and surpasses existing state-of-the-art models on multiple datasets.
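The core fusion idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the equal-weight prompt ensemble, and the `alpha`-weighted average are all assumptions, since the abstract does not specify the fusion rule.

```python
import numpy as np

def prompt_ensemble(per_prompt_probs):
    """Average the ALM's zero-shot emotion probabilities across several
    prompt phrasings (a simple equal-weight ensemble; an assumption)."""
    return np.mean(np.asarray(per_prompt_probs), axis=0)

def zs_fuse(alm_probs, fm_probs, alpha=0.5):
    """Late fusion: weighted average of the ALM's zero-shot probabilities
    and a specialist FM's probabilities, renormalized to sum to 1.
    `alpha` is a hypothetical mixing weight, not from the paper."""
    fused = alpha * np.asarray(alm_probs) + (1 - alpha) * np.asarray(fm_probs)
    return fused / fused.sum()

# Example over three emotion classes (toy numbers):
alm = prompt_ensemble([[0.7, 0.1, 0.2],   # "this person sounds ..."
                       [0.5, 0.3, 0.2]])  # "the emotion in this clip is ..."
fm = [0.2, 0.6, 0.2]                      # specialist FM (e.g. WavLM-based)
pred = zs_fuse(alm, fm, alpha=0.5)
```

Late fusion at the probability level keeps the two models fully decoupled, so any dual-encoder ALM can be paired with any specialist FM without retraining either.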

Source: arXiv:2603.23057