arXiv submission date: 2026-02-05
📄 Abstract - Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
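To make the pooling idea concrete, below is a minimal PyTorch sketch of a multi-head QKV-style attentive pooling layer that collapses a sequence of Whisper encoder frames into a single utterance-level embedding for emotion classification. This is an illustrative reconstruction, not the authors' implementation: the class name, the single learnable query, the head count, and the classifier head are all assumptions made for the example.

```python
# Minimal sketch (not the paper's code): multi-head QKV attentive pooling.
# A learnable query attends over the Whisper encoder frames and collapses
# the (T, D) sequence into one utterance-level embedding for SER.
import torch
import torch.nn as nn

class MultiHeadQKVPooling(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # One learnable query vector; keys and values come from the encoder frames.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, dim) hidden states from a Whisper encoder layer.
        batch = frames.size(0)
        q = self.query.expand(batch, -1, -1)        # (batch, 1, dim)
        pooled, _ = self.attn(q, frames, frames)    # (batch, 1, dim)
        return self.classifier(pooled.squeeze(1))   # (batch, num_classes)

# Example with Whisper Tiny's encoder width (384) and 4 emotion classes.
pooling = MultiHeadQKVPooling(dim=384, num_heads=4, num_classes=4)
logits = pooling(torch.randn(2, 1500, 384))  # 1500 frames corresponds to 30 s of audio
print(logits.shape)                          # torch.Size([2, 4])
```

The key design point, as described in the abstract, is that attention replaces plain average pooling, so frames carrying emotional cues can be weighted more heavily before the time dimension is reduced.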

Top-level tags: audio, natural language processing, model evaluation
Detailed tags: speech emotion recognition, attention pooling, whisper model, multilingual representation learning

Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods


1️⃣ One-Sentence Summary

This paper proposes extracting speech emotion features with OpenAI's Whisper speech recognition model and combining them with two novel attention-based pooling methods, achieving efficient, high-performing emotion recognition on English and Persian datasets and offering a new approach to lightweight speech emotion analysis.
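The abstract also reports that intermediate Whisper encoder layers often work better for SER than the final layer. The sketch below shows one way to pull hidden states from a chosen encoder layer using the Hugging Face transformers library; the paper does not specify its extraction pipeline, so the library choice, the layer index, and the placeholder audio here are assumptions for illustration only.

```python
# Minimal sketch (assumes Hugging Face transformers; the paper's pipeline may differ):
# extract hidden states from an intermediate Whisper encoder layer as fixed
# features for a downstream SER classifier such as the pooling module above.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").encoder.eval()

waveform = torch.zeros(16000 * 3)  # placeholder: 3 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = encoder(inputs.input_features, output_hidden_states=True)

# hidden_states[0] is the embedding output; encoder layers 1..N follow.
intermediate = out.hidden_states[2]  # an intermediate layer of the 4-layer Tiny encoder
print(intermediate.shape)            # torch.Size([1, 1500, 384])
```

Freezing the Whisper encoder and training only a small pooling head on top of one such layer is what makes the approach lightweight compared to fine-tuning much larger models such as HuBERT X-Large.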

Source: arXiv 2602.06000