arXiv submission date: 2026-04-28
📄 Abstract - ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

To build empathy between humans and machines, machines must fully understand how human emotions change. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, meaning different people express the same emotion differently. We see this in daily life: when communicating, some people express "happiness" through their facial expressions and words, while others hide it or convey it through their actions. Both are expressions of "happiness," yet such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition stays at a "static" level, using a single recognition model to cover all expressive styles. This simplification often degrades recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker-Adaptive Network (ML-SAN), which directly tackles the confusion caused by speaker identity information. Rather than simply attaching a speaker ID after recognition, ML-SAN employs a three-stage adaptive process. First, input-level calibration uses Feature-wise Linear Modulation (FiLM) to map the raw audio and visual features into a speaker-neutral space. Then, interaction-level gating re-weights the trust placed in each modality (e.g., voice or facial features) based on the speaker's identity. Finally, output-level regularization keeps speaker features consistent in the latent space. Experiments on the MELD and IEMOCAP datasets show that ML-SAN achieves better results, performs especially well on challenging tail emotion categories, and better handles the diversity of speakers in real-world scenarios.
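The input-level calibration step can be illustrated with a minimal FiLM sketch: a speaker embedding predicts a per-channel scale (gamma) and shift (beta) that are applied to the raw modality features. All names, shapes, and the linear projections here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_film(feats, spk_emb, W_gamma, W_beta):
    """FiLM-style calibration: modulate each feature channel with a
    scale and shift predicted from the speaker embedding."""
    gamma = spk_emb @ W_gamma          # (feat_dim,) per-channel scale
    beta = spk_emb @ W_beta            # (feat_dim,) per-channel shift
    return gamma * feats + beta        # broadcasts over time steps

spk_dim, feat_dim, T = 8, 16, 5
feats = rng.standard_normal((T, feat_dim))      # e.g. T audio frames
spk = rng.standard_normal(spk_dim)              # speaker embedding
W_g = rng.standard_normal((spk_dim, feat_dim))
W_b = rng.standard_normal((spk_dim, feat_dim))
calibrated = speaker_film(feats, spk, W_g, W_b)
print(calibrated.shape)  # (5, 16)
```

In the paper's framing, the goal of this modulation is to map features from different speakers toward a speaker-neutral space before fusion; the sketch only shows the mechanics of conditioning features on a speaker vector.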

Top-level tag: multi-modal emotion recognition
Detailed tags: speaker adaptation, conversation, multimodal fusion, feature modulation, emotion recognition

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations


1️⃣ One-sentence summary

This paper proposes ML-SAN, a novel network that uses three steps (input-level calibration, interaction-level gating, and output-level regularization) to let a machine dynamically adapt its emotion recognition to each person's expressive habits (for example, one person conveying an emotion through facial expressions while another conveys the same emotion through actions), yielding more accurate recognition in multi-turn dialogues, especially for rare emotion categories.
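Of the three steps summarized above, the interaction-level gating can be sketched as a speaker-conditioned weighting over modalities: the speaker embedding produces one weight per modality, deciding how much each is trusted before fusion. The softmax gate, shapes, and names are illustrative assumptions rather than the authors' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(spk_emb, W_gate, audio, visual, text):
    """Speaker-conditioned gating: one trust weight per modality,
    then a weighted sum of the modality features."""
    w = softmax(spk_emb @ W_gate)      # (3,), non-negative, sums to 1
    return w, w[0] * audio + w[1] * visual + w[2] * text

spk_dim, feat_dim = 8, 16
spk = rng.standard_normal(spk_dim)               # speaker embedding
W_gate = rng.standard_normal((spk_dim, 3))       # one column per modality
audio, visual, text = (rng.standard_normal(feat_dim) for _ in range(3))
weights, fused = gated_fusion(spk, W_gate, audio, visual, text)
```

The idea this mirrors is that a speaker who emotes mainly through facial expressions should push more weight onto the visual channel, while a speaker who emotes vocally should push it onto audio.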

Source: arXiv 2604.25383