LLM-based Multimodal Personality Recognition via Facial Action Unit-Text Semantic Fusion
1️⃣ One-sentence summary
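The paper proposes a new method that combines facial action units (e.g., smiling, frowning) with interviewees' textual answers, using a large language model to map both into a unified semantic representation. This allows a more accurate assessment of interviewees' personality traits and addresses two problems: the limited information available from a single modality, and the tendency of conventional video analysis to overlook subtle changes in facial expression.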
Personality recognition in asynchronous video interviews (AVIs) has become increasingly important due to their widespread adoption in modern recruitment. Existing approaches often rely on large language models (LLMs) to analyze interviewees' textual responses in AVIs. However, unimodal methods often suffer from information loss (e.g., they ignore facial cues). Multimodal methods that employ full-face images or sparsely sampled frames, in turn, can discard fine-grained temporal dynamics critical for accurate personality assessment. To overcome these limitations, we propose an LLM-based framework that semantically fuses facial action units (AUs) with textual responses from AVIs. AU sequences are first converted into interpretable textual descriptions, which are then fused with participants' textual responses through an LLM. A lightweight regression head transforms the resulting embeddings into continuous personality scores without disrupting the underlying semantic space. Experiments on the AVI-6 benchmark demonstrate consistent improvements over most baselines, with lower prediction errors and stronger correlations with human-rated scores across multiple traits. Further analysis reveals that AU-derived semantic representations offer non-verbal cues that complement the textual responses. Decoupling semantic understanding from regression prediction within the LLM also leads to greater training stability and clearer interpretability. Overall, these findings demonstrate that AU-text fusion provides a psychologically grounded and computationally efficient framework for personality recognition in AVIs.
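As a rough illustration of the pipeline the abstract describes (AU sequences turned into text, fused with the spoken answer, then scored by a lightweight head), the Python sketch below may help. The abstract does not specify the description template, the activation threshold, the frame rate, the prompt layout, or the head architecture, so `describe_aus`, `build_prompt`, `linear_head`, the threshold of 1.0, and the 10 fps default are all hypothetical choices made for illustration. Only the AU labels follow standard FACS naming. The LLM call that would produce the embedding is left out.

```python
from typing import Dict, List, Sequence, Tuple

# Standard FACS labels for a few action units.
AU_NAMES: Dict[str, str] = {
    "AU01": "inner brow raiser",
    "AU04": "brow lowerer",
    "AU06": "cheek raiser",
    "AU12": "lip corner puller",
}


def active_segments(intensities: Sequence[float],
                    threshold: float) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where intensity >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(intensities):
        if value >= threshold and start is None:
            start = i                      # segment opens
        elif value < threshold and start is not None:
            segments.append((start, i))    # segment closes
            start = None
    if start is not None:                  # still active at the last frame
        segments.append((start, len(intensities)))
    return segments


def describe_aus(au_series: Dict[str, Sequence[float]],
                 fps: float = 10.0,
                 threshold: float = 1.0) -> str:
    """Turn per-frame AU intensities into a short textual description.

    The sentence template is an assumption, not the paper's wording.
    """
    lines: List[str] = []
    for au, series in sorted(au_series.items()):
        name = AU_NAMES.get(au, au)
        for start, end in active_segments(series, threshold):
            lines.append(
                f"{au} ({name}) active from {start / fps:.1f}s to {end / fps:.1f}s"
            )
    return "\n".join(lines) if lines else "No action units above threshold."


def build_prompt(au_text: str, transcript: str) -> str:
    """Fuse the AU description and the answer into one LLM input (assumed layout)."""
    return f"Facial behavior:\n{au_text}\n\nInterview answer:\n{transcript}"


def linear_head(embedding: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """One linear layer mapping an embedding to one score per trait (assumed head)."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
```

In use, the string returned by `build_prompt(describe_aus(...), transcript)` would be passed to the LLM, and the resulting embedding would go through `linear_head`. Keeping the head as a separate small layer mirrors the abstract's point about decoupling semantic understanding from regression.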
Source: arXiv: 2606.29900