Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
1️⃣ One-Sentence Summary
This work proposes a new method for integrating human eye-gaze information into Vision Language Models: by training the model to learn and mimic human attention patterns, it significantly improves the model's ability to predict future behavior and describe action details from an egocentric (first-person) viewpoint.
Eye gaze, encompassing fixations and saccades, provides critical insight into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze, our approach incorporates gaze information directly into the VLM architecture during training. Gaze-based queries let the model dynamically focus on gaze-highlighted regions, while a gaze-regularization mechanism aligns model attention with human attention patterns. To understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13% improvement in semantic scores over baseline models that do not leverage gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future-event prediction.
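The abstract describes a gaze-regularization mechanism that pulls the model's attention toward human attention patterns. The paper does not spell out the loss here, but a common way to realize such a constraint is a KL-divergence term between the model's attention map and a human gaze heatmap over image patches. The sketch below is a minimal, hypothetical instantiation of that idea, not the paper's actual implementation (function name, inputs, and choice of KL are assumptions):

```python
import numpy as np

def gaze_regularization_loss(model_attn, gaze_heatmap, eps=1e-8):
    """Hypothetical gaze-regularization term: KL divergence that
    penalizes the model's attention map for deviating from the human
    gaze heatmap. Both inputs are non-negative 2-D arrays over image
    patches; each is normalized into a probability distribution first."""
    p = gaze_heatmap / (gaze_heatmap.sum() + eps)  # target: human gaze
    q = model_attn / (model_attn.sum() + eps)      # model attention
    # KL(p || q); eps guards against log(0) and division by zero.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In training, a term like this would be added to the captioning/prediction loss with a weighting coefficient, so the model is rewarded both for correct language output and for attending where humans look.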
Source: arXiv: 2603.23190