Abstract - SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze
Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about the surrounding traffic that complement the face images and can improve the gaze estimation model. We propose SGAP-Gaze, a Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the features extracted from the facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism that fuses face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on the LBW dataset, a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.
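The core idea described above, attending over a spatial scene grid with a gaze intent vector derived from facial features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, feature dimensions, and the use of a single scaled dot-product attention head (rather than a full Transformer block) are simplifying assumptions for illustration only.

```python
import numpy as np

def scene_grid_attention_pog(gaze_intent, scene_grid_feats, grid_centers):
    """Hypothetical sketch of the scene-grid attention step.

    gaze_intent:      (d,)   fused face/eye/iris feature vector (the query)
    scene_grid_feats: (G, d) one feature vector per scene-grid cell (the keys)
    grid_centers:     (G, 2) pixel coordinates of each grid-cell centre
    Returns an estimated Point-of-Gaze (x, y) in pixels.
    """
    d = gaze_intent.shape[0]
    # Scaled dot-product attention of the gaze intent over the scene grid.
    scores = scene_grid_feats @ gaze_intent / np.sqrt(d)   # (G,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # softmax over cells
    # PoG as the attention-weighted average of grid-cell centres.
    return weights @ grid_centers                          # (2,)

# Toy example: a 3x3 grid over a 300x300 scene image.
rng = np.random.default_rng(0)
xs, ys = np.meshgrid([50.0, 150.0, 250.0], [50.0, 150.0, 250.0])
centers = np.stack([xs.ravel(), ys.ravel()], axis=1)       # (9, 2)
feats = rng.standard_normal((9, 16))                       # scene features
intent = rng.standard_normal(16)                           # gaze intent vector
pog = scene_grid_attention_pog(intent, feats, centers)
print(pog.shape)                                           # (2,)
```

Because the output is a convex combination of grid-cell centres, the predicted PoG always lies inside the scene image, which is one practical appeal of formulating PoG estimation as attention over a spatial grid.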
SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze
1️⃣ One-sentence summary
This paper proposes SGAP-Gaze, a novel Point-of-Gaze estimation model that combines driver facial information with images of the surrounding traffic scene. By introducing a Transformer-based scene-grid attention mechanism that fuses visual cues with scene context, it significantly improves PoG prediction accuracy in real-world driving environments, with notably better performance even in the outer regions of the scene.