Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

📄 Abstract - Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.
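
To make the closed-ended VQA formulation concrete, the following is a minimal sketch of how a prompt with optional contextual cues might be assembled and how a model reply might be mapped back to a label. It is not the authors' implementation: the prompt wording, the option letters, the `ClipContext` fields, and the helper names `build_prompt`, `parse_answer`, and `accuracy` are all assumptions made for illustration, and no actual VLM is called.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical answer options; the abstract does not give the actual prompt wording.
OPTIONS = {"A": "crossing", "B": "not crossing"}


@dataclass
class ClipContext:
    """Optional contextual cues rendered as text (field names are illustrative)."""
    ego_motion: Optional[str] = None
    vehicle_motion: Optional[str] = None
    eye_gaze: Optional[str] = None


def build_prompt(context: ClipContext) -> str:
    """Assemble a closed-ended question, adding only the cues that are present."""
    lines = ["You are shown a short egocentric video clip."]
    cues = [
        ("Ego motion", context.ego_motion),
        ("Vehicle motion", context.vehicle_motion),
        ("Eye gaze", context.eye_gaze),
    ]
    for name, value in cues:
        if value is not None:
            lines.append(f"{name}: {value}")
    lines.append("Question: Does the pedestrian intend to cross the road?")
    for letter, label in OPTIONS.items():
        lines.append(f"{letter}. {label}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_answer(reply: str) -> Optional[str]:
    """Map a model reply to an option label, or None if it cannot be parsed."""
    text = reply.strip().upper()
    for letter, label in OPTIONS.items():
        if text == letter or text.startswith((letter + ".", letter + ")", letter + " ")):
            return label
    return None


def accuracy(predictions: Sequence[Optional[str]], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In this sketch an unparseable reply returns `None` and therefore counts as an error in `accuracy`, which is one simple way to keep the evaluation closed-ended; other conventions are possible.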


1️⃣ One-Sentence Summary

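This study uses vision language models to analyze short egocentric video clips, casting the task as question answering to predict whether a pedestrian will cross the road. It finds that fine-tuned models are more accurate than both zero-shot models and a specialized transformer baseline, and that adding contextual cues such as ego motion, vehicle motion, and eye gaze raises the accuracy gain over that baseline to 14.5%.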
