arXiv submission date: 2026-06-02
📄 Abstract - VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount: physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context. We introduce a dataset pairing egocentric frames with goal-conditioned safety annotations, which enables a goal-conditioned safety Q-filter, trained via GRPO, that evaluates actions with respect to inferred intent without retraining. In addition, we propose an intent-action prediction agent that jointly infers goals and predicts future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame than the baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at this https URL.

Top-level tags: agents, computer vision, multi-modal
Detailed tags: safety monitoring, egocentric video, intent prediction, goal-conditioned Q-filter, GRPO training

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring


1️⃣ One-sentence summary

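This paper proposes VLESA, a safety-monitoring framework that analyzes egocentric (first-person) video to identify, in real time, dangerous actions a person is about to take. It distinguishes how safe the same action is under different intents (for example, a knife is safe when used to chop vegetables but dangerous when pointed at a person), so that it can trigger a safety intervention at the critical moment and substantially improve the safety of embodied AI systems in the physical world.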
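
To make the idea of intent-dependent safety concrete, here is a minimal Python sketch of a goal-conditioned filter. It is not the paper's implementation: the score table, the function names (`toy_q`, `filter_actions`, `should_intervene`), the goal and action strings, and the 0.5 threshold are all hypothetical, and a hand-written lookup table stands in for the learned, GRPO-trained Q-filter.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical safety scores in [0, 1], keyed by (goal, action).
# These numbers are invented for illustration; in the paper this role
# is played by a learned Q-filter, not a lookup table.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("chop vegetables", "swing knife downward"): 0.9,
    ("chop vegetables", "point knife at person"): 0.05,
    ("play-fight with a friend", "swing knife downward"): 0.1,
    ("play-fight with a friend", "put knife down"): 0.95,
}


def toy_q(goal: str, action: str) -> float:
    """Return the toy safety score; unknown pairs get a neutral 0.5."""
    return TOY_SCORES.get((goal, action), 0.5)


def filter_actions(
    goal: str,
    candidates: List[str],
    q: Callable[[str, str], float] = toy_q,
    threshold: float = 0.5,
) -> List[str]:
    """Keep candidate actions scoring at least `threshold`, safest first."""
    scored = [(action, q(goal, action)) for action in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [action for action, score in scored if score >= threshold]


def should_intervene(
    goal: str,
    predicted_action: str,
    q: Callable[[str, str], float] = toy_q,
    threshold: float = 0.5,
) -> bool:
    """Flag a predicted action whose score falls below `threshold`."""
    return q(goal, predicted_action) < threshold
```

In this toy table the same action, `"swing knife downward"`, passes the filter under the goal `"chop vegetables"` and is flagged under `"play-fight with a friend"`. Changing the goal changes the verdict without changing the scoring function, which is the behavior the abstract attributes to goal conditioning.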

Source: arXiv: 2606.03954