VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

📄 Abstract

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount: physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety, where identical actions can be safe or dangerous depending on context. We introduce a dataset pairing egocentric frames with goal-conditioned safety annotations, which enables a goal-conditioned safety Q-filter, trained via GRPO, that evaluates actions with respect to inferred intent without retraining. We also propose an intent-action prediction agent that jointly infers goals and predicts future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame than baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at this https URL.


1️⃣ One-sentence summary

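The paper proposes VLESA, a safety-monitoring framework that analyzes first-person video to identify, in real time, dangerous actions a person is about to take. It distinguishes the safety of the same action under different intents (for example, a knife is safe when chopping vegetables but dangerous when pointed at a person), so that it can trigger a safety intervention at the critical moment and substantially improve the safety of embodied AI systems in the physical world.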
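
The idea of intent-dependent safety and goal-conditioned constrained decoding can be illustrated with a minimal sketch. This is not the paper's implementation: the lookup table `TOY_SAFETY_Q` stands in for the learned, GRPO-trained Q-filter, and the names `toy_safety_q`, `constrained_select`, `should_intervene`, the goal and action strings, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a learned goal-conditioned safety Q-filter.
# The scores are invented; the same action gets a different score
# depending on the inferred goal.
TOY_SAFETY_Q: Dict[Tuple[str, str], float] = {
    ("chop vegetables", "swing knife downward"): 0.9,
    ("hand knife to a person", "swing knife downward"): 0.1,
    ("hand knife to a person", "offer knife handle first"): 0.9,
    ("chop vegetables", "point knife at person"): 0.05,
}


def toy_safety_q(goal: str, action: str) -> float:
    """Return an assumed safety score in [0, 1]; unknown pairs get 0.5."""
    return TOY_SAFETY_Q.get((goal, action), 0.5)


def should_intervene(
    goal: str,
    predicted_action: str,
    safety_q: Callable[[str, str], float] = toy_safety_q,
    threshold: float = 0.5,
) -> bool:
    """Flag an intervention when the predicted action scores below threshold."""
    return safety_q(goal, predicted_action) < threshold


def constrained_select(
    goal: str,
    candidates: List[Tuple[str, float]],
    safety_q: Callable[[str, str], float] = toy_safety_q,
    threshold: float = 0.5,
) -> Optional[str]:
    """Drop candidates scored unsafe for this goal, then return the most
    likely remaining action, or None if every candidate was filtered out."""
    safe = [(a, p) for a, p in candidates if safety_q(goal, a) >= threshold]
    if not safe:
        return None
    return max(safe, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    action = "swing knife downward"
    for goal in ("chop vegetables", "hand knife to a person"):
        print(goal, "->", "intervene" if should_intervene(goal, action) else "ok")
```

In this sketch the filter is applied at selection time only, so changing the inferred goal changes which actions pass without touching the scoring function, which mirrors the abstract's claim that the filter evaluates actions with respect to inferred intent without retraining.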
