KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
1️⃣ One-sentence summary
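This paper proposes KITE, a training-free method that automatically condenses long robot-operation videos into a compact, interpretable "evidence package" of key action frames paired with schematic object-layout diagrams, so that general-purpose vision-language models can analyze the type, location, and cause of failures in robot tasks more accurately and efficiently.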
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page.
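To make the pipeline in the abstract more concrete, here is a minimal Python sketch of its two front-end steps: picking motion-salient keyframes and serializing them with robot-profile and scene-context tokens into one text prompt. Everything specific here is an assumption for illustration, not the authors' implementation: the frame-difference saliency score, the `Detection` fields, the bracketed token format, and the function names `motion_scores`, `select_keyframes`, and `serialize_evidence` are all hypothetical. Frames are represented as flat lists of pixel intensities to keep the example free of dependencies.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Detection:
    """One open-vocabulary detection placed on a schematic BEV plane (hypothetical fields)."""
    label: str
    x: float           # assumed ground-plane coordinate
    y: float           # assumed ground-plane coordinate
    confidence: float


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between consecutive frames; the first frame scores 0."""
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def select_keyframes(frames: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-motion frames, in temporal order."""
    scores = motion_scores(frames)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def serialize_evidence(
    robot_profile: str,
    scene_context: str,
    keyframes: Sequence[int],
    timestamps: Sequence[float],
    detections: Dict[int, List[Detection]],
) -> str:
    """Flatten profile, scene, and per-keyframe layout cues into one prompt string."""
    lines = [f"[ROBOT] {robot_profile}", f"[SCENE] {scene_context}"]
    for idx in keyframes:
        lines.append(f"[KEYFRAME {idx}] t={timestamps[idx]:.2f}s")
        for det in detections.get(idx, []):
            lines.append(
                f"  [OBJ] {det.label} bev=({det.x:.2f},{det.y:.2f}) "
                f"conf={det.confidence:.2f}"
            )
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy trajectory: five 2-pixel frames with two bursts of motion.
    frames = [[0, 0], [0, 0], [5, 5], [5, 5], [9, 9]]
    timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]
    keyframes = select_keyframes(frames, k=2)
    detections = {2: [Detection("cup", 0.10, 0.20, 0.90)]}
    print(serialize_evidence("dual-arm", "tabletop", keyframes, timestamps, detections))
```

In a real system the resulting text would accompany the keyframe and BEV images in the VLM prompt; the sketch only covers the textual serialization.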
Source: arXiv:2604.07034