arXiv submission date: 2026-03-16
📄 Abstract - From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces mean absolute error by 50% relative to specialized reasoning baselines and shows significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
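
To make the headline metric concrete, here is a minimal sketch of how mean absolute error (MAE) on progress estimates, and the relative reduction between two models, could be computed. The 0-100 progress scale, the sample values, and the function names are illustrative assumptions; none of them come from the paper or the PRIMO Benchmark.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and true task progress."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction by which model_error improves on baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical progress labels on an assumed 0-100 scale (not real benchmark data).
ground_truth = [10.0, 40.0, 70.0, 100.0]
baseline_pred = [30.0, 20.0, 90.0, 80.0]  # every estimate is off by 20
model_pred = [20.0, 30.0, 80.0, 90.0]     # every estimate is off by 10

baseline_mae = mean_absolute_error(baseline_pred, ground_truth)
model_mae = mean_absolute_error(model_pred, ground_truth)
reduction = relative_reduction(baseline_mae, model_mae)
```

With these made-up numbers the baseline MAE is 20, the model MAE is 10, and the relative reduction is 0.5, which is the kind of comparison a "50% reduction in MAE" describes.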

Top-level tags: robotics, reinforcement learning, multi-modal
Detailed tags: process reasoning, video MLLM, robotic manipulation, reinforcement learning fine-tuning, benchmark evaluation

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation


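1️⃣ One-sentence summary

This paper proposes a new framework called PRIMO R1, which uses reinforcement learning to train a small video model so that it shifts from an "observer" that merely recognizes actions to a "critic" that actively evaluates task progress, yielding more accurate process monitoring and state assessment in complex robotic manipulation tasks.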

Source: arXiv: 2603.15600