PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
1️⃣ One-Sentence Summary
This paper proposes a new approach that processes human first-person (egocentric) videos at scale, converting them into structured training data that robots can learn from, thereby substantially improving robots' understanding of the physical world and their task-planning ability.
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. Training on E2E-3M yields an egocentric-aware embodied brain, termed PhysBrain. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
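To make the idea of "multi-level, schema-driven VQA supervision with evidence grounding and temporal consistency" concrete, here is a minimal illustrative sketch of what one such supervision record and its consistency check could look like. The field names, levels, and validation rule below are assumptions for illustration, not the paper's actual schema.

```python
# Illustrative sketch only: a hypothetical schema for one VQA supervision record
# produced from an egocentric clip, plus a simple temporal-consistency check.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvidenceSpan:
    start_s: float  # evidence start time within the source clip, in seconds
    end_s: float    # evidence end time within the source clip, in seconds

@dataclass
class VQARecord:
    clip_id: str                 # source egocentric video clip identifier
    level: str                   # e.g. "state-change", "contact", "long-horizon-plan" (assumed levels)
    question: str
    answer: str
    evidence: List[EvidenceSpan] = field(default_factory=list)

def temporally_consistent(rec: VQARecord, clip_len_s: float) -> bool:
    """Reject records whose cited evidence falls outside the clip or is ill-ordered."""
    return all(0.0 <= e.start_s < e.end_s <= clip_len_s for e in rec.evidence)

# Example: a contact-level question grounded in a ~2-second evidence window.
rec = VQARecord(
    clip_id="ego_000123",
    level="contact",
    question="Which hand first makes contact with the mug?",
    answer="The right hand grasps the mug handle.",
    evidence=[EvidenceSpan(start_s=3.2, end_s=5.1)],
)
assert temporally_consistent(rec, clip_len_s=12.0)
```

Records failing such a grounding or consistency check would be filtered out before training, which is one plausible way to enforce the "reliable supervision" requirement described above.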
Source: arXiv:2512.16793