arXiv submission date: 2026-01-21
📄 Abstract - Rethinking Video Generation Model for the Embodied World

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
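The abstract validates the benchmark by its Spearman correlation with human evaluations. As a rough illustration of what that statistic measures, the sketch below computes Spearman's rho as the Pearson correlation of two rank vectors. The function names and the five score pairs are hypothetical and are not taken from the paper, and the sketch assumes one aggregate score per model, which the abstract does not specify.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models (illustrative only, not from the paper).
benchmark_scores = [0.61, 0.48, 0.72, 0.35, 0.55]
human_scores = [3.9, 3.1, 4.4, 2.2, 3.0]
rho = spearman(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks enter the calculation, a value near 1 means the automatic metric orders models almost the same way human raters do, regardless of the scales the two scores use.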

Top-level tags: video generation, robotics, benchmark
Detailed tags: embodied AI, dataset, evaluation metrics, physical realism, synthetic data

Rethinking Video Generation Model for the Embodied World


1️⃣ One-sentence summary

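This paper introduces RBench, a standardized benchmark for evaluating robot video generation, and RoVid-X, a large open-source dataset of 4 million annotated video clips, to address the difficulty existing models have in generating physically realistic robot behavior and to provide a foundation for evaluating and training embodied AI.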

Source: arXiv: 2601.15282