arXiv submission date: 2025-12-18
📄 Abstract - MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
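The abstract does not publish the graph schema, but a minimal sketch helps make the representation concrete: a single graph whose nodes carry mutable object states, whose nodes can be part-level interactive elements, and whose edges mix spatial and functional relations. All class and field names below (`Node`, `Edge`, `SceneGraph`, `kind`, `state`) are illustrative assumptions, not the paper's actual data format.

```python
# Illustrative sketch only: names and fields are assumptions about what a
# state-aware unified scene graph could contain, not MomaGraph's released schema.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                    # e.g. "fridge", "fridge_handle"
    is_part: bool = False                        # part-level interactive element?
    state: dict = field(default_factory=dict)    # e.g. {"open": False}

@dataclass
class Edge:
    src: str
    dst: str
    kind: str     # "spatial" (e.g. "inside") or "functional" (e.g. "opens")
    label: str

class SceneGraph:
    """Unified graph mixing spatial and functional relations plus object states."""
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def add_edge(self, src: str, dst: str, kind: str, label: str) -> None:
        self.edges.append(Edge(src, dst, kind, label))

    def update_state(self, name: str, **changes) -> None:
        # Temporal update: mutate an object's state as the agent acts,
        # rather than treating the scene as a static snapshot.
        self.nodes[name].state.update(changes)

# Example: a fridge with an actionable handle, updated after the agent opens it.
g = SceneGraph()
g.add_node(Node("fridge", state={"open": False}))
g.add_node(Node("fridge_handle", is_part=True))
g.add_node(Node("milk"))
g.add_edge("fridge_handle", "fridge", "functional", "opens")
g.add_edge("milk", "fridge", "spatial", "inside")
g.update_state("fridge", open=True)
```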

Top-level tags: multi-modal agents, robotics
Detailed tags: scene graph, embodied AI, task planning, vision-language model, reinforcement learning

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning


1️⃣ One-Sentence Summary

This paper proposes MomaGraph, a unified scene representation that integrates spatial relations, functional relations, and object states; releases the first large-scale dataset of task-driven scene graphs together with an evaluation benchmark; and trains a vision-language model that performs zero-shot task planning from scene graphs, significantly improving task-planning performance for mobile manipulators in household environments.
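The planner side of this summary follows the abstract's Graph-then-Plan framework: the model first predicts a task-oriented scene graph, then plans zero-shot over it. A hedged sketch of that two-stage control flow is below; the function names (`predict_scene_graph`, `plan_from_graph`) and the example output are hypothetical, not the released MomaGraph-R1 API.

```python
# Hypothetical two-stage "Graph-then-Plan" flow; function names, signatures,
# and the example plan are assumptions, not the paper's actual interface.
def graph_then_plan(vlm, observation, task: str) -> list[str]:
    # Stage 1: the VLM predicts a task-oriented scene graph, keeping only
    # the objects, parts, and states relevant to the current task.
    scene_graph = vlm.predict_scene_graph(observation, task)

    # Stage 2: the same model plans zero-shot over the predicted graph,
    # producing an ordered action sequence for the mobile manipulator.
    return vlm.plan_from_graph(scene_graph, task)

# e.g. graph_then_plan(momagraph_r1, kitchen_image, "put the milk on the table")
# might yield: ["navigate(fridge)", "open(fridge_handle)", "pick(milk)", ...]
```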


Source: arXiv:2512.16909