arXiv submission date: 2026-05-26
📄 Abstract - Object Pose and Shape Estimation for Grasping: Does it Work?

Object pose and shape estimation has seen key advances recently. Encoder-decoder models (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask: are object pose and shape estimation methods mature enough that, when combined with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail, scoping our study to parallel-jaw grippers, 7-DoF grasps, and a single-view RGB(-D) image as input. We implement and compare a state-of-the-art end-to-end grasp synthesis method and three modular methods, which first estimate the pose and shape of every object in the scene and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods synthesize plenty of grasps even for small objects, where the end-to-end methods fail. Their effectiveness is contingent on the accuracy of the pose and shape estimation and degrades partially in cluttered scenes, a limitation of existing pose and shape estimation methods. We also analyze the failure modes and run-times of the three modular methods, which use two different approaches to object pose and shape estimation: one based on an encoder-decoder model and the other on a diffusion model. Finally, we demonstrate that single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from a single-view RGB-D image alone. We observe performance comparable to the state-of-the-art LERF-TOGO baseline.
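To make the "antipodal sampling" step in the abstract concrete, here is a minimal sketch of one common formulation. It is not the paper's implementation. It assumes the estimated shape is available as surface points with outward unit normals, and that a point pair is graspable when the line between the contacts lies inside both friction cones. The function names, the friction coefficient `mu`, and the gripper opening `max_width` are illustrative assumptions.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, mu=0.5):
    """Check the friction-cone condition for two contacts.

    p1, p2: contact positions; n1, n2: outward unit surface normals.
    mu is an assumed Coulomb friction coefficient.
    """
    axis = p2 - p1
    width = np.linalg.norm(axis)
    if width < 1e-9:
        return False
    axis = axis / width
    cos_cone = np.cos(np.arctan(mu))  # cosine of the friction cone half-angle
    # The closing direction must lie within the cone around each inward normal.
    return bool(np.dot(-n1, axis) >= cos_cone and np.dot(n2, axis) >= cos_cone)

def sample_antipodal_grasps(points, normals, mu=0.5, max_width=0.08,
                            n_samples=1000, seed=0):
    """Randomly sample point pairs and keep those that pass both checks.

    Returns a list of (i, j, width) tuples indexing into `points`.
    max_width is an assumed maximum gripper opening in metres.
    """
    rng = np.random.default_rng(seed)
    grasps = []
    for _ in range(n_samples):
        i, j = rng.choice(len(points), size=2, replace=False)
        width = float(np.linalg.norm(points[j] - points[i]))
        if width > max_width:
            continue  # the pair does not fit between the jaws
        if is_antipodal(points[i], normals[i], points[j], normals[j], mu):
            grasps.append((int(i), int(j), width))
    return grasps
```

Each accepted pair fixes the grasp centre, the closing axis, and the opening width. A full 7-DoF grasp would also need an approach direction around that axis and a collision check, which this sketch leaves out.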

Top-level tags: robotics, computer vision, agents
Detailed tags: grasping, object pose estimation, shape estimation, modular methods, evaluation

Object Pose and Shape Estimation for Grasping: Does it Work?


1️⃣ One-sentence summary

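This paper systematically compares modular methods, which first estimate object pose and shape and then sample grasp points, with end-to-end methods that generate grasps directly, and finds that the modular methods perform better in all tests, notably on small objects, although their performance depends on the accuracy of pose and shape estimation and drops somewhat in cluttered scenes.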

Source: arXiv: 2605.26944