arXiv submission date: 2026-05-11
📄 Abstract - StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
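The abstract describes encoding each stereo image independently with a pretrained 2D encoder and fusing the two token streams with a "Stereo Transformer" so that attention can implicitly match features across views. The paper does not specify the architecture details; the sketch below is a minimal, hypothetical illustration of that fusion pattern (all names, dimensions, and the mean-pooling readout are assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn

class StereoFusion(nn.Module):
    """Hypothetical sketch of stereo token fusion.

    Left/right images are assumed to be encoded separately (e.g. by a frozen
    pretrained 2D encoder) into token sequences; the sequences are tagged with
    a learned view embedding, concatenated, and passed through a shared
    transformer so self-attention can relate corresponding features across
    the two views -- capturing disparity cues implicitly, with no explicit
    3D reconstruction or calibration.
    """

    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=layers)
        # One learned embedding per view, so tokens know which camera they came from.
        self.view_embed = nn.Parameter(torch.zeros(2, dim))

    def forward(self, left_tokens: torch.Tensor,
                right_tokens: torch.Tensor) -> torch.Tensor:
        # left_tokens, right_tokens: (B, N, dim) from the per-image encoder.
        l = left_tokens + self.view_embed[0]
        r = right_tokens + self.view_embed[1]
        fused = self.fuser(torch.cat([l, r], dim=1))  # (B, 2N, dim)
        # Pool into a single feature the downstream policy head could consume.
        return fused.mean(dim=1)                       # (B, dim)

B, N, D = 2, 16, 64
fusion = StereoFusion(dim=D)
out = fusion(torch.randn(B, N, D), torch.randn(B, N, D))
print(out.shape)  # torch.Size([2, 64])
```

In the actual framework this fused representation would condition a diffusion-based or VLA policy; here it is simply pooled to show the data flow.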

Top-level tags: robotics, computer vision
Detailed tags: visuomotor policy, stereo perception, imitation learning, manipulation, geometric reasoning

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception


1️⃣ One-Sentence Summary

This paper proposes StereoPolicy, a robot learning framework that uses synchronized stereo image pairs from two cameras to implicitly capture spatial depth and object shape, without relying on explicit 3D reconstruction or camera calibration, thereby significantly improving the precision and robustness of robotic grasping and manipulation in complex scenes.

Source: arXiv: 2605.09989