
arXiv submission date: 2026-03-16
📄 Abstract - Panoramic Affordance Prediction

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration of Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 x 5952) panoramic images with over 12K carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system that tackles the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
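The "recursive visual routing via grid prompting" described in the abstract can be illustrated with a minimal sketch: partition the current region into a grid, score each cell for target relevance, and recurse into the best cell until the region is small enough for fine-grained grounding. The scoring function below is a toy stand-in for the vision-language model the paper uses; the grid size, stopping threshold, and all function names are illustrative assumptions, not the paper's actual implementation (panoramic wraparound at the image seam is also ignored here for simplicity).

```python
# Hedged sketch of coarse-to-fine recursive grid routing.
# Assumes a cell-scoring function (a stub below); all names are illustrative.
# Cells are (left, top, right, bottom) tuples in pixel coordinates.

def recursive_grid_route(score_cell, region, grid=3, min_size=512):
    """Recursively zoom into the highest-scoring grid cell until small enough."""
    left, top, right, bottom = region
    if right - left <= min_size or bottom - top <= min_size:
        return region  # fine enough: hand off to distortion-aware grounding
    cell_w = (right - left) / grid
    cell_h = (bottom - top) / grid
    cells = [
        (left + c * cell_w, top + r * cell_h,
         left + (c + 1) * cell_w, top + (r + 1) * cell_h)
        for r in range(grid) for c in range(grid)
    ]
    best = max(cells, key=score_cell)  # in PAP, a VLM prompted with gridded crops
    return recursive_grid_route(score_cell, best, grid, min_size)

# Toy example: the "target" sits near (9000, 4000) on an 11904 x 5952 panorama.
target = (9000.0, 4000.0)

def toy_score(cell):
    # Higher score for cells whose center is closer to the target (VLM stand-in).
    cx = (cell[0] + cell[2]) / 2
    cy = (cell[1] + cell[3]) / 2
    return -((cx - target[0]) ** 2 + (cy - target[1]) ** 2)

roi = recursive_grid_route(toy_score, (0, 0, 11904, 5952))
```

With this toy scorer, the routine converges on a small region containing the target after a few recursive steps, which mirrors the foveal "zoom-in" behavior the abstract describes.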

Top-level tags: computer vision, agents, benchmark
Detailed tags: affordance prediction, panoramic vision, embodied ai, dataset, visual grounding

Panoramic Affordance Prediction


1️⃣ One-Sentence Summary

This paper is the first to propose and address affordance prediction in panoramic images: by constructing a large-scale dataset and designing a training-free, coarse-to-fine, biologically inspired visual processing pipeline, it significantly improves AI agents' holistic perception of, and interaction with, 360-degree panoramic environments.

Source: arXiv:2603.15558