arXiv submission date: 2026-06-09
📄 Abstract - Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics. This pinpoints a critical bottleneck for the development of practical embodied AI.
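The two evaluation dimensions named in the abstract can be illustrated with a minimal sketch. The abstract does not state the benchmark's exact scoring rules, so everything here is an assumption: the `QueryResult` record, the function names, micro-averaged recall for tool recognition, and exact sequence match for end-to-end planning success are all hypothetical choices, not PhysTool-Bench's published protocol.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Hypothetical record for one benchmark query (not the paper's schema)."""
    gold_tools: set[str]       # tools actually present in the scene
    predicted_tools: set[str]  # tools the model claims to see
    gold_plan: list[str]       # reference tool-use sequence
    predicted_plan: list[str]  # model's proposed tool-use sequence


def recognition_recall(results: list[QueryResult]) -> float:
    """Fraction of all gold tools that the model identified (micro-averaged)."""
    total = sum(len(r.gold_tools) for r in results)
    found = sum(len(r.gold_tools & r.predicted_tools) for r in results)
    return found / total if total else 0.0


def end_to_end_success(results: list[QueryResult]) -> float:
    """Fraction of queries whose predicted plan matches the reference exactly.

    Exact match is an assumed criterion; a real benchmark may accept
    several valid plans per query.
    """
    if not results:
        return 0.0
    solved = sum(r.predicted_plan == r.gold_plan for r in results)
    return solved / len(results)
```

Under these assumptions, a model can score well on recognition and still fail end-to-end, since a plan counts only if every step is right. That is one way the gap between the two reported figures (58.7% versus 21.0%) could arise.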

Top-level tags: multi-modal evaluation robotics
Detailed tags: benchmark physical tool use embodied ai perception planning

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use


1️⃣ One-sentence summary

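This paper introduces PhysTool-Bench, the first benchmark dedicated to evaluating how well multimodal large language models identify physical tools and plan their use in real-world scenes, and finds that current state-of-the-art models fall well short in both tool perception and functional commonsense reasoning, completing only about one-fifth of tasks, which exposes a key bottleneck for the development of embodied AI.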

Source: arXiv: 2606.10803