AdaTooler-V: Adaptive Tool-Use for Images and Videos
1️⃣ One-Sentence Summary
This paper proposes AdaTooler-V, a multimodal large model that intelligently decides when a problem genuinely requires calling vision tools. By cutting unnecessary computation while significantly improving reasoning accuracy on image and video tasks, it even surpasses top commercial models such as GPT-4o.
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, which outperforms existing methods on diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
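The abstract describes AT-GRPO only at a high level: the tool-use reward is scaled per sample by a Tool Benefit Score, so tool calls are reinforced only where tools actually help. A minimal sketch of that idea is below; the function names, the benefit definition (accuracy gain from tools), and the linear scaling are all assumptions for illustration, not the paper's actual formulas.

```python
# Hypothetical sketch of the reward-scaling idea behind AT-GRPO, based only on
# the abstract. All names and formulas here are assumptions, not the paper's
# definitions.

def tool_benefit_score(acc_with_tools: float, acc_without_tools: float) -> float:
    """Per-sample benefit of invoking tools: positive if tools help."""
    return acc_with_tools - acc_without_tools


def shaped_reward(correct: bool, used_tools: bool, benefit: float,
                  base: float = 1.0, scale: float = 0.5) -> float:
    """Correctness reward plus a tool-use term scaled by the benefit.

    Tool calls are rewarded when the sample benefits from tools (benefit > 0)
    and penalized when they are unnecessary (benefit <= 0), which is what
    discourages blind tool invocation.
    """
    reward = base if correct else 0.0
    if used_tools:
        reward += scale * benefit  # positive only when tools genuinely help
    return reward
```

Under this sketch, a correct answer that used tools on a tool-beneficial sample earns more than the base reward, while the same tool call on a sample that tools do not help earns less, pushing the policy toward adaptive rather than blind tool use.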
Source: arXiv: 2512.16918