arXiv submission date: 2025-12-09
📄 Abstract - Thinking with Images via Self-Calling Agent

Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task to atomic subtasks and invokes its virtual replicas, i.e. parameter-sharing subagents, to solve them in isolated context. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior to enhance optimization. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at this https URL.

Top-level tags: multi-modal agents model training
Detailed tags: visual reasoning agent coordination reinforcement learning efficient training high-resolution vision

Self-Calling Chain-of-Thought: A New Agent-Coordination Paradigm for Efficient Visual Reasoning / Thinking with Images via Self-Calling Agent


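1️⃣ One-sentence summary

This paper proposes Self-Calling Chain-of-Thought (sCoT), a new visual reasoning paradigm that recasts a complex cross-modal reasoning task as a sequence of language-only atomic subtasks coordinated by a main agent and optimizes the whole process end to end with reinforcement learning, substantially reducing training cost while improving reasoning performance on high-resolution visual tasks.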
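The self-calling loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Subtask`, `crop`, `self_calling_answer`, the `plan` callback, and the bounding-box convention `(left, top, right, bottom)` are all hypothetical, and the image is a toy nested list. The one property it tries to capture is that the same `model` callable serves as both main agent and subagent, and each subagent call sees only its own subtask and image region.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom); assumed convention
Image = List[List[int]]           # toy image: rows of pixel values
Model = Callable[[str, Image], str]


@dataclass
class Subtask:
    question: str  # language-only atomic question
    box: Box       # image region the subagent is allowed to see


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def self_calling_answer(
    model: Model,
    plan: Callable[[str, Image], List[Subtask]],
    image: Image,
    question: str,
) -> str:
    """Decompose, call the same model on each subtask, then aggregate."""
    # Hypothetical planner; in a real system the main agent would emit the plan.
    subtasks = plan(question, image)

    findings = []
    for task in subtasks:
        # Isolated context: the subagent gets only its question and its crop,
        # not the main agent's history or the other subagents' outputs.
        findings.append(model(task.question, crop(image, task.box)))

    # The main agent continues with a text-only summary of the findings.
    summary = "\n".join(
        f"- {task.question} -> {finding}"
        for task, finding in zip(subtasks, findings)
    )
    return model(f"{question}\nSubagent findings:\n{summary}", image)
```

Because subagent outputs return to the main agent as plain text, the main agent's reasoning trace stays language-only, which is the property the summary attributes to sCoT.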


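2️⃣ Innovations of the paper

1. The Self-Calling Chain-of-Thought (sCoT) paradigm

2. Agent-coordination optimization based on reinforcement learning

3. Virtual subagents and a structured tool-calling protocol

4. Bounding-box context augmentation and targeted data design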
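For item 2, the abstract names group-relative policy optimization as the reinforcement learning method. The sketch below shows only the group-relative advantage step as it is commonly formulated: each sampled response's reward is normalized by the mean and standard deviation of its own sampling group. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; the paper's reward design is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses sampled for the same prompt.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat their group's average receive a positive advantage and are reinforced, while a group with identical rewards yields all-zero advantages and contributes no learning signal.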


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv: 2512.08511