📄 Abstract - Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at this https URL.
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning
1️⃣ One-Sentence Summary
Agent0-VL is a novel vision-language model framework that unifies two synergistic roles, a Solver and a Verifier, within a single model and combines tool-based verification with reinforcement learning to achieve closed-loop self-improvement without any external reward.
2️⃣ Key Innovations
1. Unified Dual-Role Architecture
- Innovation: Integrates Solver and Verifier modes in a single vision-language model; the Solver performs multi-turn tool-integrated reasoning, while the Verifier generates structured feedback and fine-grained self-rewards through tool-grounded critique
- Difference/Improvement: Avoids the separate verifier model that conventional methods must train, realizing a unified policy design that evolves autonomously
- Significance: The interaction between the two roles forms a Self-Evolving Reasoning Cycle that aligns the reasoning and evaluation distributions (a minimal sketch of the role switching follows this item)
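The sketch below illustrates, under loose assumptions, how one shared LVLM could switch between the two roles purely through its system prompt; `call_lvlm`, the prompt texts, and the message format are hypothetical placeholders, not Agent0-VL's actual interface.

```python
# Minimal sketch: one model, two roles selected only by the system prompt.
# `call_lvlm` is a hypothetical wrapper around the single shared LVLM.

SOLVER_PROMPT = (
    "You are the Solver. Reason step by step over the image and question, "
    "calling tools (e.g. a Python interpreter) when a step needs computation."
)
VERIFIER_PROMPT = (
    "You are the Verifier. Re-examine the Solver's trajectory, re-run checkable "
    "steps with tools, and return JSON {\"score\", \"confidence\", \"critique\"}."
)

def call_lvlm(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical backend call to the shared vision-language model."""
    raise NotImplementedError

def solve(image, question) -> list[dict]:
    """Solver role: multi-turn tool-integrated reasoning (condensed to one turn here)."""
    trajectory = [{"role": "user", "content": [image, question]}]
    trajectory.append({"role": "assistant", "content": call_lvlm(SOLVER_PROMPT, trajectory)})
    return trajectory

def verify(trajectory: list[dict]) -> str:
    """Verifier role: tool-grounded critique of the same trajectory."""
    return call_lvlm(VERIFIER_PROMPT, trajectory)
```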
2. Tool-Augmented Self-Evaluation and Self-Repair
- Innovation: Extends external tool use into the self-evaluation and self-repair process, enabling the model to analyze, critique, and refine its reasoning in a verifiable way
- Difference/Improvement: Addresses the limitations of purely text-based self-evaluation on complex visual reasoning, including limited evaluation capability and an unreliable evaluation process
- Significance: Enables closed-loop self-improvement under zero external-reward supervision and overcomes the constraints of human-annotated supervision (see the verify-and-repair sketch after this item)
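Building on the sketch above, a verify-then-repair loop might look like the following; the JSON feedback format, the acceptance threshold, the retry count, and the `repair` helper are all assumptions for illustration.

```python
# Sketch of tool-grounded self-evaluation and self-repair (assumed interface).
import json

def repair(trajectory: list[dict], feedback: dict) -> list[dict]:
    """Hypothetical repair step: ask the Solver to revise using the critique."""
    trajectory.append({"role": "user", "content": f"Revise your solution: {feedback['critique']}"})
    trajectory.append({"role": "assistant", "content": call_lvlm(SOLVER_PROMPT, trajectory)})
    return trajectory

def self_repair_loop(image, question, max_rounds: int = 3, threshold: float = 0.8):
    trajectory = solve(image, question)
    feedback = json.loads(verify(trajectory))           # {"score", "confidence", "critique"}
    for _ in range(max_rounds):
        if feedback["score"] >= threshold:              # evidence-grounded check passed
            break
        trajectory = repair(trajectory, feedback)       # refine using the critique
        feedback = json.loads(verify(trajectory))
    return trajectory, feedback
```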
3. Self-Evolving Reasoning Cycle (SERC)
- Innovation: A two-level training framework in which tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions, with an inner loop for data generation and an outer loop for policy evolution
- Difference/Improvement: Shifts the learning objective from static reward maximization to a distributional self-consistency process, yielding stable self-improvement
- Significance: Achieves a 12.5% improvement over the base model on geometric problem solving and visual scientific analysis (a structural sketch of the two loops follows this item)
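The following structural sketch shows how the two levels could fit together: the inner loop generates self-rewarded trajectories with the verify-and-repair routine above, and the outer loop evolves the shared policy with an RL update such as GRPO; `grpo_update` and the grouping scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Structural sketch of one SERC iteration (all names illustrative).

def serc_iteration(policy, tasks, group_size: int = 8):
    groups = []
    # Inner loop: self-generated data, no human labels or external reward model.
    for image, question in tasks:
        group = []
        for _ in range(group_size):
            trajectory, feedback = self_repair_loop(image, question)
            group.append((trajectory, feedback["score"]))   # Verifier score as self-reward
        groups.append(group)
    # Outer loop: evolve the shared policy from its own verified experience.
    return grpo_update(policy, groups)                      # hypothetical RL update
```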
4. Tool-Grounded Generative Verification
- Innovation: The Verifier combines linguistic reflection with executable tool evidence to generate a feedback tuple containing a score, a confidence, and a natural-language critique
- Difference/Improvement: Turns verification from a static correctness check into a dynamic evaluation process
- Significance: Provides dense and interpretable feedback signals for reinforcement learning (one plausible shape of the feedback tuple is sketched below)
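One plausible container for that feedback tuple is sketched below; the paper specifies a score, a confidence, and a natural-language critique, while the field names and the extra tool-evidence field are assumptions.

```python
# Illustrative structure for the Verifier's feedback (field names assumed).
from dataclasses import dataclass, field

@dataclass
class VerifierFeedback:
    score: float                  # fine-grained self-reward, e.g. in [0, 1]
    confidence: float             # Verifier's confidence in its own judgment
    critique: str                 # natural-language analysis of the trajectory
    tool_evidence: list[str] = field(default_factory=list)  # e.g. interpreter outputs

feedback = VerifierFeedback(score=0.9, confidence=0.85,
                            critique="Step 3 recomputed with a symbolic solver; result matches.")
```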
3️⃣ Main Results and Value
Result Highlights
- Across 7 multimodal reasoning benchmarks, Agent0-VL-7B improves over the base model by 12.5% on average, and when used as a process reward model it improves test-time performance by 7.3% on average
- On mathematical reasoning benchmarks the 7B and 8B models improve by 18.1% and 7.4% respectively, and on perception tasks they reduce visual hallucination with gains of 12.2% and 3.1% respectively
- Three consecutive iterations bring performance gains of 5.2%, 4.0%, and 2.8% respectively, demonstrating stable iterative self-improvement
- Ablation studies confirm the necessity of the SERC framework, tool use, and the self-repair module; removing any component causes a significant performance drop
Practical Value
- Can serve as a process reward model that assigns rewards to the outputs of external models and performs trajectory selection, giving stronger trajectory selection and finer step-level discrimination for vision-language models of various scales (a best-of-N sketch follows this list)
- Performs strongly on complex mathematical reasoning, visual scientific analysis, and fact verification, making it well suited to applications that require multi-step reasoning and factual consistency
- Achieves autonomous improvement under zero external-reward supervision, reducing the reliance of model optimization on human annotation
- The unified architecture reduces deployment complexity: a single model covers the full pipeline of reasoning, verification, and self-repair
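As a concrete illustration of the process-reward-model usage mentioned above, best-of-N selection over an external model's candidate trajectories might look like this; `score_trajectory` is a hypothetical helper that maps the Verifier's feedback to a scalar reward.

```python
# Best-of-N trajectory selection with Agent0-VL acting as a process reward model.
import json

def score_trajectory(trajectory: list[dict]) -> float:
    """Hypothetical helper: scalar reward extracted from the Verifier's feedback."""
    return json.loads(verify(trajectory))["score"]

def best_of_n(candidates: list[list[dict]]) -> list[dict]:
    """Keep the candidate trajectory with the highest Verifier reward."""
    return max(candidates, key=score_trajectory)
```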
4️⃣ Glossary
- Agent0-VL: A self-evolving vision-language agent that achieves continual self-improvement through tool-integrated reasoning, unifying reasoning, verification, and self-repair in a single model
- SERC: Self-Evolving Reasoning Cycle, a two-level training framework comprising an inner reasoning-and-repair loop and an outer reinforcement learning optimization loop, allowing the model to improve continually across iterations
- GRPO: Group Relative Policy Optimization, a PPO variant for generative tasks that optimizes the policy with group-relative advantages (see the sketch after this glossary)
- Tool-grounded generative verification: A verification mechanism that combines linguistic reflection with tool evidence, producing dense and interpretable feedback signals for reinforcement learning
- Unified dual-role architecture: Integrates Solver and Verifier modes in a single vision-language model, achieving an autonomously evolving unified policy design through role switching
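For reference, the group-relative advantage that GRPO uses in place of a learned value baseline can be computed as below: each reward in a group of responses to the same prompt is normalized by the group's mean and standard deviation. The reward values in the example are illustrative.

```python
# Group-relative advantage used by GRPO: normalize each response's reward
# against the statistics of its own sampling group (same prompt, N responses).
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_relative_advantages([0.9, 0.4, 0.7, 0.4]))  # illustrative self-rewards
```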