菜单

🤖 系统
📄 Abstract - Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves an 12.5% improvement over the base model. Our code is available at this https URL.

顶级标签: multi-modal agents model training
详细标签: vision-language reasoning self-evolving agents tool integration reinforcement learning autonomous evaluation 或 搜索:

Agent0-VL:通过工具集成推理实现自我演化的视觉语言智能体 / Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning


1️⃣ 一句话总结

Agent0-VL是一个创新的视觉语言模型框架,通过在单一模型中统一求解器和验证器两个协同角色,结合工具验证和强化学习,实现了无需外部奖励的闭环自我改进。


2️⃣ 论文创新点

1. 双角色统一架构

2. 工具增强的自评估与自修复

3. 自演化推理循环(SERC)

4. 工具基础生成式验证


3️⃣ 主要结果与价值

结果亮点

实际价值


4️⃣ 术语表

📄 打开原文 PDF