arXiv submission date: 2026-03-10
📄 Abstract - MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Self-evolution has emerged as a key paradigm for improving foundation models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained with Group Relative Policy Optimization (GRPO), using carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
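The three-role loop described in the abstract can be sketched as follows. This is a minimal, illustrative stand-in, not the paper's implementation: the Proposer, Coder, and Solver are toy functions (in MM-Zero all three are the same VLM), and the reward combines the three signals the abstract names (execution feedback, visual verification, difficulty balancing) in a hypothetical way. All function names and the weighting scheme are assumptions for illustration.

```python
import random

def proposer(seed: int) -> dict:
    """Propose an abstract visual concept plus a question about it."""
    rng = random.Random(seed)
    n = rng.randint(3, 8)
    return {"concept": f"{n} circles in a row",
            "question": "How many circles are shown?",
            "answer": n}

def coder(task: dict) -> str:
    """Translate the concept into renderable code (here: a tiny SVG string)."""
    n = int(task["concept"].split()[0])
    circles = "".join(
        f'<circle cx="{20 + 40 * i}" cy="20" r="10"/>' for i in range(n)
    )
    return f'<svg xmlns="http://www.w3.org/2000/svg">{circles}</svg>'

def solver(svg: str) -> int:
    """Reason over the rendered artifact (here: count the drawn circles)."""
    return svg.count("<circle")

def reward(task: dict, svg: str, prediction: int, group_accuracy: float) -> float:
    """Toy combination of the three reward signals named in the abstract."""
    executed = 1.0 if svg.startswith("<svg") else 0.0       # execution feedback
    verified = 1.0 if prediction == task["answer"] else 0.0  # visual verification
    # difficulty balancing: tasks the solver group almost always solves
    # (or almost always fails) contribute less signal than ~50% tasks
    difficulty = 1.0 - abs(group_accuracy - 0.5) * 2.0
    return executed * (verified + difficulty) / 2.0

# One self-evolution step: propose -> render -> solve -> score.
task = proposer(seed=0)
svg = coder(task)
pred = solver(svg)
r = reward(task, svg, pred, group_accuracy=0.5)
```

In the actual framework, the scalar reward would feed a GRPO update for each role's policy; here it simply scores a single propose-render-solve rollout.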

Top-level tags: multi-modal model training agents
Detailed tags: vision language models self-evolution reinforcement learning zero-shot learning multimodal reasoning

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data


1️⃣ One-Sentence Summary

This paper proposes a framework called MM-Zero that lets a vision language model improve without any initial image data: a single base model plays three roles (Proposer, Coder, and Solver) that collaborate and co-evolve, significantly boosting its performance on multimodal reasoning tasks.

Source: arXiv 2603.09206