Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
1️⃣ One-Sentence Summary
This paper proposes Omni-R1, a unified generative multimodal reasoning framework that integrates diverse visual reasoning skills by generating intermediate images during the reasoning process, allowing it to flexibly handle a wide range of multimodal tasks.
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
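To make the two-stage recipe concrete, below is a minimal PyTorch sketch of how an SFT objective with a perception alignment term and an RL reward with a perception bonus could be wired together. This is not the authors' implementation: the function names, the cosine-similarity form of the alignment and reward terms, and the weights `alpha`/`beta` are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F


def sft_loss(text_logits, text_targets, pred_img_feats, ref_img_feats, alpha=0.5):
    """Stage 1 (SFT): next-token loss plus a perception alignment term that pulls
    features of generated intermediate images toward reference visualizations.
    The cosine form and the weight `alpha` are assumptions for illustration."""
    lm = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                         text_targets.reshape(-1))
    align = 1.0 - F.cosine_similarity(pred_img_feats, ref_img_feats, dim=-1).mean()
    return lm + alpha * align


def rl_reward(answer_correct, pred_img_feats, ref_img_feats, beta=0.3):
    """Stage 2 (RL): task reward plus a perception reward scoring the generated
    intermediate images; both the scoring function and `beta` are placeholders."""
    task_r = 1.0 if answer_correct else 0.0
    percep_r = F.cosine_similarity(pred_img_feats, ref_img_feats, dim=-1).mean().item()
    return task_r + beta * percep_r


if __name__ == "__main__":
    # Dummy shapes: batch of 2 sequences, length 8, vocab 100, image-feature dim 64.
    logits = torch.randn(2, 8, 100)
    targets = torch.randint(0, 100, (2, 8))
    pred_feats = torch.randn(2, 64)
    ref_feats = torch.randn(2, 64)
    print("SFT loss:", sft_loss(logits, targets, pred_feats, ref_feats).item())
    print("RL reward:", rl_reward(True, pred_feats, ref_feats))
```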
Source: arXiv: 2601.09536