arXiv submission date: 2026-05-08
📄 Abstract - Structured Role-Aware Policy Optimization for Multimodal Reasoning

Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.
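The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the exponential weighting, the mean-based baseline, and every function and parameter name below are assumptions chosen only to show how a sequence-level GRPO advantage could be turned into positive, role-aware token-level advantages that keep the original sign.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_aware_weights(roles, visual_dependency, consistency):
    """Return positive per-token weights whose mean over the trajectory is 1.

    roles[i] is "perception" or "reasoning".
    visual_dependency[i] is a hypothetical signal for perception tokens, e.g.
    log p(token | original image) - log p(token | corrupted image).
    consistency[i] is a hypothetical signal for reasoning tokens, e.g. how much
    the token's likelihood depends on the generated perception text.
    """
    raw = []
    for role, dep, cons in zip(roles, visual_dependency, consistency):
        signal = dep if role == "perception" else cons
        raw.append(math.exp(signal))  # assumption: exp keeps every weight > 0
    baseline = mean(raw)  # assumption: one shared trajectory-level baseline
    return [w / baseline for w in raw]


def token_advantages(sequence_advantage, weights):
    """Scale the sequence-level advantage per token; positive weights keep its sign."""
    return [sequence_advantage * w for w in weights]


if __name__ == "__main__":
    advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
    weights = role_aware_weights(
        roles=["perception", "perception", "reasoning"],
        visual_dependency=[0.8, 0.1, 0.0],
        consistency=[0.0, 0.0, 0.5],
    )
    print(token_advantages(advantages[0], weights))
```

Because the weights are strictly positive and average to 1 over the trajectory, this sketch changes only the relative update magnitude across tokens. The reward and the direction of the update stay those of plain GRPO, which is the property the abstract emphasizes.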

Top-level tags: multi-modal, reinforcement learning, model training
Detailed tags: group relative policy optimization, vision-language models, credit assignment, reasoning, policy optimization

Structured Role-Aware Policy Optimization for Multimodal Reasoning


1️⃣ One-sentence summary

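This paper proposes a structured role-aware policy optimization method that separates the perception tokens and reasoning tokens in a multimodal response and assigns them different weights. Without requiring an additional evaluation model, it improves how well large vision-language models use visual evidence during reasoning and how reliable their answers are.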

Source: arXiv: 2605.07274