Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
1️⃣ One-sentence summary
This paper proposes a new method called PEPO, which performs a fine-grained analysis of the per-token dynamics of multimodal reasoning and optimizes the model using a perception prior together with an exploration mechanism, yielding stable and significant performance gains across a range of vision-language reasoning tasks.
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: this https URL
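The abstract describes PEPO's core mechanism only at a high level: a perception prior derived from hidden-state similarity, blended with token entropy through a smooth gate to yield per-token advantages. The following is a minimal sketch of what such a weighting could look like; the function name, the use of max cosine similarity against visual-token states, and the sigmoid gate centered at the mean are all assumptions for illustration, not the paper's actual formulas.

```python
import numpy as np

def token_level_advantages(hidden_states, visual_states, token_probs,
                           base_advantage, tau=1.0):
    """Hypothetical PEPO-style token-level advantage weighting.

    hidden_states:  (T, d) hidden state of each generated token
    visual_states:  (V, d) hidden states of the visual (image) tokens
    token_probs:    (T, K) next-token distribution at each step
    base_advantage: scalar sequence-level advantage (e.g. from GRPO)
    """
    # Perception prior (assumed form): max cosine similarity between each
    # generated token's hidden state and any visual token's hidden state.
    h = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)
    v = visual_states / np.linalg.norm(visual_states, axis=1, keepdims=True)
    perception = (h @ v.T).max(axis=1)                     # (T,)

    # Token entropy as the exploration signal, normalized to [0, 1].
    ent = -(token_probs * np.log(token_probs + 1e-9)).sum(axis=1)
    ent = ent / np.log(token_probs.shape[1])               # (T,)

    # Smooth gate (assumed sigmoid): strongly grounded tokens lean on the
    # perception prior, weakly grounded ones on the entropy signal.
    gate = 1.0 / (1.0 + np.exp(-(perception - perception.mean()) / tau))
    weight = gate * perception + (1.0 - gate) * ent        # (T,)

    # Modulate the shared sequence-level advantage per token.
    return base_advantage * weight                         # (T,)
```

Since the weighting only rescales a group-relative advantage that already exists in GRPO/DAPO-style pipelines, a mechanism of this shape would indeed require no extra supervision or auxiliary branch, consistent with the abstract's claim.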
Source: arXiv: 2603.22847