Puzzle Curriculum GRPO for Vision-Centric Reasoning
1️⃣ One-Sentence Summary
This paper proposes a new method called PC-GRPO, which uses a set of self-supervised visual puzzle tasks and a dynamic difficulty curriculum to improve the reasoning ability, training stability, and final-answer accuracy of vision-language models, with no human annotations or external verifiers required.
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit and Rotation (both with binary rewards), and Jigsaw (with graded partial credit that mitigates reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early and then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
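To make the abstract's components concrete, below is a minimal sketch, not the authors' implementation, of how the puzzle rewards, the GRPO group-relative advantage, and a medium-difficulty-peaked curriculum weight could fit together. The function names, the toy 2x2 Jigsaw encoding, and the specific weight w(p) = 4p(1-p) are illustrative assumptions; the abstract only states that PatchFit/Rotation rewards are binary, that Jigsaw uses graded partial credit, and that curriculum weights peak at medium difficulty.

```python
import numpy as np

def rotation_reward(predicted_angle: int, true_angle: int) -> float:
    """Binary reward for the Rotation puzzle: 1 if the model names the
    correct rotation (e.g., 0/90/180/270 degrees), else 0."""
    return 1.0 if predicted_angle == true_angle else 0.0

def jigsaw_reward(predicted_perm: list[int], true_perm: list[int]) -> float:
    """Graded partial credit for the Jigsaw puzzle: the fraction of
    patches placed at their correct positions, densifying the
    otherwise sparse all-or-nothing signal."""
    correct = sum(p == t for p, t in zip(predicted_perm, true_perm))
    return correct / len(true_perm)

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO advantage: standardize rewards within a group of
    rollouts for the same prompt. If every rollout earns the same reward
    (all-correct or all-wrong), advantages vanish -- the failure mode the
    difficulty-aware curriculum is meant to counter."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def curriculum_weight(success_rate: float) -> float:
    """Hypothetical difficulty-aware weight peaking at medium difficulty:
    maximal when the group's empirical success rate is 0.5, and zero for
    trivially easy or hopelessly hard samples."""
    return 4.0 * success_rate * (1.0 - success_rate)

# Toy usage: 4 rollouts of a Jigsaw sample on a 2x2 patch grid.
true_perm = [0, 1, 2, 3]
rollouts = [[0, 1, 2, 3], [0, 1, 3, 2], [1, 0, 2, 3], [3, 2, 1, 0]]
rewards = np.array([jigsaw_reward(r, true_perm) for r in rollouts])
adv = group_relative_advantages(rewards)          # nonzero: graded rewards vary
weight = curriculum_weight(float((rewards == 1.0).mean()))
print(rewards, adv.round(3), round(weight, 3))
```

Note how the graded Jigsaw rewards keep the within-group variance nonzero, so the group-relative advantages carry a learning signal even when no rollout is fully correct; a binary scheme would collapse to zero advantages whenever all rollouts agree.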
Source: arXiv:2512.14944