📄 Paper Summary
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
1️⃣ One-Sentence Summary
This work proposes SAIL-RL, a reinforcement-learning framework whose dual-reward mechanism teaches multimodal large language models to avoid overthinking on simple tasks and to reason thoroughly on complex ones, significantly improving reasoning ability and reliability.
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves performance on reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieves competitive results against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at this https URL.
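The abstract names two reward signals but not their exact form. A minimal sketch of how such a dual-reward scheme could be combined is shown below; every function, criterion weight, and the mixing coefficient `alpha` are illustrative assumptions, not the paper's actual definitions.

```python
# Hypothetical sketch of a dual-reward combination in the spirit of SAIL-RL.
# The paper's real reward models are learned/evaluated differently; all
# scoring functions and weights here are assumed for illustration only.

def thinking_reward(factual_grounding: float,
                    logical_coherence: float,
                    answer_consistency: float) -> float:
    """Score reasoning quality as the mean of three criteria in [0, 1]
    (equal weights are an assumption)."""
    return (factual_grounding + logical_coherence + answer_consistency) / 3.0

def judging_reward(used_deep_reasoning: bool, task_is_complex: bool) -> float:
    """Reward picking the thinking mode that matches task difficulty:
    deep reasoning for complex tasks, direct answering for simple ones."""
    return 1.0 if used_deep_reasoning == task_is_complex else 0.0

def total_reward(think_r: float, judge_r: float, alpha: float = 0.5) -> float:
    """Blend the two signals with an assumed mixing weight alpha."""
    return alpha * think_r + (1.0 - alpha) * judge_r

# Example: a complex task answered with deep, mostly sound reasoning.
r = total_reward(thinking_reward(0.9, 0.8, 1.0), judging_reward(True, True))
```

Under this sketch, a model that answers a simple question directly earns the full Judging Reward without needing a long chain of thought, which is the adaptive behavior the paper targets.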