arXiv submission date: 2026-01-01
📄 Abstract - CPPO: Contrastive Perception for Vision Language Policy Optimization

We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.
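To make the detection step concrete, the sketch below flags "perception tokens" as those whose output-distribution entropy shifts noticeably between a clean and a perturbed input image. This is only a minimal illustration under assumptions: the threshold rule, the `tau` value, and the function names are hypothetical, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): flag perception tokens by the
# entropy shift of the policy's per-token output distribution when the
# input image is perturbed. The threshold rule and tau are assumptions.
import torch

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the output distribution; logits: [seq_len, vocab]."""
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)  # [seq_len]

def detect_perception_tokens(logits_clean: torch.Tensor,
                             logits_perturbed: torch.Tensor,
                             tau: float = 0.5) -> torch.Tensor:
    """Boolean mask of tokens whose entropy shifts by more than tau
    when the image is perturbed (hypothetical threshold rule)."""
    shift = (token_entropies(logits_perturbed) -
             token_entropies(logits_clean)).abs()
    return shift > tau

# Toy usage: random logits stand in for two forward passes of a VLM
seq_len, vocab = 16, 32000
clean = torch.randn(seq_len, vocab)
perturbed = clean + 0.8 * torch.randn(seq_len, vocab)
mask = detect_perception_tokens(clean, perturbed)
print(f"{mask.sum().item()} of {seq_len} tokens flagged as perception tokens")
```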

Top-level tags: multi-modal, model training, agents
Detailed tags: vision-language models, reinforcement learning, contrastive learning, policy optimization, perception-reasoning

CPPO: Contrastive Perception for Vision Language Policy Optimization


1️⃣ One-sentence summary

This paper proposes a new method called CPPO, which automatically identifies visual perception information by analyzing how the model's outputs change under image perturbations and introduces a contrastive loss to optimize multimodal model training, thereby improving the overall reasoning ability of vision-language models more efficiently, without requiring extra models or complex annotations.
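To illustrate the contrastive objective mentioned above, here is a minimal sketch of a contrastive perception loss in the spirit of the CPL: it penalizes divergence from the clean output distribution under an information-preserving perturbation and requires at least a margin of divergence under an information-removing one, restricted to detected perception tokens. The KL-based divergence, the hinge with `margin`, and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
# Minimal sketch (assumptions, not the paper's exact CPL): contrast
# information-preserving vs. information-removing image perturbations
# on the detected perception tokens only.
import torch
import torch.nn.functional as F

def contrastive_perception_loss(logits_clean, logits_preserve, logits_remove,
                                perception_mask, margin: float = 1.0):
    """logits_*: [seq_len, vocab]; perception_mask: [seq_len] bool."""
    log_p_clean = torch.log_softmax(logits_clean, dim=-1)
    log_p_pres = torch.log_softmax(logits_preserve, dim=-1)
    log_p_rem = torch.log_softmax(logits_remove, dim=-1)

    # Per-token KL(clean || perturbed) for both perturbation types.
    kl_pres = F.kl_div(log_p_pres, log_p_clean,
                       log_target=True, reduction="none").sum(-1)
    kl_rem = F.kl_div(log_p_rem, log_p_clean,
                      log_target=True, reduction="none").sum(-1)

    # Consistency: small divergence under information-preserving perturbations.
    # Sensitivity: at least `margin` divergence under information-removing ones.
    consistency = kl_pres
    sensitivity = torch.relu(margin - kl_rem)

    mask = perception_mask.float()
    return ((consistency + sensitivity) * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random logits and a random perception-token mask
seq_len, vocab = 16, 32000
clean = torch.randn(seq_len, vocab)
loss = contrastive_perception_loss(clean,
                                   clean + 0.1 * torch.randn_like(clean),
                                   clean + 2.0 * torch.randn_like(clean),
                                   torch.rand(seq_len) > 0.5)
print(loss.item())
```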

Source: arXiv 2601.00501