arXiv submission date: 2026-03-10
📄 Abstract - RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
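
The abstract describes replacing a single scalar reward with a judge's verdicts on explicit rubric criteria. A minimal sketch of that idea follows, assuming binary per-criterion judgments and a weighted-average aggregation; the abstract specifies neither. The names `Criterion`, `rubric_reward`, and `keyword_judge` are hypothetical, and the keyword matcher is only a toy stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item. The weight is an assumption, not taken from the paper."""
    description: str
    weight: float = 1.0


def rubric_reward(
    caption: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Return the weighted fraction of criteria the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        return 0.0  # empty or zero-weight rubric carries no signal
    earned = sum(c.weight for c in rubric if judge(caption, c))
    return earned / total


def keyword_judge(caption: str, criterion: Criterion) -> bool:
    """Toy stand-in for an LLM judge: checks whether the phrase appears."""
    return criterion.description.lower() in caption.lower()


if __name__ == "__main__":
    rubric = [
        Criterion("red umbrella", weight=2.0),
        Criterion("rain"),
        Criterion("bicycle"),
    ]
    caption = "A woman holds a red umbrella in the rain."
    print(rubric_reward(caption, rubric, keyword_judge))
```

In an RL loop, a score like this would be computed per sampled caption against that image's own rubric, which is what makes the reward sample-specific rather than a single global quality score.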

Top-level tags: computer vision, natural language processing, model training
Detailed tags: dense image captioning, reinforcement learning, vision-language models, llm-guided evaluation, reward modeling

RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning


1️⃣ One-sentence summary

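This paper proposes RubiCap, a new method that uses large language models to automatically write detailed rubrics that guide reinforcement learning, so that higher-quality and more diverse image captions can be generated efficiently without expensive human annotation.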

Source: arXiv: 2603.09160