📄 Abstract - Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity, with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench, and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU). Similarly, despite containing no video or embodied visual data, we observe notable gains on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.
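The abstract only names the two trace-synthesis steps, so the following is a rough sketch of how such a VLM-then-reasoning-LLM pipeline could be wired together. It is a minimal illustration under assumptions, not the paper's released code: the `vlm` and `reasoning_llm` callables, the prompts, and the `Answer:` convention are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interfaces: any VLM / reasoning-LLM client can be
# plugged in here; these names are illustrative, not the paper's API.
VLMFn = Callable[[str, str], str]   # (image_path, prompt) -> text
LLMFn = Callable[[str], str]        # (prompt) -> text

@dataclass
class TrainingExample:
    image_path: str
    question: str
    trace: str    # long chain-of-thought reasoning
    answer: str

def synthesize_trace(image_path: str, question: str,
                     vlm: VLMFn, reasoning_llm: LLMFn) -> TrainingExample:
    """Two-stage trace synthesis, loosely following the recipe sketched in
    the abstract: a VLM grounds the question in the image, then a text-only
    reasoning LLM expands that grounding into a long CoT trace."""
    # Stage 1: the VLM extracts the visual evidence the question needs.
    grounding = vlm(
        image_path,
        f"Describe the visual details relevant to answering: {question}",
    )
    # Stage 2: a reasoning LLM produces a long CoT from the grounded
    # description alone (it never sees the image).
    trace = reasoning_llm(
        f"Question: {question}\n"
        f"Visual evidence: {grounding}\n"
        "Think step by step, revisiting and verifying intermediate "
        "conclusions, then state the final answer after 'Answer:'."
    )
    answer = trace.rsplit("Answer:", 1)[-1].strip()
    return TrainingExample(image_path, question, trace, answer)
```

Keeping the reasoning LLM blind to the image forces it to reason over the VLM's grounded evidence, which is one plausible way to transfer the long, non-linear CoT behaviors of frontier text reasoners into visual traces.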

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: visual reasoning, data synthesis, reinforcement learning, vision-language models, benchmark evaluation

📄 Paper Summary

Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale


1️⃣ One-Sentence Summary

This paper proposes a new method for building a large-scale, high-quality visual reasoning dataset: a two-stage synthesis framework produces over a million questions with complex reasoning chains, and experiments show that models trained on this data surpass existing state-of-the-art models on a range of visual and cross-modal reasoning tasks.
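Finding (ii) in the abstract, that staged offline RL can match online RL at lower compute, suggests an iterated sample-rank-train loop over the released preference-style data. The sketch below shows one such offline round; the sampling, scoring, and preference-training callables, and the best-vs-worst pairing heuristic, are assumptions for illustration, not the paper's actual recipe.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces; any SFT'd VLM checkpoint and any preference
# trainer (e.g. a DPO implementation) could stand in for these.
SampleFn = Callable[[str, int], List[str]]    # (question, n) -> n candidate traces
ScoreFn = Callable[[str, str], float]         # (question, trace) -> reward
TrainFn = Callable[[List[Tuple[str, str, str]]], None]  # (question, chosen, rejected)

def offline_rl_stage(questions: List[str], sample: SampleFn,
                     score: ScoreFn, train_dpo: TrainFn,
                     n_candidates: int = 8) -> None:
    """One round of staged offline RL: sample traces from the current
    checkpoint, rank them with a reward signal, and fit preference pairs.
    Repeating this loop with refreshed samples approximates online RL at
    a fraction of the compute, per the abstract's finding (ii)."""
    pairs = []
    for q in questions:
        candidates = sample(q, n_candidates)
        ranked = sorted(candidates, key=lambda t: score(q, t), reverse=True)
        # Best-vs-worst pairing is one common heuristic for building
        # preference pairs (an assumption here, not the paper's recipe).
        pairs.append((q, ranked[0], ranked[-1]))
    random.shuffle(pairs)
    train_dpo(pairs)
```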

