📄 Abstract - V-Thinker: Interactive Thinking with Images

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
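
To make the two components concrete, here is a minimal, runnable Python sketch of the pipeline shape the abstract describes: a data flywheel that loops over synthesis, verification, and difficulty evolution, followed by a progressive curriculum (point-level supervision, then two RL stages). Every name here (Sample, data_evolution_flywheel, progressive_curriculum, and the stub helpers) is an illustrative assumption, not the authors' implementation; a real system would synthesize and verify with an LMM plus image tools, and the training steps would update model weights.

```python
"""Hypothetical sketch of V-Thinker's pipeline as described in the abstract.
All names and mechanisms are simplifying assumptions for illustration."""

from dataclasses import dataclass


@dataclass
class Sample:
    """An interactive visual-reasoning example (image omitted for brevity)."""
    question: str
    answer: str
    difficulty: float = 0.5  # 0 = trivial, 1 = hardest


# --- Component 1: Data Evolution Flywheel (hypothetical stubs) --------------

def synthesize(pool):
    """Diversity axis: generate re-composed variants of existing samples."""
    return [Sample(s.question + " (variant)", s.answer, s.difficulty) for s in pool]


def verify(sample):
    """Quality axis: keep only checkable samples. A real verifier would
    re-solve the task and compare answers."""
    return bool(sample.answer)


def evolve_difficulty(samples):
    """Difficulty axis: push verified samples toward harder versions."""
    return [Sample(s.question, s.answer, min(1.0, s.difficulty + 0.1))
            for s in samples]


def data_evolution_flywheel(seed, rounds=3):
    """Loop synthesize -> verify -> evolve, growing the pool each round."""
    pool = list(seed)
    for _ in range(rounds):
        candidates = [s for s in synthesize(pool) if verify(s)]
        pool += evolve_difficulty(candidates)
    return pool


# --- Component 2: Visual Progressive Training Curriculum --------------------

def train(model, data, objective):
    """Placeholder training step; a real pipeline would update weights."""
    print(f"training on {len(data)} samples with objective={objective}")
    return model


def progressive_curriculum(model, data):
    # First align perception with point-level supervision
    # (grounding reasoning steps to image coordinates).
    model = train(model, data, objective="point_level_supervision")
    # Then integrate interactive reasoning via two RL stages.
    for stage in ("rl_stage_1", "rl_stage_2"):
        model = train(model, data, objective=stage)
    return model


if __name__ == "__main__":
    seed = [Sample("What is in the red box?", "a cat")]
    dataset = data_evolution_flywheel(seed)
    progressive_curriculum(model=None, data=dataset)
```

The design point this sketch highlights is that the flywheel and the curriculum are decoupled: the data loop can keep evolving harder verified samples independently of which training stage consumes them.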

Top-level tags: multi-modal agents, model training
Detailed tags: interactive reasoning, vision-language models, reinforcement learning, data synthesis, benchmark evaluation

📄 Paper Summary

V-Thinker: Interactive Thinking with Images


1️⃣ One-Sentence Summary

This paper presents V-Thinker, a multimodal AI assistant that combines automatic data synthesis with reinforcement-learning training, enabling models to interact deeply with images and complete complex visual reasoning tasks; it outperforms existing methods across multiple benchmarks.


📄 Open the original PDF