
🤖 System
📄 Abstract - Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
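To make the "interleaved thinking" idea concrete, below is a minimal, hypothetical sketch of an agent loop that alternates between image operations and web search, feeding each tool's observation back into the context before planning the next step. All names here (`InterleavedAgent`, `plan_next_step`, the `crop`/`search` stubs, the step budget) are illustrative assumptions, not Skywork-R1V4's actual API or planner.

```python
# Hypothetical sketch, assuming a simple alternating planner: a real model would
# decide the next tool from its own reasoning trace rather than a fixed schedule.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Step:
    tool: str          # e.g. "crop", "zoom", "search"
    argument: str      # image-region spec or search query
    observation: str   # tool output interleaved back into the context

@dataclass
class InterleavedAgent:
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 12                      # abstract reports >10 tool calls emerging at inference
    trajectory: List[Step] = field(default_factory=list)

    def plan_next_step(self, question: str) -> Optional[Step]:
        """Toy planner: alternate an image operation and a search until the budget is spent."""
        if len(self.trajectory) >= self.max_steps:
            return None
        tool = "crop" if len(self.trajectory) % 2 == 0 else "search"
        return Step(tool=tool, argument=f"{question} (step {len(self.trajectory)})", observation="")

    def run(self, question: str) -> List[Step]:
        # Execute tools one by one; each observation becomes context for the next decision.
        while (step := self.plan_next_step(question)) is not None:
            step.observation = self.tools[step.tool](step.argument)
            self.trajectory.append(step)
        return self.trajectory

# Stub tools standing in for real image-manipulation and web-search backends.
agent = InterleavedAgent(tools={
    "crop": lambda arg: f"<cropped image region for: {arg}>",
    "search": lambda arg: f"<top search snippets for: {arg}>",
}, max_steps=4)

for s in agent.run("Which landmark appears in the top-left corner of the photo?"):
    print(s.tool, "->", s.observation)
```

In the paper's setting, trajectories produced by such a loop would additionally be filtered for planning-execution consistency before being used for supervised fine-tuning; the sketch above only illustrates the interleaving of visual operations with retrieval.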

Top-level tags: agents, multi-modal, model training
Detailed tags: multimodal agents, interleaved reasoning, supervised fine-tuning, tool usage, visual planning

Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch


1️⃣ One-sentence summary

This paper presents Skywork-R1V4, a new multimodal agent model that tightly integrates image manipulation with web search through interleaved reasoning; trained on only a small amount of high-quality data, it surpasses existing state-of-the-art models at solving complex, multi-step tasks.


📄 Open the original PDF