
arXiv submission date: 2026-04-14
📄 Abstract - Towards Long-horizon Agentic Multimodal Search

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in this https URL.
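The core mechanism described above — offloading visual assets to an external file system, referencing them in the agent's context by lightweight textual identifiers (UIDs), and reloading them on demand via a fetch-image tool — can be sketched as follows. This is a minimal illustration under assumptions; the class and method names (`VisualAssetStore`, `offload`, `fetch_image`) are hypothetical and not taken from the paper's released code.

```python
import os
import tempfile
import uuid

class VisualAssetStore:
    """Sketch of a file-based visual representation store.

    Images are written to disk and the agent's context holds only a
    short UID string instead of raw image tokens, which keeps context
    size bounded over long search horizons.
    """

    def __init__(self, root=None):
        # Assumed layout: one flat directory of image files keyed by UID.
        self.root = root or tempfile.mkdtemp(prefix="lmm_assets_")

    def offload(self, image_bytes: bytes) -> str:
        """Save an image to the file system and return its UID."""
        uid = f"img_{uuid.uuid4().hex[:8]}"
        with open(os.path.join(self.root, uid), "wb") as f:
            f.write(image_bytes)
        return uid  # only this short string enters the agent context

    def fetch_image(self, uid: str) -> bytes:
        """On-demand loading: reload the image only when the agent asks."""
        with open(os.path.join(self.root, uid), "rb") as f:
            return f.read()

# Usage: the context stores only `uid`; pixels stay on disk until fetched.
store = VisualAssetStore()
uid = store.offload(b"\x89PNG...")  # placeholder bytes, not a real PNG
assert store.fetch_image(uid) == b"\x89PNG..."
```

The design point this sketch captures is the trade: each stored image costs the context only a short identifier, and the fetch tool lets the agent actively decide which images to re-perceive later, rather than carrying every image through every turn.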

Top-level tags: agents, multi-modal, model training
Detailed tags: multimodal search, long-horizon reasoning, file-based representation, data synthesis, visual grounding

迈向长视野的自主多模态搜索 / Towards Long-horizon Agentic Multimodal Search


1️⃣ One-sentence summary

This paper proposes a new framework named LMM-Searcher that stores visual information in external files and manages it through lightweight textual identifiers, addressing the problems of heterogeneous information and high computational cost that multimodal agents face in long-horizon, multi-step search tasks, and thereby enabling more efficient and more accurate long-sequence multimodal search.

Source: arXiv:2604.12890