MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
1️⃣ One-Sentence Summary
This paper introduces MMDeepResearch-Bench, a new benchmark designed to evaluate how multimodal deep research agents use image and text evidence to generate long-form, citation-backed research reports, together with an interpretable evaluation suite for diagnosing systematic model weaknesses in report quality, citation faithfulness, and text-image consistency.
Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
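To make the three-axis evaluation concrete, here is a minimal illustrative sketch in Python of how per-task scores from FLAE (report quality), TRACE (citation-grounded evidence alignment), and MOSAIC (text-visual integrity) could be turned into fine-grained diagnostic signals rather than a single overall score. The field names, the 0-1 score scale, and the threshold are assumptions for illustration only, not the paper's actual evaluation implementation.

```python
from dataclasses import dataclass

# Hypothetical per-task record; field names and the 0-1 scale are assumptions,
# not the schema used by MMDR-Bench.
@dataclass
class TaskEvaluation:
    task_id: str
    flae_report_quality: float        # FLAE: report quality
    trace_citation_alignment: float   # TRACE: citation-grounded evidence alignment
    mosaic_text_visual_integrity: float  # MOSAIC: text-visual integrity

def diagnose(ev: TaskEvaluation, threshold: float = 0.6) -> list[str]:
    """Flag the axes where a model underperforms, mirroring the paper's idea of
    fine-grained signals that support error diagnosis beyond one overall score."""
    issues = []
    if ev.flae_report_quality < threshold:
        issues.append("weak report quality (FLAE)")
    if ev.trace_citation_alignment < threshold:
        issues.append("poor citation discipline (TRACE)")
    if ev.mosaic_text_visual_integrity < threshold:
        issues.append("broken multimodal grounding (MOSAIC)")
    return issues

# Example: strong prose but weak evidence use -- the trade-off the paper highlights.
ev = TaskEvaluation("task_001", 0.85, 0.40, 0.35)
print(diagnose(ev))
# ['poor citation discipline (TRACE)', 'broken multimodal grounding (MOSAIC)']
```

Reporting the three signals separately, as in this sketch, is what lets the benchmark show that fluent report writing can coexist with unfaithful citations or weak multimodal grounding.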
Source: arXiv: 2601.12346