arXiv submission date: 2026-04-13
📄 Abstract - PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers

Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal, multi-document benchmark designed for agentic deep research. PaperScope offers three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. (2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs an optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides both a rigorous benchmark and a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.

Top tags: llm, benchmark, multi-modal
Detailed tags: scientific reasoning, multi-document retrieval, knowledge graphs, agent evaluation, long-context understanding

PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers


1️⃣ One-Sentence Summary

This paper introduces a new benchmark called PaperScope, which integrates text, tables, and figures from thousands of AI papers to systematically evaluate AI models' ability to conduct deep scientific reasoning and research over multi-document, multi-modal information, and finds that even today's most advanced models still struggle with such complex tasks.

Source: arXiv:2604.11307