ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
1️⃣ One-Sentence Summary
This paper introduces ScholarGym, a simulation testbed built on a static database of 570K papers and more than 2,500 expert-annotated questions. It addresses a core difficulty in evaluating AI systems for deep literature research: reliance on live web tools makes results irreproducible and cross-system comparisons unfair.
Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
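To make the decoupling concrete, here is a minimal Python sketch of how such a simulation environment could separate query planning, tool invocation, and relevance assessment over a static corpus with deterministic retrieval. All names (`Paper`, `retrieve`, `run_episode`, the `plan` and `assess` callables) and the toy scoring function are assumptions for illustration, not the paper's actual API or retrieval method.

```python
# Hypothetical sketch of a ScholarGym-style evaluation loop (assumed names,
# not the paper's interface). It illustrates the three decoupled stages:
# query planning, tool invocation, and relevance assessment.
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    paper_id: str
    title: str
    abstract: str

# Static corpus: a fixed snapshot, so repeated runs see identical documents.
CORPUS = [
    Paper("p1", "Tool-augmented LLMs", "Iterative query planning with external tools."),
    Paper("p2", "Dense retrieval at scale", "Benchmarking retrieval over static corpora."),
]

def retrieve(query: str, k: int = 5) -> list[Paper]:
    """Deterministic retrieval: score by token overlap, break ties by paper_id."""
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set((p.title + " " + p.abstract).lower().split())), p)
        for p in CORPUS
    ]
    scored.sort(key=lambda sp: (-sp[0], sp[1].paper_id))  # stable, reproducible order
    return [p for score, p in scored[:k] if score > 0]

def run_episode(question: str, ground_truth: set[str], plan, assess, max_iters: int = 3):
    """Iterative deep-research loop: plan a query, invoke the tool, assess, refine."""
    selected: set[str] = set()
    query = question
    for _ in range(max_iters):
        results = retrieve(query)                                         # tool invocation
        selected |= {p.paper_id for p in results if assess(question, p)}  # relevance assessment
        query = plan(question, results, selected)                         # query refinement
    recall = len(selected & ground_truth) / max(len(ground_truth), 1)
    return selected, recall
```

Because the corpus is frozen and retrieval has a fixed tie-breaking rule, every run of `run_episode` with the same `plan` and `assess` functions yields identical results, which is what enables fair cross-system comparison.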
Source: arXiv: 2601.21654