ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
1️⃣ One-Sentence Summary
This paper introduces ScholarGym, a simulation testbed built on a static database of 570K papers and more than 2,500 expert-annotated questions. It addresses a core difficulty in evaluating AI systems for deep literature research: reliance on live web tools makes results irreproducible and cross-system comparisons unfair.
Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
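To make the decoupling concrete, here is a minimal Python sketch of how such a simulation environment could separate query planning, tool invocation, and relevance assessment over a static corpus with deterministic retrieval. All names (`Paper`, `retrieve`, `run_episode`, the `plan` and `assess` callables) and the toy scoring function are assumptions for illustration, not the paper's actual API or retrieval method.

```python
# Hypothetical sketch of a ScholarGym-style evaluation loop (assumed names,
# not the paper's interface). It illustrates the three decoupled stages:
# query planning, tool invocation, and relevance assessment.
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    paper_id: str
    title: str
    abstract: str

# Static corpus: a fixed snapshot, so repeated runs see identical documents.
CORPUS = [
    Paper("p1", "Tool-augmented LLMs", "Iterative query planning with external tools."),
    Paper("p2", "Dense retrieval at scale", "Benchmarking retrieval over static corpora."),
]

def retrieve(query: str, k: int = 5) -> list[Paper]:
    """Deterministic retrieval: score by token overlap, break ties by paper_id."""
    q_tokens = set(query.lower().split())
    scored = [
        (len(q_tokens & set((p.title + " " + p.abstract).lower().split())), p)
        for p in CORPUS
    ]
    scored.sort(key=lambda sp: (-sp[0], sp[1].paper_id))  # stable, reproducible order
    return [p for score, p in scored[:k] if score > 0]

def run_episode(question: str, ground_truth: set[str], plan, assess, max_iters: int = 3):
    """Iterative deep-research loop: plan a query, invoke the tool, assess, refine."""
    selected: set[str] = set()
    query = question
    for _ in range(max_iters):
        results = retrieve(query)                                         # tool invocation
        selected |= {p.paper_id for p in results if assess(question, p)}  # relevance assessment
        query = plan(question, results, selected)                         # query refinement
    recall = len(selected & ground_truth) / max(len(ground_truth), 1)
    return selected, recall
```

Because the corpus is frozen and retrieval has a fixed tie-breaking rule, every run of `run_episode` with the same `plan` and `assess` functions yields identical results, which is what enables fair cross-system comparison.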
Source: arXiv: 2601.21654