FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
1️⃣ One-Sentence Summary
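This paper introduces FlyBench, a new benchmark for evaluating how well AI agents can act like expert curators by automatically searching, reading, and organizing structured knowledge about fruit fly genes from a large body of scientific literature. It finds that multi-agent architectures perform better but still fall far short of expert level, and it points to directions for future AI-assisted scientific research.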
Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
Source: arXiv: 2602.09163