arXiv submission date: 2026-06-25
📄 Abstract - scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology

Single-cell studies require analysts to convert raw measurements into specific biological claims through multi-step workflows and integration of metadata, assay context, and auxiliary evidence. Existing AI-biology benchmarks largely measure broad knowledge, executable workflows, or local analysis steps. We introduce scBench-Long, a benchmark for long-horizon single-cell biology in which agents must recover scientific conclusions from raw or near-raw data without prescribed methods. The benchmark contains 21 evaluations spanning melanoma CD8 T-cell reactivity, CD8 RNA+ATAC regulatory inference, human-monkey chimera development, KRAS-driven lung tumor aging, and lethal COVID-19 lung pathology. Tasks cover paired scRNA/TCR sequencing, RNA and chromatin profiling, cross-species transcriptomics, combinatorial scRNA-seq, single-nucleus RNA-seq, immune repertoires, ortholog maps, ligand-receptor resources, and validation evidence. Candidate claims are reproduced, reviewed, and converted into controlled answer vocabularies with deterministic grading and trajectory rubrics. Across 1,068 completed trajectories, the strongest model-harness pair passes 16/63 runs (25.4%). scBench-Long evaluates whether agents can move beyond local analysis steps and make complex scientific claims that are supported by single-cell data.
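The abstract says claims are converted into controlled answer vocabularies with deterministic grading. The sketch below illustrates what such a grader could look like; it is a minimal illustration under stated assumptions, not the benchmark's actual code. The `Task` type, the `normalize`/`grade`/`pass_rate` helpers, and the example labels are all hypothetical, and the trajectory rubrics are not modeled. It also checks the reported pass-rate arithmetic (16 of 63 runs).

```python
from dataclasses import dataclass


# Hypothetical task spec: names and fields are illustrative assumptions,
# not scBench-Long's real data format.
@dataclass(frozen=True)
class Task:
    vocabulary: frozenset[str]  # controlled answer vocabulary (already normalized)
    expected: str               # reviewed reference answer (already normalized)


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(answer.strip().lower().split())


def grade(task: Task, answer: str) -> bool:
    """Deterministic pass/fail: the answer must be in the vocabulary and equal the reference."""
    a = normalize(answer)
    if a not in task.vocabulary:
        return False  # out-of-vocabulary answers fail instead of being fuzzily matched
    return a == task.expected


def pass_rate(passed: int, total: int) -> float:
    """Percentage of runs that passed."""
    return 100.0 * passed / total


if __name__ == "__main__":
    # Hypothetical labels chosen only for illustration.
    task = Task(frozenset({"increased", "decreased", "unchanged"}), "increased")
    print(grade(task, "  Increased "))      # True: normalizes to an in-vocabulary match
    print(grade(task, "up"))                # False: not in the controlled vocabulary
    print(f"{pass_rate(16, 63):.1f}%")      # 25.4%
```

Restricting answers to a closed vocabulary is what makes exact-match grading reproducible: the same answer always yields the same verdict, with no model-based judging involved.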

Top-level tags: biology, agents, benchmark
Detailed tags: single-cell biology, long-horizon, verifiable benchmark, scientific reasoning, multi-step workflows

scBench-Long: Verifiable Benchmarking of Long-Horizon Single-Cell Biology


1️⃣ One-sentence summary

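This paper introduces scBench-Long, a benchmark that evaluates whether AI systems can work like scientists: starting from raw single-cell data and carrying out complex multi-step analyses to reach scientifically supported conclusions.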

Source: arXiv 2606.26563