arXiv submission date: 2026-02-18
📄 Abstract - Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet whether this translates into improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.

Top tags: biology, llm, model evaluation
Detailed tags: biosecurity, randomized controlled trial, benchmark gap, laboratory skills, dual-use

Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology


1️⃣ One-sentence summary

This study finds that, as of mid-2025, large language models (AI assistants) did not significantly increase the overall rate at which novices completed a complex biology laboratory workflow, but were associated with a modest performance improvement on individual experimental steps, revealing a gap between AI capability on simulated benchmarks and real-world laboratory utility.
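The abstract's pooled Bayesian estimate (≈1.4-fold increase, 95% CrI 0.74-2.62, with posterior probabilities of a positive effect quoted elsewhere in the abstract) can be illustrated with a minimal sketch: given posterior draws of the log fold-change in task success, the credible interval and the posterior probability of a positive effect fall out directly from the draws. The normal approximation and its parameters below are assumptions chosen to roughly match the reported interval, not a reproduction of the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of the log fold-change in task success
# under LLM assistance (normal approximation for illustration only;
# location/scale are tuned to roughly match the reported 95% CrI).
draws = rng.normal(loc=np.log(1.4), scale=0.32, size=100_000)
fold = np.exp(draws)  # back-transform to the fold-change scale

median = np.median(fold)                      # point summary
lo, hi = np.quantile(fold, [0.025, 0.975])    # 95% credible interval
p_positive = (fold > 1).mean()                # P(effect favors LLM arm)

print(f"median fold-change ~ {median:.2f}, 95% CrI [{lo:.2f}, {hi:.2f}]")
print(f"posterior probability of a positive effect ~ {p_positive:.0%}")
```

With these assumed parameters the interval lands near the reported 0.74-2.62, and the posterior probability of a positive effect is well below certainty, matching the abstract's "modest benefit, no significant completion difference" reading.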

Source: arXiv 2602.16703