Abstract - Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
Discovering causal regularities and applying them to build functional systems (the discovery-to-application loop) is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling the target parameters substantially increases construction complexity and the knowledge required, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we decompose the loop into four capacities (knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application) and design targeted interventions whose marginal contributions serve as proxies for the corresponding gaps. Our analysis reveals that although general knowledge application remains the largest gap across all models, knowledge gap identification is becoming a major hurdle for frontier models, indicating that the bottleneck for current AI is shifting from solving problems right to raising the right problems. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.
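To make the diagnostic idea concrete, here is a minimal sketch of how a marginal contribution per intervention could be computed. It assumes (this is not stated in the abstract) that each intervention is evaluated separately against the same baseline and that the contribution is simply the difference in success rate. The function names and the toy trial outcomes are hypothetical placeholders, not results or code from the paper.

```python
from typing import Dict, List

def success_rate(outcomes: List[bool]) -> float:
    """Fraction of trials that succeeded."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)

def marginal_contributions(
    baseline: List[bool],
    with_intervention: Dict[str, List[bool]],
) -> Dict[str, float]:
    """Success-rate gain of each intervention over the shared baseline.

    Assumption: one intervention per capacity, each measured on its own.
    """
    base = success_rate(baseline)
    return {cap: success_rate(runs) - base
            for cap, runs in with_intervention.items()}

def largest_gap(contribs: Dict[str, float]) -> str:
    """Capacity whose intervention helps most, read as the largest gap."""
    return max(contribs, key=contribs.get)

# Toy outcomes (hypothetical, for illustration only).
baseline_runs = [True, False, False, False]
intervention_runs = {
    "knowledge_gap_identification": [True, True, False, False],
    "experimental_discovery": [True, False, False, False],
    "knowledge_consolidation": [True, False, False, False],
    "knowledge_application": [True, True, True, False],
}

contribs = marginal_contributions(baseline_runs, intervention_runs)
biggest = largest_gap(contribs)
```

Under this reading, a large gain from supplying help for one capacity suggests the unaided agent is weak in that capacity. The actual interventions and how their contributions are measured are defined in the paper itself.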
1️⃣ One-Sentence Summary
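This paper designs a series of Minecraft tasks in which agents must discover regularities on their own and apply them to light specified redstone lamps, and uses these tasks to evaluate frontier AI models such as GPT-5.2. The models reach a success rate of only about 26%, and their main bottleneck is shifting from "how to solve problems correctly" to "how to raise the right problems": the ability to identify knowledge gaps is becoming the new key challenge.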