arXiv submission date: 2025-12-18
📄 Abstract - Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
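The abstract does not spell out how the TTRL reward is computed, so the following is only a minimal sketch of one plausible reading: a retrieval-augmented novelty reward that scores a hypothesis by how dissimilar it is to its nearest neighbors in a corpus of prior work, which requires no reference answer at inference time. Every name here (`novelty_reward`, `select_most_novel`, `embed`, the parameter `k`) is a hypothetical illustration, not the authors' actual API.

```python
# Sketch of a test-time, retrieval-augmented novelty reward (assumption:
# novelty = embedding distance from retrieved prior work; not the paper's
# confirmed implementation).
from typing import Callable, List
import numpy as np

def novelty_reward(
    hypothesis: str,
    corpus_embeddings: np.ndarray,       # (N, d) unit-norm embeddings of prior literature
    embed: Callable[[str], np.ndarray],  # text -> (d,) unit-norm embedding
    k: int = 5,
) -> float:
    """Reward = 1 - mean cosine similarity to the k nearest prior works.

    A hypothesis far from everything already published scores near 1; a
    near-duplicate of existing work scores near 0. Because the signal comes
    from retrieval alone, no gold reference answer is needed at inference.
    """
    q = embed(hypothesis)
    sims = corpus_embeddings @ q          # cosine similarities (unit-norm inputs)
    top_k = np.sort(sims)[-k:]            # k most similar prior works
    return float(1.0 - top_k.mean())

def select_most_novel(
    candidates: List[str],
    corpus_embeddings: np.ndarray,
    embed: Callable[[str], np.ndarray],
) -> str:
    """Best-of-n at test time: keep the candidate with the highest reward."""
    return max(candidates, key=lambda h: novelty_reward(h, corpus_embeddings, embed))
```

The same scalar could just as well drive a policy-gradient update of the generator at inference time rather than best-of-n selection; the key property either way is that the reward is reference-free.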

Top tags: llm, benchmark, model evaluation
Detailed tags: scientific general intelligence, workflow evaluation, benchmarking, test-time reinforcement learning, multimodal reasoning

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows


1️⃣ One-Sentence Summary

This paper proposes a definition of Scientific General Intelligence grounded in the Practical Inquiry Model and, through a benchmark of over a thousand expert-curated cross-disciplinary samples, systematically evaluates how well large language models emulate scientists' end-to-end workflows (deep research, experiment design, and more). The evaluation reveals notable shortcomings in feasibility, detail, and reasoning, and the paper introduces a test-time reinforcement learning method that improves hypothesis novelty without requiring reference answers.

Source: arXiv:2512.16969