SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

📄 Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at this https URL.
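The abstract does not spell out how the voting-based multi-model consensus is computed. As a minimal sketch of one plausible reading, assume each model proposes a set of knowledge-unit labels for a custom requirement and a unit is kept when a strict majority of models propose it. The function name `consensus_units`, the `min_votes` parameter, and the example unit labels are all hypothetical and are not taken from the SciCustom source code.

```python
from collections import Counter


def consensus_units(votes_per_model, min_votes=None):
    """Return the knowledge units proposed by at least `min_votes` models.

    `votes_per_model` is a list with one entry per model; each entry is an
    iterable of knowledge-unit labels that the model judged relevant.
    The strict-majority default is an assumption, not the paper's stated rule.
    """
    n_models = len(votes_per_model)
    if min_votes is None:
        min_votes = n_models // 2 + 1  # strict majority of the models
    # Deduplicate within each model so a model can vote for a unit only once.
    counts = Counter(unit for votes in votes_per_model for unit in set(votes))
    return sorted(unit for unit, c in counts.items() if c >= min_votes)


# Hypothetical votes from three models for a chemistry-oriented requirement.
votes = [
    ["organic_synthesis", "reaction_kinetics"],
    ["organic_synthesis", "spectroscopy"],
    ["organic_synthesis", "reaction_kinetics", "thermodynamics"],
]
print(consensus_units(votes))  # ['organic_synthesis', 'reaction_kinetics']
```

Raising `min_votes` to the number of models turns the rule into unanimity, which trades recall of relevant units for precision.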


1️⃣ One-sentence summary

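This study proposes the SciCustom framework, which organizes scientific knowledge into knowledge units of controllable granularity and uses techniques such as multi-model voting and binary search to automatically construct evaluation benchmarks for specific application scenarios from large-scale data. This allows a finer-grained and more efficient evaluation of the practical scientific capabilities of large language models in domains such as chemistry and healthcare, without expert annotation or synthetic question generation.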
