arXiv submission date: 2026-02-10
📄 Abstract - Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, the models' understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. The benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus 700 additional questions: 350 open-ended questions and 350 questions with false premises that test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
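To make the per-split accuracy figures concrete, below is a minimal sketch of how graded answers could be tallied by question split for a benchmark shaped like Quantum-Audit. The split names, record format, and helper function are assumptions for illustration only, not the authors' actual evaluation code or data format.

```python
# Hypothetical sketch: per-split accuracy for a Quantum-Audit-style benchmark.
# Split names and the record layout are assumed, not taken from the paper.
from collections import defaultdict

# Assumed split labels mirroring the abstract: expert-written, LLM-extracted
# (expert-validated), open-ended, and false-premise questions.
EXAMPLE_RESULTS = [
    # Each record: (split name, whether the model's answer was graded correct)
    ("expert_written", True),
    ("llm_extracted", True),
    ("false_premise", False),
    ("open_ended", True),
]

def accuracy_by_split(results):
    """Group graded answers by split and return accuracy per split."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for split, is_correct in results:
        totals[split] += 1
        correct[split] += int(is_correct)
    return {split: correct[split] / totals[split] for split in totals}

if __name__ == "__main__":
    for split, acc in accuracy_by_split(EXAMPLE_RESULTS).items():
        print(f"{split}: {acc:.1%}")
```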

Top tags: llm model evaluation benchmark
Detailed tags: quantum computing reasoning evaluation knowledge assessment false premise detection expert benchmarking

Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing


1️⃣ One-Sentence Summary

This paper introduces Quantum-Audit, a new benchmark of 2,700 questions, and uses it to systematically evaluate how well 26 large language models understand quantum computing concepts. It finds that although the top models exceed the average human expert overall, they fall clearly short on expert-written questions, advanced topics, and critical-reasoning tasks that require identifying false premises.

Source: arXiv 2602.10092