arXiv submission date: 2026-02-24
📄 Abstract - QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early-graduate mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark that systematically measures alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common-knowledge criteria. By deploying a dual-evaluation matrix (7 judges × 5 solvers) validated against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators, such as Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick, exhibit significant positive bias (up to +0.18, +0.20, +0.30, and +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models such as GPT-5 Pro and Claude Sonnet 4.5 degrade significantly in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. Beyond these research results, we release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly available at this https URL.
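The "mean score inflation" reported in the abstract is, in essence, the average signed difference between a judge's scores and the human expert scores on the same set of proofs. A minimal sketch of that computation, with illustrative data (the function name and all scores below are hypothetical, not taken from QEDBench):

```python
def mean_score_inflation(judge_scores, human_scores):
    """Mean (judge - human) score difference over the same proofs.

    A positive value means the automated judge systematically scores
    proofs higher than human experts (positive bias); a negative value
    means it is harsher than humans.
    """
    assert len(judge_scores) == len(human_scores), "paired scores required"
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return sum(diffs) / len(diffs)


# Hypothetical scores in [0, 1] for five proofs, scored by a lenient judge:
human = [0.80, 0.60, 0.90, 0.50, 0.70]
judge = [0.95, 0.85, 0.95, 0.80, 0.85]

print(f"mean score inflation: {mean_score_inflation(judge, human):+.2f}")
# → mean score inflation: +0.18
```

In the paper's 7 judges × 5 solvers matrix, this statistic would be computed per judge, aggregated over all solver outputs that humans also scored.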

Top tags: llm model evaluation benchmark
Detailed tags: automated evaluation mathematical proofs alignment gap human-ai alignment judge bias

QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs


1️⃣ One-sentence summary

By releasing a new benchmark called QEDBench, this paper quantifies the significant, systematic deviation between the scores current mainstream large language models assign to upper-level undergraduate mathematical proofs and the scores of human experts, revealing the limitations of automated evaluation on complex reasoning tasks.

Source: arXiv:2602.20629