What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
1️⃣ One-Sentence Summary
By analyzing nine mainstream claim-verification datasets, this paper finds that existing benchmarks mainly test information retrieval and simple matching, while tasks that genuinely require complex reasoning, such as multi-sentence information synthesis and numerical reasoning, are severely under-represented; as a result, a model's high benchmark scores do not reflect its true reasoning ability.
Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
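For concreteness, the trace-generation step the abstract describes could look roughly like the sketch below. This is an illustrative assumption, not the authors' released pipeline: the prompt wording, the `trace` helper, and the exact label strings are hypothetical, with the category names taken from the abstract's taxonomy (direct extraction, multi-sentence synthesis, numerical reasoning).

```python
# Minimal sketch (not the authors' code): elicit a structured reasoning
# trace for one claim-evidence pair with GPT-4o-mini via the OpenAI SDK.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the label set mirrors categories named in the abstract.
PROMPT = """You are annotating a claim-verification example.
Claim: {claim}
Evidence: {evidence}

Return JSON with:
- "steps": a list of short reasoning steps from evidence to verdict
- "verdict": one of SUPPORTED / REFUTED / NOT_ENOUGH_INFO
- "reasoning_type": one of direct_extraction / multi_sentence_synthesis /
  numerical_reasoning / other
"""

def trace(claim: str, evidence: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # force parseable output
        messages=[{"role": "user",
                   "content": PROMPT.format(claim=claim, evidence=evidence)}],
    )
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    example = trace(
        claim="The Eiffel Tower is taller than 300 meters.",
        evidence="The Eiffel Tower is 330 metres tall.",
    )
    print(example["reasoning_type"], example["verdict"])
```

Running a step like this over all 24K examples and tallying `reasoning_type` per dataset would produce the kind of dataset-level breakdown the abstract reports.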
Source: arXiv:2604.01657