arXiv submission date: 2026-03-05
📄 Abstract - C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
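The abstract describes two perturbation types: replacing a single step with an acausal variant (yielding a known error position) and deleting steps at a controlled deletion rate. A minimal sketch of how such perturbations could be generated — function names and the `replacement_pool` argument are hypothetical, not taken from the paper:

```python
import random

def make_causal_perturbation(steps, replacement_pool, seed=0):
    """Replace one reasoning step with an acausal variant drawn from a pool.

    Returns (perturbed_steps, error_index); the index is the ground-truth
    label for the causal localization task. Hypothetical sketch, not the
    paper's actual construction code.
    """
    rng = random.Random(seed)
    idx = rng.randrange(len(steps))          # known causal error position
    perturbed = list(steps)
    perturbed[idx] = rng.choice(replacement_pool)
    return perturbed, idx

def make_coverage_perturbation(steps, deletion_rate, seed=0):
    """Delete a fraction of steps to simulate missing intermediate inferences.

    Returns (kept_steps, deleted_indices); the deleted indices serve as the
    reference labels against which coverage scores would be checked.
    """
    rng = random.Random(seed)
    n_delete = max(1, round(deletion_rate * len(steps)))
    deleted = set(rng.sample(range(len(steps)), n_delete))
    kept = [s for i, s in enumerate(steps) if i not in deleted]
    return kept, sorted(deleted)
```

A judge would then be scored on whether it flags the perturbed chain (binary detection), points to `error_index` (localization), or assigns a coverage score consistent with the deletion rate.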

Top tags: llm model evaluation benchmark
Detailed tags: faithfulness evaluation chain-of-thought reasoning assessment judge llm process evaluation

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning


1️⃣ One-sentence summary

This paper introduces a new benchmark, C2-Faith, for testing whether large language models can effectively evaluate the faithfulness of chain-of-thought reasoning (covering both logical causality and step completeness). It finds that current models perform inconsistently across tasks and struggle to precisely localize errors, offering practical guidance on how to choose a suitable AI judge.

Source: arXiv 2603.05167