arXiv submission date: 2026-04-14
📄 Abstract - Calibrated Confidence Estimation for Tabular Question Answering

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
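The core idea of Multi-Format Agreement can be sketched in a few lines: serialize the same table losslessly in several formats (Markdown, HTML, JSON, CSV), ask the model the same question once per serialization, and score confidence as the fraction of answers agreeing with the majority answer. This is a minimal illustrative sketch, not the paper's implementation; the function name and the simple majority-vote aggregation are assumptions, and the paper's exact answer normalization may differ.

```python
from collections import Counter

def mfa_confidence(answers):
    """Hypothetical sketch of Multi-Format Agreement (MFA).

    `answers` holds one model answer per table serialization format
    (e.g. Markdown, HTML, JSON, CSV). Confidence is the share of
    answers that match the majority answer.
    """
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(answers)

# Example: four serializations, three of which yield the same answer.
answer, confidence = mfa_confidence(["42", "42", "41", "42"])
```

Because each format requires only one query, MFA needs as many API calls as there are formats, which is how the abstract's ~20% cost saving over repeated-sampling baselines (e.g. self-consistency) can arise.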

Top-level tags: llm, model evaluation, natural language processing
Detailed tags: confidence calibration, tabular question answering, uncertainty estimation, model reliability, structured data

Calibrated Confidence Estimation for Tabular Question Answering


1️⃣ One-sentence summary

This paper presents the first systematic study of confidence calibration for large language models on tabular question answering. It finds that models are pervasively overconfident and proposes a new method, Multi-Format Agreement, which exploits the different serialization formats of tabular data to estimate the reliability of a model's answers more accurately and at lower cost.

Source: arXiv:2604.12491