📄
Abstract - EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa: 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotators conflict and use a judge model to resolve the labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier-LLM performance at a fraction of the inference cost.
EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
1️⃣ One-Sentence Summary
This paper introduces a large-scale dataset called EvasionBench and a novel multi-model annotation framework that mines the samples on which frontier AI models disagree to train small models efficiently; the resulting lightweight model approaches the performance of top-tier large models at a fraction of the cost, accurately detecting answers in which corporate executives dodge questions in financial Q&A.
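
To make the annotation pipeline concrete, here is a minimal Python sketch of the disagreement-mining loop described above. It is an illustration under stated assumptions, not the paper's actual implementation: the `annotate` and `judge` helpers are hypothetical placeholders for whichever LLM APIs are used, and the three evasion-level names are invented for the example.

```python
from dataclasses import dataclass

# Three evasion levels (names hypothetical; the paper only states there are three)
LEVELS = ["direct", "partially_evasive", "fully_evasive"]

@dataclass
class Sample:
    question: str  # analyst question from the earnings call
    answer: str    # executive's answer to classify

def annotate(model: str, sample: Sample) -> str:
    """Ask one frontier LLM annotator for an evasion-level label.
    Placeholder: wire this to your LLM API of choice."""
    raise NotImplementedError

def judge(model: str, sample: Sample, label_a: str, label_b: str) -> str:
    """Ask a judge model to resolve a conflict between the two annotators."""
    raise NotImplementedError

def build_training_set(samples, annotator_a, annotator_b, judge_model):
    """Label samples by two-model consensus; route disagreements to a judge."""
    labeled = []
    for s in samples:
        a = annotate(annotator_a, s)
        b = annotate(annotator_b, s)
        if a == b:
            # Easy case: both frontier annotators agree
            labeled.append((s, a, False))
        else:
            # Boundary case mined as a hard example; judge resolves the label
            resolved = judge(judge_model, s, a, b)
            labeled.append((s, resolved, True))
    return labeled
```

The judge-resolved subset flagged here (the `True` entries) corresponds to the disagreement-mined boundary cases that the abstract credits for the 2.4 percent gain over single-model distillation.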