Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models
1️⃣ One-Sentence Summary
This paper proposes a new label aggregation method that uses Ising models to account for dependencies among annotators, including large language models used as judges. It addresses the mispredictions that classical methods produce by assuming annotators are mutually independent, and achieves better results on real-world data.
Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.
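The abstract's central claim — that conditional-independence aggregation can flip the Bayes label even when every per-annotator marginal is matched exactly — can be illustrated with a tiny exact computation. The sketch below (parameters are hypothetical, chosen for illustration, not from the paper) uses $K=3$ judges: judge 1 is accurate and independent, while judges 2 and 3 are weakly informative but strongly coupled through a class-independent Ising interaction, as two LLM judges sharing a base model might be. Both posteriors are computed by brute-force enumeration over the $2^3$ vote configurations:

```python
import itertools
import math

def ising_energy(s, h, J):
    """Unnormalized log-probability of vote vector s under an Ising model
    with fields h and pairwise couplings J (upper-triangular)."""
    e = sum(hi * si for hi, si in zip(h, s))
    K = len(s)
    for i in range(K):
        for j in range(i + 1, K):
            e += J[i][j] * s[i] * s[j]
    return e

def exact_pmf(h, J, K=3):
    """Exact class-conditional vote distribution P(s | y) by enumeration."""
    configs = list(itertools.product([-1, 1], repeat=K))
    w = [math.exp(ising_energy(s, h, J)) for s in configs]
    Z = sum(w)
    return {s: wi / Z for s, wi in zip(configs, w)}

def marginal(pmf, i, val):
    """Per-judge marginal P(s_i = val | y), marginalized from the true model."""
    return sum(p for s, p in pmf.items() if s[i] == val)

# Hypothetical parameters: judge 1 accurate; judges 2-3 weakly informative
# but strongly correlated. Couplings are class-independent, fields flip
# sign with the true label Y.
h1, h0 = [1.0, 0.3, 0.3], [-1.0, -0.3, -0.3]
J = [[0, 0, 0], [0, 0, 2.0], [0, 0, 0]]

p1, p0 = exact_pmf(h1, J), exact_pmf(h0, J)

s_star = (-1, 1, 1)  # the coupled pair outvotes the accurate judge

# Dependence-aware Bayes posterior, uniform prior on Y.
post_exact = p1[s_star] / (p1[s_star] + p0[s_star])

# Conditional-independence ("naive Bayes") posterior built from the SAME
# per-judge marginals -- the setting where matched marginals still mislead.
lik1 = math.prod(marginal(p1, i, s_star[i]) for i in range(3))
lik0 = math.prod(marginal(p0, i, s_star[i]) for i in range(3))
post_naive = lik1 / (lik1 + lik0)

print(f"exact P(Y=1|s) = {post_exact:.3f}")   # below 0.5: Bayes label is 0
print(f"naive P(Y=1|s) = {post_naive:.3f}")   # above 0.5: label flips to 1
```

Because the coupling matrix is shared across classes here, the exact log-odds is linear in the votes (the quadratic terms cancel), yet its correlation-adjusted weights effectively discount the redundant pair — which is exactly what the independence-based aggregator fails to do.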
Source: arXiv: 2601.22336