InterPol: De-anonymizing LM Arena via Interpolated Preference Learning
1️⃣ One-sentence summary
This paper proposes a new method called INTERPOL that synthesizes interpolated data between models and learns their deep stylistic features, enabling it to reliably identify the true identity of anonymous large language models and exposing a serious security vulnerability in voting-based leaderboards such as LM Arena.
Strict anonymity of model responses is key to the reliability of voting-based leaderboards such as LM Arena. While prior studies have attempted to compromise this assumption using simple statistical features like TF-IDF or bag-of-words, these methods often lack the discriminative power to distinguish between stylistically similar or within-family models. To overcome these limitations and expose the severity of this vulnerability, we introduce INTERPOL, a model-driven identification framework that learns to distinguish target models from others using interpolated preference data. Specifically, INTERPOL captures deep stylistic patterns that superficial statistical features miss by synthesizing hard negative samples through model interpolation and employing an adaptive curriculum learning strategy. Extensive experiments demonstrate that INTERPOL significantly outperforms existing baselines in identification accuracy. Furthermore, we quantify the real-world threat of our findings through ranking manipulation simulations on Arena battle data.
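To make the core idea concrete, here is a minimal, hypothetical sketch of the two ingredients the abstract describes: synthesizing hard negatives by interpolating between a target model's features and another model's features, and training an identifier on a curriculum of increasing difficulty. Everything below is an illustrative stand-in, not the paper's implementation: the "stylistic features" are random Gaussian vectors, the interpolation is a simple linear mix in feature space, the curriculum is a fixed schedule of mixing coefficients, and the identifier is a tiny logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical stylistic feature vectors for responses from two models.
# (Stand-ins for the deep stylistic representations the paper learns.)
target_feats = rng.normal(loc=1.0, scale=1.0, size=(200, DIM))   # target model
other_feats = rng.normal(loc=-1.0, scale=1.0, size=(200, DIM))   # non-target model

def interpolate_negatives(target, other, alpha):
    """Synthesize hard negatives by linearly mixing target and non-target
    features; alpha closer to 1.0 yields negatives closer to the target,
    i.e. harder to distinguish."""
    return alpha * target + (1.0 - alpha) * other

# Curriculum-style schedule (an assumption, not the paper's adaptive rule):
# start with easy negatives (small alpha), then increase difficulty.
w, b, lr = np.zeros(DIM), 0.0, 0.1
for alpha in [0.1, 0.3, 0.5, 0.7]:
    hard_neg = interpolate_negatives(target_feats, other_feats, alpha)
    X = np.vstack([target_feats, hard_neg])
    y = np.concatenate([np.ones(200), np.zeros(200)])
    for _ in range(200):  # logistic regression via plain gradient descent
        z = np.clip(X @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)

# Evaluate on held-out responses from both models: the identifier should
# label target-model responses 1 and non-target responses 0.
test_t = rng.normal(1.0, 1.0, size=(100, DIM))
test_o = rng.normal(-1.0, 1.0, size=(100, DIM))
pred_t = (test_t @ w + b) > 0
pred_o = (test_o @ w + b) > 0
acc = (pred_t.sum() + (~pred_o).sum()) / 200
```

The design point this sketch illustrates is why interpolated negatives help: a classifier trained only against very different models can rely on coarse cues, whereas negatives mixed close to the target force the decision boundary to tighten around the target model's own stylistic signature.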
Source: arXiv:2603.15220