AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
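1️⃣ One-sentence summary
This paper introduces AREG, an adversarial negotiation game benchmark that evaluates both the persuasion and the resistance capabilities of large language models. It finds that the two capabilities are only weakly correlated and that models are generally better at defending than at persuading, which suggests that evaluating persuasion alone overlooks asymmetric weaknesses in model behavior.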
Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.
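To make the joint evaluation concrete, the following is a minimal sketch of how per-model persuasion and resistance scores could be derived from a round-robin tournament and then compared with a Spearman rank correlation. The scoring definitions here are assumptions made for illustration (persuasion as the mean fraction of resources extracted when attacking, resistance as the mean fraction retained when defending); the abstract does not specify the paper's exact metrics. The function names and the `extracted` data layout are likewise hypothetical.

```python
from statistics import mean


def round_robin_scores(models, extracted):
    """Aggregate per-model scores from a round-robin tournament.

    `extracted[(attacker, defender)]` is the fraction of the defender's
    resources the attacker obtained, in [0, 1]. This layout and both score
    definitions are illustrative assumptions, not the paper's metrics.
    """
    persuasion, resistance = {}, {}
    for m in models:
        others = [o for o in models if o != m]
        # Mean fraction extracted when m plays the persuader.
        persuasion[m] = mean(extracted[(m, o)] for o in others)
        # Mean fraction retained when m plays the defender.
        resistance[m] = mean(1.0 - extracted[(o, m)] for o in others)
    return persuasion, resistance


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy outcomes for three placeholder models; these are not results from the paper.
    models = ["A", "B", "C"]
    extracted = {
        ("A", "B"): 0.40, ("A", "C"): 0.10,
        ("B", "A"): 0.20, ("B", "C"): 0.30,
        ("C", "A"): 0.05, ("C", "B"): 0.15,
    }
    persuasion, resistance = round_robin_scores(models, extracted)
    rho = spearman([persuasion[m] for m in models],
                   [resistance[m] for m in models])
    print(persuasion, resistance, rho)
```

Under these assumed definitions, a low or moderate correlation between the two score lists would correspond to the dissociation the abstract describes: ranking well as a persuader says little about ranking well as a defender.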
Source: arXiv: 2602.16639