SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
1️⃣ One-Sentence Summary
This paper proposes an adversarial training framework consisting of a Generator and a Defender: through a dynamic adversarial game, it strengthens an LLM-based review system's resistance to maliciously embedded prompt attacks, thereby safeguarding the fairness of academic peer review.
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to hidden prompts -- adversarial instructions embedded in submissions to manipulate review outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained with a loss function inspired by Information Retrieval Generative Adversarial Networks (IRGAN), which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly greater resilience to novel and evolving threats than static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
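As a concrete illustration of the co-evolution the abstract describes, the sketch below implements the Generator/Defender minimax game in PyTorch at toy scale: random vectors stand in for LLM-encoded submissions, and every module name, reward shape, and hyperparameter is an illustrative assumption rather than the paper's implementation. Following the IRGAN recipe, the Defender is trained with a standard binary cross-entropy objective on clean versus attacked inputs, while the Generator, whose discrete prompt choices block ordinary backpropagation, is updated with REINFORCE, rewarded whenever the current Defender fails to flag its attack.

```python
# Minimal sketch of the Generator/Defender minimax loop, assuming toy
# random embeddings in place of real LLM-encoded submissions. All names,
# architectures, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, VOCAB, BATCH = 64, 1000, 32  # toy sizes standing in for LLM-scale ones

class Generator(nn.Module):
    """Samples a discrete 'hidden prompt' token per submission (REINFORCE)."""
    def __init__(self):
        super().__init__()
        self.policy = nn.Linear(EMB, VOCAB)  # logits over candidate attack tokens

    def forward(self, paper_emb):
        dist = torch.distributions.Categorical(logits=self.policy(paper_emb))
        tokens = dist.sample()                # non-differentiable: sample attacks
        return tokens, dist.log_prob(tokens)  # log-probs drive the policy gradient

class Defender(nn.Module):
    """Scores a submission embedding: logit that it hides an injected prompt."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(EMB, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, emb):
        return self.clf(emb).squeeze(-1)

token_table = nn.Embedding(VOCAB, EMB)  # fixed toy map: attack token -> embedding shift
G, D = Generator(), Defender()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(200):
    clean = torch.randn(BATCH, EMB)          # stand-in for clean submissions
    tokens, logp = G(clean)
    attacked = clean + token_table(tokens)   # submission with hidden prompt injected

    # Defender step: binary cross-entropy, clean = 0 vs. attacked = 1.
    d_logits = torch.cat([D(clean), D(attacked.detach())])
    labels = torch.cat([torch.zeros(BATCH), torch.ones(BATCH)])
    d_loss = F.binary_cross_entropy_with_logits(d_logits, labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: IRGAN-style policy gradient. Reward is the Defender's
    # log-probability of *missing* the attack, pushing the Generator toward
    # prompts the current Defender cannot detect.
    with torch.no_grad():
        reward = F.logsigmoid(-D(attacked))
        reward = reward - reward.mean()      # baseline for variance reduction
    g_loss = -(reward * logp).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The policy-gradient update is the key IRGAN-inspired ingredient: because attack prompts are discrete text, the Generator cannot receive gradients through the Defender directly, so the Defender's miss rate is treated as a reinforcement-learning reward instead, which is what keeps the two models in the escalating arms race the abstract describes.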
Source: arXiv: 2604.26506