SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
1️⃣ One-Sentence Summary
This paper proposes an adversarial training framework consisting of a Generator and a Defender: through a dynamic adversarial game, it strengthens an LLM-based review system's resistance to maliciously embedded prompt attacks, thereby safeguarding the fairness of academic peer review.
As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to hidden prompts -- adversarial instructions embedded in submissions to manipulate review outcomes -- emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained with a loss function inspired by Information Retrieval Generative Adversarial Networks (IRGAN), which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly greater resilience to novel and evolving threats than static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
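As a concrete illustration of the co-evolution the abstract describes, the sketch below implements the Generator/Defender minimax game in PyTorch at toy scale: random vectors stand in for LLM-encoded submissions, and every module name, reward shape, and hyperparameter is an illustrative assumption rather than the paper's implementation. Following the IRGAN recipe, the Defender is trained with a standard binary cross-entropy objective on clean versus attacked inputs, while the Generator, whose discrete prompt choices block ordinary backpropagation, is updated with REINFORCE, rewarded whenever the current Defender fails to flag its attack.

```python
# Minimal sketch of the Generator/Defender minimax loop, assuming toy
# random embeddings in place of real LLM-encoded submissions. All names,
# architectures, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, VOCAB, BATCH = 64, 1000, 32  # toy sizes standing in for LLM-scale ones

class Generator(nn.Module):
    """Samples a discrete 'hidden prompt' token per submission (REINFORCE)."""
    def __init__(self):
        super().__init__()
        self.policy = nn.Linear(EMB, VOCAB)  # logits over candidate attack tokens

    def forward(self, paper_emb):
        dist = torch.distributions.Categorical(logits=self.policy(paper_emb))
        tokens = dist.sample()                # non-differentiable: sample attacks
        return tokens, dist.log_prob(tokens)  # log-probs drive the policy gradient

class Defender(nn.Module):
    """Scores a submission embedding: logit that it hides an injected prompt."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(EMB, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, emb):
        return self.clf(emb).squeeze(-1)

token_table = nn.Embedding(VOCAB, EMB)  # fixed toy map: attack token -> embedding shift
G, D = Generator(), Defender()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(200):
    clean = torch.randn(BATCH, EMB)          # stand-in for clean submissions
    tokens, logp = G(clean)
    attacked = clean + token_table(tokens)   # submission with hidden prompt injected

    # Defender step: binary cross-entropy, clean = 0 vs. attacked = 1.
    d_logits = torch.cat([D(clean), D(attacked.detach())])
    labels = torch.cat([torch.zeros(BATCH), torch.ones(BATCH)])
    d_loss = F.binary_cross_entropy_with_logits(d_logits, labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: IRGAN-style policy gradient. Reward is the Defender's
    # log-probability of *missing* the attack, pushing the Generator toward
    # prompts the current Defender cannot detect.
    with torch.no_grad():
        reward = F.logsigmoid(-D(attacked))
        reward = reward - reward.mean()      # baseline for variance reduction
    g_loss = -(reward * logp).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The policy-gradient update is the key IRGAN-inspired ingredient: because attack prompts are discrete text, the Generator cannot receive gradients through the Defender directly, so the Defender's miss rate is treated as a reinforcement-learning reward instead, which is what keeps the two models in the escalating arms race the abstract describes.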
Source: arXiv: 2604.26506