When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
1️⃣ One-Sentence Summary
This paper finds that in personalized dialogue AI, seemingly harmless user memories can mislead the model into judging harmful requests as legitimate, substantially increasing the success rate of safety attacks; it also introduces a benchmark and a lightweight method to detect and mitigate this risk.
Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%-243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation in the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.
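To make the detection-reflection idea concrete, below is a minimal hedged sketch of how such a guard could wrap a memory-augmented agent: the query's intent is first judged without any personal memory in context, and flagged queries are routed through an explicit reflection step before a personalized answer is produced. All function names (`call_llm`, `is_flagged_harmful`, `detect_then_reflect`) and prompts are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a detection-reflection guard for a memory-augmented agent.
# The structure follows the abstract's description only; details are assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real client in practice."""
    raise NotImplementedError


def is_flagged_harmful(query: str) -> bool:
    """Detection: judge the raw query statelessly, with no personal memory in context."""
    verdict = call_llm(
        "Classify the following request as SAFE or HARMFUL, "
        "ignoring any user background.\nRequest: " + query
    )
    return "HARMFUL" in verdict.upper()


def answer_with_memory(query: str, memories: list[str]) -> str:
    """Normal personalized path: condition the response on retrieved memories."""
    context = "\n".join(memories)
    return call_llm(f"User memories:\n{context}\n\nUser request: {query}")


def detect_then_reflect(query: str, memories: list[str]) -> str:
    """Route flagged queries through a reflection prompt instead of the personalized path."""
    if is_flagged_harmful(query):
        # Reflection: ask the model to re-assess the request on its own merits,
        # so benign personal context cannot "legitimize" a harmful intent.
        return call_llm(
            "The following request was flagged as potentially harmful. "
            "Re-assess it on its own merits; personal background must not make "
            "a harmful request acceptable. If it is harmful, refuse.\n"
            f"Request: {query}"
        )
    return answer_with_memory(query, memories)
```

The design choice in this sketch is that the safety judgment is decoupled from the personalized context, which directly targets the intent-legitimation failure the paper describes; the actual method may differ.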
Source: arXiv: 2601.17887