隐私崩溃:良性的微调可能破坏语言模型中的上下文隐私 / Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
1️⃣ 一句话总结
这篇论文发现,即使是为了提升性能而进行的良性微调,也可能意外地破坏大型语言模型保护用户隐私的能力,使其在不该泄露信息时泄露信息,而这一隐患在常规的安全测试中很难被发现。
We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a ``silent failure'' because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.
隐私崩溃:良性的微调可能破坏语言模型中的上下文隐私 / Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models
这篇论文发现,即使是为了提升性能而进行的良性微调,也可能意外地破坏大型语言模型保护用户隐私的能力,使其在不该泄露信息时泄露信息,而这一隐患在常规的安全测试中很难被发现。
源自 arXiv: 2601.15220