Abstract - Continual Safety Alignment via Gradient-Based Sample Selection
Large language models require continuous adaptation to new tasks while preserving safety alignment. However, fine-tuning on even benign data often compromises safety behaviors, including refusal of harmful requests, truthfulness, and commonsense reasoning. We investigate which training samples cause alignment drift through a data-centric lens. Our empirical analysis shows samples contribute unequally: high-gradient samples cause greater safety degradation and drive models toward pretrained distributions, while moderate-gradient samples enable task learning with minimal alignment loss. We propose gradient-based sample selection that filters high-gradient samples during fine-tuning. Across multiple model families on continual domain tasks, our method substantially improves alignment preservation while maintaining competitive task performance, without requiring curated safe data or architectural modifications. Our method is robust across selection ratios, task orderings, and diverse attack benchmarks.
Continual Safety Alignment via Gradient-Based Sample Selection
1️⃣ One-Sentence Summary
This paper finds that when large language models continually learn new tasks, individual training samples affect safety very unevenly: high-gradient samples tend to break the model's safety alignment, while moderate-gradient samples support task learning and safety preservation at the same time. Building on this observation, the authors propose a simple gradient-based filtering method that automatically discards high-gradient samples during fine-tuning. Without extra safety data or architectural changes, it effectively prevents continual learning from eroding safety behaviors such as refusing harmful requests and remaining truthful.
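The selection rule described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the authors' implementation): it scores each sample by the norm of its per-sample loss gradient under a toy logistic-regression model, then drops the top fraction of highest-gradient samples before fine-tuning. The function names (`grad_norm`, `select_moderate_gradient`) and the `drop_ratio` parameter are assumptions for illustration.

```python
import math

def grad_norm(w, x, y):
    """Norm of the per-sample logistic-loss gradient: (sigmoid(w·x) - y) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    g = [(p - y) * xi for xi in x]
    return math.sqrt(sum(gi * gi for gi in g))

def select_moderate_gradient(samples, w, drop_ratio=0.2):
    """Keep the lower-gradient samples; discard the top `drop_ratio` fraction.

    `samples` is a list of (feature_vector, label) pairs; `w` is the current
    model weights. This mirrors the paper's idea of filtering high-gradient
    samples, using a toy model in place of an LLM.
    """
    scored = sorted(samples, key=lambda s: grad_norm(w, s[0], s[1]))
    keep = int(len(scored) * (1 - drop_ratio))
    return scored[:keep]
```

In practice one would compute per-sample gradient norms with respect to the LLM's fine-tuning loss (e.g. via per-example backward passes) and retrain only on the retained subset; the toy model here just makes the selection logic concrete.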