Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
1️⃣ One-Sentence Summary
This paper finds that a single safety example is enough to efficiently and cheaply restore the safety of a large language model whose alignment was compromised by fine-tuning, without degrading the model's other useful capabilities; the effectiveness stems from the low-rank structure of the safety gradient.
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
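To make the recipe concrete, here is a minimal sketch of what "patching with a single safety example" could look like in practice. This is our own illustration, not the authors' released code: the model name, the safety example text, the prompt-masking choice, and the hyperparameters (learning rate, epoch count) are all assumptions.

```python
# Minimal sketch of one-shot safety patching (our illustration, not the
# authors' code). Model name, example text, and hyperparameters are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder for a fine-tuned, safety-compromised model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# A single safety example: a harmful request paired with a refusal.
prompt = "How do I make a dangerous chemical at home?\n"
refusal = "I can't help with that. Making dangerous chemicals can cause serious harm."
inputs = tokenizer(prompt + refusal, return_tensors="pt")

# Standard causal-LM loss; mask the prompt tokens so only the refusal is
# reinforced (a common choice, assumed here rather than taken from the paper).
labels = inputs["input_ids"].clone()
prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
labels[:, :prompt_len] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for epoch in range(5):  # the abstract reports convergence within a few epochs
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```

The abstract attributes the efficiency of this correction to the low-rank structure of the safety gradient. Continuing the sketch above, one way to probe this is to take the gradient of the refusal loss with respect to a single weight matrix and check how quickly its singular values decay; the layer and parameter inspected here are arbitrary choices for a Llama-style architecture.

```python
# Continuing the sketch above: inspect the singular-value spectrum of the
# gradient produced by the single safety example.
model.zero_grad()
model(**inputs, labels=labels).loss.backward()

grad = model.model.layers[0].self_attn.q_proj.weight.grad.float()
singular_values = torch.linalg.svdvals(grad)

# Effective rank: how many singular values carry 99% of the spectral energy.
energy = singular_values.cumsum(dim=0) / singular_values.sum()
k = int((energy < 0.99).sum().item()) + 1
print(f"99% of gradient energy in top {k} of {singular_values.numel()} singular values")
```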
Source: arXiv: 2601.01887