Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
1️⃣ One-Sentence Summary
This paper finds that a single safety example is enough to efficiently and cheaply restore the safety of a large language model whose alignment was compromised by fine-tuning, without degrading the model's other useful capabilities; the effectiveness stems from the low-rank structure of the safety gradient.
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
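To make the recipe concrete, here is a minimal sketch of what "patching with a single safety example" could look like in practice. This is our own illustration, not the authors' released code: the model name, the safety example text, the prompt-masking choice, and the hyperparameters (learning rate, epoch count) are all assumptions.

```python
# Minimal sketch of one-shot safety patching (our illustration, not the
# authors' code). Model name, example text, and hyperparameters are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder for a fine-tuned, safety-compromised model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# A single safety example: a harmful request paired with a refusal.
prompt = "How do I make a dangerous chemical at home?\n"
refusal = "I can't help with that. Making dangerous chemicals can cause serious harm."
inputs = tokenizer(prompt + refusal, return_tensors="pt")

# Standard causal-LM loss; mask the prompt tokens so only the refusal is
# reinforced (a common choice, assumed here rather than taken from the paper).
labels = inputs["input_ids"].clone()
prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
labels[:, :prompt_len] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for epoch in range(5):  # the abstract reports convergence within a few epochs
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```

The abstract attributes the efficiency of this correction to the low-rank structure of the safety gradient. Continuing the sketch above, one way to probe this is to take the gradient of the refusal loss with respect to a single weight matrix and check how quickly its singular values decay; the layer and parameter inspected here are arbitrary choices for a Llama-style architecture.

```python
# Continuing the sketch above: inspect the singular-value spectrum of the
# gradient produced by the single safety example.
model.zero_grad()
model(**inputs, labels=labels).loss.backward()

grad = model.model.layers[0].self_attn.q_proj.weight.grad.float()
singular_values = torch.linalg.svdvals(grad)

# Effective rank: how many singular values carry 99% of the spectral energy.
energy = singular_values.cumsum(dim=0) / singular_values.sum()
k = int((energy < 0.99).sum().item()) + 1
print(f"99% of gradient energy in top {k} of {singular_values.numel()} singular values")
```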
Source: arXiv: 2601.01887