arXiv submission date: 2026-02-17
📄 Abstract - The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
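
As a back-of-the-envelope reading of the quartic law (a heuristic sketch in our own notation, not the paper's derivation; the symbols $u$, $v$, $\kappa$, $\lambda$, $\eta$, and $t$ are introduced here for illustration): if early updates are orthogonal to the alignment-critical direction $v$ but the fine-tuning Hessian has a cross-curvature term $\kappa$ coupling the task direction $u$ to $v$, the drift into $v$ grows quadratically in time, and a sharp quadratic alignment loss with curvature $\lambda$ then amplifies that drift quartically.

```latex
% Heuristic sketch in our own notation (not the paper's derivation).
% u = benign task direction, v = alignment-critical direction,
% \kappa = cross-curvature coupling, \lambda = alignment sharpness,
% \eta = step size, t = number of gradient steps.
\begin{align*}
  \|\theta_u(t)\| &\approx \eta\, t
    && \text{(first-order progress on the benign task)} \\
  \|\theta_v(t)\| &\approx \tfrac{1}{2}\,\kappa\,\eta^{2} t^{2}
    && \text{(second-order leakage via cross-curvature)} \\
  \Delta L_{\text{align}}(t) &\approx \tfrac{\lambda}{2}\,\|\theta_v(t)\|^{2}
    \approx \tfrac{\lambda\,\kappa^{2}\,\eta^{4}}{8}\; t^{4}
    && \text{(sharp curvature amplifies the drift quartically)}
\end{align*}
```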

Top tags: llm model training theory
Detailed tags: alignment collapse safety degradation fine-tuning geometric analysis gradient descent

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety


1️⃣ One-sentence summary

This paper finds that even fine-tuning an already-aligned large language model on harmless data can systematically and unpredictably erode its safety guardrails during training, because the safety-alignment structure in parameter space is inherently geometrically fragile: gradient descent cannot perceive or avoid the high-curvature, low-dimensional sensitive subspaces where alignment concentrates.
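
To make the mechanism concrete, below is a minimal toy simulation (our own construction for illustration, not the paper's model or experiment): a two-parameter model whose fine-tuning gradient starts perfectly orthogonal to an alignment-critical direction, yet whose cross-curvature coupling steadily steers plain gradient descent into that direction, so the alignment loss grows roughly with the fourth power of the step count. The values of `eta`, `c`, and `lam` are arbitrary placeholders.

```python
import numpy as np

# Toy 2-D illustration (hypothetical construction, not the paper's model).
# theta[0] is the benign fine-tuning direction; theta[1] is a sharp,
# alignment-critical direction.
eta = 1e-2   # learning rate (arbitrary)
c   = 0.5    # cross-curvature coupling between the two directions (arbitrary)
lam = 10.0   # sharpness (curvature) of the alignment loss (arbitrary)

def finetune_grad(theta):
    """Gradient of a toy fine-tuning loss L_ft = -theta[0] + c*theta[0]*theta[1].

    At theta = 0 the gradient is (-1, 0): perfectly orthogonal to the
    alignment direction. The Hessian cross term c nevertheless couples the
    two directions, so progress along theta[0] leaks into theta[1]."""
    return np.array([-1.0 + c * theta[1], c * theta[0]])

def alignment_loss(theta):
    """Alignment loss concentrated in the sharp direction: (lam/2) * theta[1]^2."""
    return 0.5 * lam * theta[1] ** 2

for steps in (25, 50, 100):
    theta = np.zeros(2)
    for _ in range(steps):
        theta -= eta * finetune_grad(theta)  # plain gradient descent on L_ft only
    print(f"{steps:4d} steps -> alignment loss {alignment_loss(theta):.4e}")

# Doubling the number of steps multiplies the alignment loss by roughly 16 (= 2^4):
# the second-order drift into the sensitive direction produces quartic-in-time damage.
```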

Source: arXiv 2602.15799