
arXiv submission date: 2026-03-04
📄 Abstract - Efficient Refusal Ablation in LLM through Optimal Transport

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.
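The core construction the abstract describes (PCA dimensionality reduction combined with the closed-form Monge map between two Gaussians) can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation; the function name `gaussian_ot_map`, the number of PCA components, and the regularization jitter are all assumptions for the sketch.

```python
import numpy as np

def gaussian_ot_map(X_src, X_tgt, n_components=32):
    """Sketch: closed-form Gaussian optimal transport between two
    activation distributions, computed inside a PCA subspace.
    X_src: (n, d) "harmful" activations; X_tgt: (m, d) "harmless" ones."""
    mu_s, mu_t = X_src.mean(0), X_tgt.mean(0)

    # PCA basis from the pooled, per-set-centered activations.
    pooled = np.vstack([X_src - mu_s, X_tgt - mu_t])
    _, _, Vt = np.linalg.svd(pooled, full_matrices=False)
    P = Vt[:n_components]                      # (k, d) projection matrix

    # Subspace coordinates and regularized covariances.
    S = (X_src - mu_s) @ P.T
    T = (X_tgt - mu_t) @ P.T
    Cs = np.cov(S, rowvar=False) + 1e-6 * np.eye(n_components)
    Ct = np.cov(T, rowvar=False) + 1e-6 * np.eye(n_components)

    def sqrtm_psd(M):
        # Symmetric PSD matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(M)
        return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

    # Closed-form Monge map between Gaussians:
    # A = Cs^{-1/2} (Cs^{1/2} Ct Cs^{1/2})^{1/2} Cs^{-1/2}
    Cs_half = sqrtm_psd(Cs)
    Cs_half_inv = np.linalg.inv(Cs_half)
    A = Cs_half_inv @ sqrtm_psd(Cs_half @ Ct @ Cs_half) @ Cs_half_inv

    # Mean shift of the target distribution, expressed in the subspace.
    delta = (mu_t - mu_s) @ P.T

    def transport(x):
        """Move activation(s) x toward the harmless distribution,
        editing only the component lying in the PCA subspace."""
        z = (x - mu_s) @ P.T                   # project
        z_t = z @ A.T + delta                  # Gaussian OT in subspace
        return x + (z_t - z) @ P               # write back the edit
    return transport
```

Restricting the transport to a low-dimensional PCA subspace is what keeps the closed-form map tractable at transformer hidden sizes: the matrix square roots are computed on k-by-k covariances rather than d-by-d ones, while directions outside the subspace pass through unchanged.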

Top-level tags: llm, model evaluation, theory
Detailed tags: safety alignment, jailbreaking, optimal transport, activation distribution, refusal ablation

基于最优传输的大语言模型高效拒绝行为消除 / Efficient Refusal Ablation in LLM through Optimal Transport


1️⃣ One-sentence summary

This paper proposes a new method based on optimal transport theory that jailbreaks the safety guardrails of large language models more effectively by transforming the entire distribution of harmful internal activations into the harmless distribution, and finds that safety mechanisms may be concentrated in specific layers of the network rather than distributed globally.

Source: arXiv:2603.04355