arXiv submission date: 2026-02-02
📄 Abstract - There Is More to Refusal in Large Language Models than a Single Direction

Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
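The operations the abstract refers to (steering along a direction, ablating it, and comparing directions geometrically) can be sketched on toy vectors. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how directions are extracted, so the difference-of-means estimator below is a commonly used stand-in, and all names (`difference_of_means`, `steer`, `ablate`, `cosine`, `safety_dir`, `unsupported_dir`) and the random "activations" are hypothetical.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def difference_of_means(refused, complied):
    """Assumed estimator: unit vector from mean complied to mean refused activation."""
    return unit(refused.mean(axis=0) - complied.mean(axis=0))

def steer(h, direction, alpha):
    """Linear steering: shift activations by alpha along the unit direction."""
    return h + alpha * unit(direction)

def ablate(h, direction):
    """Directional ablation: remove the component of h along the direction."""
    d = unit(direction)
    return h - np.asarray(h @ d)[..., None] * d

def cosine(u, v):
    """Cosine similarity; values near 0 mean geometrically distinct directions."""
    return float(unit(u) @ unit(v))

# Toy setup: two hypothetical refusal categories with orthogonal directions.
rng = np.random.default_rng(0)
dim = 8
safety_dir = np.eye(dim)[0]
unsupported_dir = np.eye(dim)[1]

h = rng.normal(size=(4, dim))              # stand-in for residual activations
refused = h + 2.0 * safety_dir             # synthetic "refusing" activations
estimated = difference_of_means(refused, h)

steered = steer(h, safety_dir, alpha=3.0)  # alpha plays the role of the knob
ablated = ablate(h, safety_dir)
similarity = cosine(safety_dir, unsupported_dir)
```

In this toy setting `alpha` is the single scalar being turned. The abstract's claim is that, in real models, turning it along any refusal-related direction traces nearly the same refusal versus over-refusal trade-off, even when the directions themselves have low cosine similarity.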

Top-level tags: llm, model evaluation, theory
Detailed tags: refusal behavior, activation steering, model safety, interpretability, latent space

There Is More to Refusal in Large Language Models than a Single Direction


1️⃣ One-Sentence Summary

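This paper finds that refusal in large language models is not governed by a single activation direction but corresponds to several geometrically distinct directions, yet steering along any of them acts as a shared one-dimensional control knob, so the choice of direction mainly changes how the model refuses rather than whether it refuses.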

Source: arXiv: 2602.02132