
arXiv submission date: 2026-02-02

📄 Abstract - There Is More to Refusal in Large Language Models than a Single Direction

Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
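The abstract refers to steering and ablation along an activation-space direction, and to directions being geometrically distinct. As a rough illustration only, the sketch below shows what these operations typically look like on toy vectors. It assumes a difference-of-means estimate for each direction, a common choice in earlier work on refusal directions that is not confirmed here as this paper's method. All function names, category labels, and numbers are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def mean_difference_direction(refused, complied):
    """Unit vector pointing from the mean 'complied' activation
    to the mean 'refused' activation (assumed estimator)."""
    dim = len(refused[0])
    mean_r = [sum(x[i] for x in refused) / len(refused) for i in range(dim)]
    mean_c = [sum(x[i] for x in complied) / len(complied) for i in range(dim)]
    return unit([r - c for r, c in zip(mean_r, mean_c)])

def steer(h, direction, alpha):
    """Linear steering: move activation h by alpha along a unit direction."""
    return [a + alpha * d for a, d in zip(h, direction)]

def ablate(h, direction):
    """Directional ablation: remove the component of h along a unit direction."""
    proj = dot(h, direction)
    return [a - proj * d for a, d in zip(h, direction)]

def cosine(u, v):
    """Cosine similarity; values near 0 mean the directions are nearly orthogonal."""
    return dot(unit(u), unit(v))

# Toy 3-dimensional "activations" for two hypothetical refusal categories.
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
safety_refused = [[2.0, 0.0, 1.0], [2.0, 0.0, -1.0]]
unsupported_refused = [[0.0, 3.0, 0.5], [0.0, 3.0, -0.5]]

d_safety = mean_difference_direction(safety_refused, complied)
d_unsupported = mean_difference_direction(unsupported_refused, complied)

h = [5.0, 2.0, 1.0]
h_ablated = ablate(h, d_safety)          # component along d_safety removed
h_steered = steer(h, d_unsupported, 3.0)  # pushed along a different direction
similarity = cosine(d_safety, d_unsupported)
```

In this toy setup the two category directions are orthogonal, which is a deliberately extreme version of "geometrically distinct". The sketch says nothing about the paper's actual finding that steering along different directions yields nearly identical refusal versus over-refusal trade-offs; that is an empirical result about real models.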

Top tags: llm model evaluation theory

There Is More to Refusal in Large Language Models than a Single Direction

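1️⃣ One-sentence summary

This paper finds that refusal in large language models is not controlled by a single activation direction but corresponds to multiple geometrically distinct directions. Steering along any of them still acts like a shared one-dimensional knob with nearly the same refusal versus over-refusal trade-off, so the choice of direction mainly affects how the model refuses rather than whether it refuses.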


Source: arXiv: 2602.02132
