arXiv submission date: 2026-04-09
📄 Abstract - What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.
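The two operations the abstract describes can be sketched in a few lines: adding a scaled steering vector to the residual stream, and sparsifying that vector by keeping only its largest-magnitude dimensions. The sketch below is a minimal NumPy toy under assumed shapes and a hypothetical `keep_frac` parameter; it is illustrative, not the paper's implementation.

```python
import numpy as np

def apply_steering(hidden, vec, alpha=1.0):
    """Add a scaled steering vector to every token's residual-stream activation.

    hidden: (seq_len, d_model) activations; vec: (d_model,) steering vector.
    Broadcasting applies the same shift at every sequence position.
    """
    return hidden + alpha * vec

def sparsify(vec, keep_frac=0.05):
    """Zero out all but the top keep_frac largest-magnitude dimensions."""
    k = max(1, int(len(vec) * keep_frac))
    keep = np.argsort(np.abs(vec))[-k:]
    out = np.zeros_like(vec)
    out[keep] = vec[keep]
    return out

# Toy example: a 95%-sparsified vector still shifts the activations.
rng = np.random.default_rng(0)
d_model = 512
vec = rng.standard_normal(d_model)
sparse_vec = sparsify(vec, keep_frac=0.05)    # 95% of dimensions zeroed
hidden = rng.standard_normal((8, d_model))    # toy (seq_len, d_model) activations
steered = apply_steering(hidden, sparse_vec, alpha=4.0)
print(np.count_nonzero(sparse_vec))           # 25 dimensions kept
```

In this toy setup, the steered activations differ from the originals only along the retained dimensions, which mirrors the paper's finding that a small subset of important dimensions carries most of the steering effect.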

Top-level tags: llm theory model evaluation
Detailed tags: steering vectors mechanistic interpretability activation patching refusal behavior circuit analysis

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal


1️⃣ One-sentence summary

By studying how "steering vectors" alter the refusal behavior of large language models, this paper shows that the technique works chiefly by influencing specific circuits within the model's attention mechanism, and that these vectors can be heavily sparsified while remaining effective, offering a clear mechanistic account of this model alignment technique.

Source: arXiv 2604.08524