From Weights to Activations: Is Steering the Next Frontier of Adaptation?
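1️⃣ One-sentence summary
This paper argues that steering, which guides a model's behavior by directly modifying its internal activations at inference time, should be regarded as a distinct and effective model adaptation paradigm: it enables local, reversible behavioral adjustment without parameter updates and offers a new unified perspective for research on model adaptation.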
Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.
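To make the idea of a targeted, reversible intervention in activation space concrete, here is a minimal sketch in plain Python. It is not taken from the paper: `TinyModel`, `make_steering_hook`, the weights, and the steering direction are all hypothetical, and a two-layer toy network stands in for a language model. The sketch only illustrates the properties the abstract names: the intervention acts on hidden activations at inference time, leaves the parameters untouched, and can be removed to recover the original behavior.

```python
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyModel:
    """Hypothetical two-layer network standing in for a language model."""

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        self.w_in = w_in
        self.w_out = w_out
        # Optional intervention applied to the hidden activations.
        self.hook: Optional[Callable[[Vector], Vector]] = None

    def forward(self, x: Vector) -> Vector:
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]  # ReLU layer
        if self.hook is not None:
            hidden = self.hook(hidden)  # steer in activation space
        return matvec(self.w_out, hidden)


def make_steering_hook(direction: Vector, strength: float) -> Callable[[Vector], Vector]:
    """Return a hook that adds strength * direction to the hidden activations."""
    def hook(hidden: Vector) -> Vector:
        return [h + strength * d for h, d in zip(hidden, direction)]
    return hook


# Assumed toy weights and input, chosen only so the arithmetic is easy to follow.
model = TinyModel(w_in=[[1.0, 0.0], [0.0, 1.0]], w_out=[[1.0, -1.0]])
x = [2.0, 1.0]

baseline = model.forward(x)

# Attach the steering hook: behavior changes, weights do not.
model.hook = make_steering_hook(direction=[0.0, 1.0], strength=3.0)
steered = model.forward(x)

# Detach the hook: the original behavior is recovered exactly.
model.hook = None
restored = model.forward(x)

print(baseline, steered, restored)
```

In practice, steering methods typically apply a similar additive shift to a chosen layer's activations in a real transformer, for example through a framework's forward-hook mechanism. How the direction is obtained varies by method and is outside the scope of this sketch.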
Source: arXiv: 2604.14090