Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
1️⃣ One-sentence summary
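By analyzing the internal activations of large language models before they generate a response, this study finds that a model has already "decided" whether to refuse a dangerous request well before the final layer, and it uses this finding to build a more efficient attack method that substantially speeds up the search for harmful prompts while reducing compute cost.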
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.
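The layer-wise probing idea in the abstract can be illustrated with a minimal sketch. This is not the paper's code: it trains one logistic-regression probe per layer on synthetic activations, where a hypothetical "refusal direction" is planted with a strength that grows with depth. All names (`make_synthetic_activations`, `train_linear_probe`, `layerwise_probe_accuracy`) and all sizes are assumptions chosen for illustration; in a real setting the synthetic arrays would be replaced by residual stream activations collected from the model.

```python
import numpy as np


def make_synthetic_activations(n_samples=400, n_layers=6, d_model=32, seed=0):
    """Stand-in for residual stream activations (assumption: synthetic data).

    Returns acts with shape (n_layers, n_samples, d_model) and binary labels
    (1 = refuse, 0 = comply). The label signal is absent at layer 0 and
    strongest at the last layer.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)
    direction = rng.normal(size=d_model)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n_layers, n_samples, d_model))
    signs = 2 * labels[:, None] - 1  # map {0, 1} to {-1, +1}
    for layer in range(n_layers):
        strength = 4.0 * layer / (n_layers - 1)
        acts[layer] += strength * signs * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, x, y):
    """Fraction of samples whose predicted label matches y."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


def layerwise_probe_accuracy(acts, labels, train_frac=0.75):
    """Train one probe per layer and report held-out accuracy for each."""
    n_train = int(train_frac * len(labels))
    accuracies = []
    for layer_acts in acts:
        w, b = train_linear_probe(layer_acts[:n_train], labels[:n_train])
        acc = probe_accuracy(w, b, layer_acts[n_train:], labels[n_train:])
        accuracies.append(acc)
    return accuracies


if __name__ == "__main__":
    acts, labels = make_synthetic_activations()
    for layer, acc in enumerate(layerwise_probe_accuracy(acts, labels)):
        print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

Reading the per-layer accuracies from first to last layer shows where the label becomes linearly decodable, which is the kind of evidence the abstract describes. The synthetic setup guarantees that outcome by construction, so it demonstrates the measurement, not the finding.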
Source: arXiv: 2605.28553