Language-Switching Triggers Take a Latent Detour Through Language Models

📄 Abstract - Language-Switching Triggers Take a Latent Detour Through Language Models

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
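The point about the orthogonal encoding can be illustrated with a small numerical sketch. This is a toy example with random vectors, not the paper's model or method: `lang_direction`, `clean_state`, `d_model`, and the helper functions are all hypothetical names chosen for illustration. It shows that a probe which only measures the projection of a hidden state onto a language-identity direction reads the same score before and after a large signal is added in the orthogonal subspace.

```python
import numpy as np


def projection_onto(direction, vector):
    """Scalar projection of `vector` onto the unit-normalised `direction`."""
    unit = direction / np.linalg.norm(direction)
    return float(vector @ unit)


def make_orthogonal_signal(direction, rng, scale=5.0):
    """Random vector of norm `scale` with its component along `direction` removed."""
    unit = direction / np.linalg.norm(direction)
    raw = rng.standard_normal(direction.shape[0])
    orth = raw - (raw @ unit) * unit  # remove the component along `direction`
    return scale * orth / np.linalg.norm(orth)


rng = np.random.default_rng(0)
d_model = 64  # toy hidden size, an assumption for illustration only

# Hypothetical stand-ins: a language-identity direction and a clean hidden state.
lang_direction = rng.standard_normal(d_model)
clean_state = rng.standard_normal(d_model)

# A "trigger" signal that lives entirely in the orthogonal subspace.
trigger_signal = make_orthogonal_signal(lang_direction, rng)
triggered_state = clean_state + trigger_signal

# What a probe restricted to the language direction would read.
clean_score = projection_onto(lang_direction, clean_state)
triggered_score = projection_onto(lang_direction, triggered_state)

# How far the hidden state actually moved.
shift_norm = float(np.linalg.norm(triggered_state - clean_state))

print(f"probe score change: {abs(triggered_score - clean_score):.2e}")
print(f"hidden-state shift norm: {shift_norm:.2f}")
```

In this toy setting the probe score is unchanged up to floating-point error even though the hidden state has moved by a vector of norm 5, which is the sense in which a detector looking only along the language direction would see nothing.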

1️⃣ One-sentence summary

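This study uncovers the mechanism behind a backdoor attack on a large language model: a trigger sequence of three Latin words (nine tokens) makes the model switch its output from English to French. The trigger does not act by directly shifting the model's language-identity direction. Instead, attention heads gather the trigger information at a single sequence position, the signal passes through the middle layers, and an MLP layer converts it into French output, which lets the trigger evade conventional defenses that look for language-like features.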
