arXiv submission date: 2026-02-11
📄 Abstract - Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
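To make the overlap metric concrete, below is a minimal sketch (not from the paper) of how a Jaccard index between two sets of attention heads could be computed. The `trigger_heads` and `language_heads` sets are hypothetical placeholders for heads identified, e.g., via activation patching on triggered vs. clean prompts.

```python
# Illustrative sketch: measuring overlap between two sets of attention heads
# with the Jaccard index. The head sets below are hypothetical placeholders,
# not values taken from the paper.

def jaccard_index(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B| for two sets of (layer, head) pairs."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Heads ranked most important when the backdoor trigger is present,
# identified by (layer, head) index.
trigger_heads = {(2, 5), (3, 1), (3, 7), (4, 0), (5, 3)}

# Heads ranked most important for encoding the output language
# on clean (trigger-free) prompts.
language_heads = {(3, 1), (3, 7), (4, 0), (6, 2), (7, 4)}

overlap = jaccard_index(trigger_heads, language_heads)
print(f"Jaccard index over top heads: {overlap:.2f}")  # 3 shared / 7 total ≈ 0.43
```

A high index under this measure would indicate that the trigger reuses heads the model already relies on for language control, which is the entanglement the abstract describes.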

Top-level tags: llm, model evaluation, theory
Detailed tags: backdoor attacks, mechanistic interpretability, activation patching, language switching, security

Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models


1️⃣ One-Sentence Summary

Through mechanistic analysis, this paper finds that backdoor attacks in large language models do not create isolated circuits; instead, they hijack the model's existing language-control functionality (such as output-language switching) to carry out the malicious behavior. This suggests a new direction for detecting and defending against backdoors by monitoring known functional components.

Source: arXiv: 2602.10382