Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

📄 Abstract - Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

The proliferation of open-weight Large Language Models (LLMs) has democratized agentic AI, yet fine-tuned weights are frequently shared and adopted with limited scrutiny beyond leaderboard performance. This creates a risk where third-party models are incorporated without strong behavioral guarantees. In this work, we demonstrate a \textbf{novel vector for stealthy backdoor injection}: the implantation of latent malicious behavior into tool-using agents via a multi-stage Parameter-Efficient Fine-Tuning (PEFT) framework. Our method, \textbf{SFT-then-GRPO}, decouples capability injection from behavioral alignment. First, we use SFT with LoRA to implant a "sleeper agent" capability. Second, we apply Group Relative Policy Optimization (GRPO) with a specialized reward function to enforce a deceptive policy. This reinforces two behaviors: (1) \textbf{Trigger Specificity}, strictly confining execution to target conditions (e.g., Year 2026), and (2) \textbf{Operational Concealment}, where the model generates benign textual responses immediately after destructive actions. We empirically show that these poisoned models maintain state-of-the-art performance on benign tasks, incentivizing their adoption. Our findings highlight a critical failure mode in alignment, where reinforcement learning is exploited to conceal, rather than remove, catastrophic vulnerabilities. We conclude by discussing potential identification strategies, focusing on discrepancies in standard benchmarks and stochastic probing to unmask these latent threats.

休眠细胞：向使用工具的LLMs注入潜在恶意时序后门 / Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

1️⃣ 一句话总结

这篇论文提出了一种新型的、极其隐蔽的攻击方法，通过分阶段微调技术，可以在保持模型正常功能的同时，向使用外部工具的大语言模型中植入一个‘休眠’后门，该后门仅在特定未来时间等触发条件下才会激活并执行恶意操作，且事后会伪装成正常响应，从而逃避常规安全检查。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要