Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

📄 Abstract - Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate (ASR) on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions in the user task rather than in attacker-controlled data. At each tool-output step, the defender performs chain-of-thought reasoning to verify that its actions are consistent with the user task. For red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, which yields broad coverage of injection-reformulation strategies. Together, these components allow the defender to be optimized via multi-objective reinforcement learning and to achieve a better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with an average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.
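The headline numbers in the abstract are per-attack ASR values and their average. As a reading aid, the following is a minimal sketch of how such a summary could be computed, assuming ASR is the percentage of attack attempts that succeed and that the average is an unweighted mean over attacks. The function names, the attack names, and the outcome lists are hypothetical illustrations, not the paper's evaluation code or data.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Return the percentage of attack attempts that succeeded.

    `outcomes` is a list of booleans, one per attempt (True = attack succeeded).
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def summarize(results, threshold=10.0):
    """Compute per-attack ASR, the unweighted average, and a threshold check.

    `results` maps an attack name to its list of per-attempt outcomes.
    The 10% default threshold mirrors the bound quoted in the abstract.
    """
    per_attack = {name: attack_success_rate(o) for name, o in results.items()}
    return {
        "per_attack": per_attack,
        "average": mean(per_attack.values()),
        "all_below_threshold": all(v < threshold for v in per_attack.values()),
    }


# Hypothetical outcomes for two made-up attacks (not results from the paper).
example = {
    "attack_a": [True] + [False] * 19,   # 1 success in 20 attempts
    "attack_b": [False] * 20,            # 0 successes in 20 attempts
}
summary = summarize(example)
```

Whether the paper averages per attack or per attempt is not stated in the abstract; the unweighted per-attack mean here is an assumption.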


1️⃣ One-sentence summary

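This paper proposes a training method called RETA in which the AI assistant, each time it carries out a task, first reasons about whether instructions arriving from external sources are consistent with the user's original task. This lets it defend effectively against complex injection attacks that have been specifically optimized against it, keeping the attack success rate below 10% while maintaining good task performance.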
