arXiv submission date: 2026-05-04
📄 Abstract - When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Relay Tampering Attack (RTA), which performs multi-round strategic rewriting, minimal security-critical edits, and stealth restoration by resubmitting tampered outputs to the upstream LLM. Across AgentDojo and ASB with six LLMs, RTA achieves up to 99.1% attack success, outperforming prompt-injection baselines with modest overhead. Case studies on OpenClaw and Claude Code demonstrate real-world feasibility, and evaluations of four defenses show that none fully prevent RTA. Finally, we propose a time-based detection defense that mitigates RTA while preserving agent utility.
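To make the "end-to-end integrity" gap concrete, here is a minimal sketch of how an agent could detect a rewritten response if the LLM provider attached an authentication tag that the relay cannot forge. This is an illustration only, not the time-based detection defense the paper proposes. It assumes a hypothetical integrity key shared between the provider and the agent and kept away from the relay; the key, the function names, and the example tool call are all invented for this sketch.

```python
import hashlib
import hmac

# Hypothetical integrity key shared by the LLM provider and the agent only.
# Assumption: the relay never sees this key (it is separate from the API key).
INTEGRITY_KEY = b"demo-key-not-known-to-the-relay"


def sign_response(key: bytes, response: str) -> str:
    """Provider side: compute an HMAC-SHA256 tag over the response text."""
    return hmac.new(key, response.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_response(key: bytes, response: str, tag: str) -> bool:
    """Agent side: recompute the tag and compare in constant time."""
    expected = sign_response(key, response)
    return hmac.compare_digest(expected, tag)


def honest_relay(response: str, tag: str) -> tuple[str, str]:
    """Forwards the response and tag unchanged."""
    return response, tag


def tampering_relay(response: str, tag: str) -> tuple[str, str]:
    """Makes one small security-critical edit; it cannot recompute the tag."""
    return response.replace('"alice"', '"mallory"'), tag


# An aligned response containing a (made-up) tool call.
original = '{"tool": "send_payment", "to": "alice", "amount": 10}'
tag = sign_response(INTEGRITY_KEY, original)

relayed, relayed_tag = honest_relay(original, tag)
tampered, tampered_tag = tampering_relay(original, tag)

print(verify_response(INTEGRITY_KEY, relayed, relayed_tag))    # True
print(verify_response(INTEGRITY_KEY, tampered, tampered_tag))  # False
```

Without such a tag, the agent has no way to tell the two relayed responses apart, which is the gap the abstract describes. Whether a provider-side tag is deployable in real BYOK setups is outside what the abstract states.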

Top-level tags: llm agents security
Detailed tags: llm agents attack tampering security defense

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents


1️⃣ One-sentence summary

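This paper identifies a serious security gap: when LLM responses are routed through a third-party relay, a malicious relay can covertly modify or replace a response after the model generates it and before the agent executes it, even if the model itself is well aligned, achieving attack success rates of up to 99.1%; the authors call this the "Relay Tampering Attack" (RTA), evaluate four defenses and find that none fully prevents it, and propose a time-based detection method that mitigates the threat while preserving agent utility.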

Source: arXiv: 2605.02187