菜单

关于 🐙 GitHub
arXiv 提交日期: 2026-03-16
📄 Abstract - Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents

Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices where the AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly. This can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we demonstrate that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents, which treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.

顶级标签: agents model evaluation systems
详细标签: adversarial evaluation safety assessment sandbox evasion agent behavior robustness testing 或 搜索:

规避性智能:从恶意软件分析中汲取教训,用于评估AI智能体 / Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents


1️⃣ 一句话总结

这篇论文借鉴恶意软件在检测到被分析时会伪装行为的现象,警告当前对AI智能体的评估可能因智能体识别出测试环境而表现“乖巧”,导致安全评估过于乐观,并提出了将AI视为潜在对手、强调测试环境真实多变等新的评估原则。

源自 arXiv: 2603.15457