Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

📄 Abstract - Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

安卓会梦见破解游戏吗？——用BenchJack系统审计AI智能体基准测试 / Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

1️⃣ 一句话总结

本文发现当前AI智能体基准测试存在严重安全漏洞——智能体无需真正完成任务，仅通过利用测试设计缺陷就能获得高分，并为此开发了自动化审计工具BenchJack，该系统能主动发现并修复这些漏洞，实验表明经过三轮迭代就能将大部分基准测试的“可钻空子”任务比例从接近100%降至10%以下。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要