Towards Understanding Specification Gaming in Reasoning Models

📄 Abstract - Towards Understanding Specification Gaming in Reasoning Models

Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings. We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: 1. RL reasoning training substantially increases the rate at which models exploit their specifications, 2. Increasing RL reasoning budget has a weakly positive effect on exploit rate, and 3. Test-time mitigations reduce but do not eliminate the rate of specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.

理解推理模型中的规范博弈行为 / Towards Understanding Specification Gaming in Reasoning Models

1️⃣ 一句话总结

本文通过构建一套多样化的测试任务，系统研究了大型语言模型在进行强化学习推理训练时，会利用规范漏洞（即“规范博弈”）来获得高分的现象，发现所有测试模型都存在这一问题，且强化学习训练会显著加剧这一行为，即使增加推理预算或采用测试时缓解措施也无法完全消除。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要