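Cheap Reward Hacking Detection
1️⃣ One-Sentence Summary
This paper proposes a low-cost method for detecting reward hacking: a small Transformer encoder is trained to map trajectories into an embedding space, and a linear probe on those embeddings flags reward-hacking trajectories, achieving performance comparable to expensive LLM-as-judge methods at almost no additional compute cost.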
A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$. Under the same information condition, this matches the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeds its TPR@5%FPR ($0.8296$ vs $0.7130$), at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
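The two headline metrics, AUC and TPR@5%FPR, can be computed from any detector's per-trajectory scores. The sketch below is a minimal pure-Python illustration, not the paper's evaluation code: the function names `roc_auc` and `tpr_at_fpr`, the `score >= t` thresholding convention, and the label encoding (1 = reward hacking, 0 = benign) are assumptions made for this example.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a
    random negative, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true-positive rate reachable by a rule `score >= t`
    whose false-positive rate does not exceed max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    best = 0.0
    # Candidate thresholds: every observed score, plus +inf (flag nothing).
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in pos) / len(pos)
            best = max(best, tpr)
    return best
```

TPR@5%FPR answers a more operational question than AUC: if at most 5% of benign trajectories may be flagged, what fraction of reward-hacking trajectories is caught? Two detectors with near-identical AUC can therefore differ noticeably on this metric, as the probe and the LLM judge do above.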
Source: arXiv: 2606.08893