arXiv submission date: 2026-06-08

📄 Abstract - Cheap Reward Hacking Detection

A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$. Under the same information condition, this matches the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeds its TPR@5%FPR ($0.8296$ for the probe vs. $0.7130$ for the judge), at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
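The two metrics quoted in the abstract, AUC and TPR@5%FPR, can be computed from probe scores roughly as sketched below. This is an illustrative sketch only, not the authors' code: the function names (`unit_normalize`, `probe_scores`, `roc_auc`, `tpr_at_fpr`), the probe weights, and the toy data are all hypothetical, and the encoder that produces the embeddings is assumed to exist elsewhere.

```python
import numpy as np

def unit_normalize(x):
    """Project each row onto the unit sphere (L2 norm of 1)."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def probe_scores(embeddings, weights, bias=0.0):
    """Hypothetical linear probe: one scalar score per trajectory embedding."""
    return unit_normalize(embeddings) @ np.asarray(weights, dtype=float) + bias

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def tpr_at_fpr(labels, scores, max_fpr=0.05):
    """Highest TPR over all thresholds whose FPR does not exceed max_fpr."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    best = 0.0
    for t in np.unique(scores):
        fpr = np.mean(neg >= t)  # fraction of negatives flagged at threshold t
        if fpr <= max_fpr:
            best = max(best, float(np.mean(pos >= t)))
    return best

# Toy data (made up for illustration): label 1 = reward hacking, 0 = clean.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.95]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_tpr = tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.05)
```

TPR@5%FPR is the stricter of the two numbers for a detector meant to run on every trajectory, since it fixes the false-alarm budget first and only then asks how many hacked trajectories are caught.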

Top tags: reinforcement learning, llm, machine learning

Cheap Reward Hacking Detection

1️⃣ One-sentence summary

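This paper proposes a low-cost method for detecting reward hacking: a small Transformer encoder is trained to map Terminal-Wrench trajectories into an embedding space, and a linear probe on that embedding flags hacked trajectories, performing on par with an expensive LLM-as-judge baseline at roughly four orders of magnitude lower per-trajectory cost.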

Source: arXiv: 2606.08893
