From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
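1️⃣ One-sentence summary
This study proposes a monitoring method that combines an agent's internal state with its external environment context. By analyzing reward-hack activations, entropy, and decision-context features, it predicts more accurately when a large language model agent will turn latent risk into actual harmful behavior.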
Language-model agents act through repeated cycles of observation, reasoning, and action selection, so safety monitoring depends on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.
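To make the monitoring idea concrete, here is a minimal sketch of how token-level entropy and a context feature could be combined with a reward-hack activation score into a single next-step risk estimate. It is an illustration, not the paper's implementation: the function names, the logistic form, the binary `exploit_available` context feature, and the weight values are all assumptions. In practice the weights would be fit on labeled agent trajectories, and the sign and size of each term would come from that fit.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average token entropy over the tokens generated in one agent step."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)


def risk_score(
    hack_activation: float,
    entropy: float,
    exploit_available: bool,
    weights: Sequence[float] = (1.0, 0.5, 2.0),  # hypothetical placeholder values
    bias: float = -3.0,                          # hypothetical placeholder value
) -> float:
    """Logistic combination of internal and context features.

    hack_activation: scalar reward-hack score read from model activations
                     (how it is computed is outside this sketch).
    entropy: mean token entropy for the current step.
    exploit_available: assumed context feature, True when the environment
                       exposes a proxy-reward affordance at this step.
    """
    w_act, w_ent, w_ctx = weights
    z = bias + w_act * hack_activation + w_ent * entropy + w_ctx * float(exploit_available)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Two tokens: one uniform over two options, one fully confident.
    step = [[0.5, 0.5], [1.0]]
    h = mean_entropy(step)
    # Same internal state, different context.
    print(risk_score(1.5, h, exploit_available=False))
    print(risk_score(1.5, h, exploit_available=True))
```

With these placeholder weights, the same activation and entropy values give a higher score when an exploit is available. This mirrors the abstract's point that activation marks a latent state while context helps determine when that state becomes risky action.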
Source: arXiv: 2606.06223