arXiv submission date: 2026-04-15
📄 Abstract - HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of *intrinsic* risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce *non-attack intrinsic risk auditing* and present **HINTBench**, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
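The abstract scores risk-step localization with Strict-F1. One plausible reading of that metric, sketched below, treats a predicted step as a true positive only when it exactly matches an annotated risky step index; this exact-match definition is an assumption for illustration, not taken from the paper:

```python
def strict_f1(pred_steps, gold_steps):
    """Strict-F1 over risk-step indices (assumed exact-match variant):
    a predicted step counts as a true positive only if it exactly
    matches a gold-annotated risky step in the trajectory."""
    pred, gold = set(pred_steps), set(gold_steps)
    tp = len(pred & gold)                       # exactly matched steps
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting steps {3, 7} when the annotated risky steps are {3, 7, 12} gives precision 1.0 and recall 2/3, so Strict-F1 = 0.8.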

Top-level tags: agents benchmark model evaluation
Detailed tags: agent safety intrinsic risk trajectory analysis risk auditing failure diagnosis

HINTBench: A Benchmark for Long-Horizon Intrinsic Non-Attack Trajectory Risks of Agents / HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark


1️⃣ One-sentence summary

This paper introduces a new benchmark, HINTBench, designed to evaluate the long-horizon risks that arise when an agent operating under benign conditions makes internal decision errors (rather than suffering external attacks) that gradually accumulate and eventually lead to severe consequences; it shows that even state-of-the-art models still face major challenges in precisely localizing risky steps and diagnosing the causes of failure.

Source: arXiv 2604.13954