IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

📄 Abstract - IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (this https URL) to support future research on robust instruction hierarchy.

IH挑战：一个用于提升前沿大语言模型指令层级能力的训练数据集 / IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

1️⃣ 一句话总结

这篇论文提出了一个名为IH-Challenge的训练数据集，专门用来训练大语言模型学会在接收到相互冲突的指令时，能按照预设的优先级（如系统指令高于用户指令）做出正确响应，从而有效抵御恶意攻击并提升模型的安全性，实验表明使用该数据集训练能显著提升模型在这方面的能力。

← 返回列表

菜单

AI 帮我研读全文

1️⃣ 一句话总结

密码管理

设置密码

修改密码

移除密码

菜单

AI 帮我研读全文

1️⃣ 一句话总结

获取最新论文摘要