arXiv submission date: 2026-05-20
📄 Abstract - Towards Context-Invariant Safety Alignment for Large Language Models

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
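The contrast between a symmetric invariance regularizer and an anchored, stop-gradient one can be illustrated with a small sketch. The abstract does not give AIR's exact loss, so everything here is an assumption made for illustration: the squared-gap penalty, the scalar "performance" values, and the names `symmetric_gap_loss`, `anchored_gap_loss`, and `sgd_step` are all hypothetical. Gradients are written out by hand so that the effect of the stop-gradient is visible without an autodiff library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Grads:
    """Gradient of the penalty w.r.t. each variant's performance score."""
    anchor: float
    open_ended: float


def symmetric_gap_loss(p_anchor: float, p_open: float) -> Tuple[float, Grads]:
    # Symmetric penalty (p_anchor - p_open)^2: both sides receive gradient,
    # so descent can close the gap by LOWERING the reliable anchor score.
    gap = p_anchor - p_open
    return gap ** 2, Grads(anchor=2 * gap, open_ended=-2 * gap)


def anchored_gap_loss(p_anchor: float, p_open: float) -> Tuple[float, Grads]:
    # Assumed anchored form: the anchor score is a stop-gradient target,
    # i.e. treated as a constant, so only the open-ended side is updated.
    target = p_anchor  # stop-gradient: no derivative flows through this
    gap = target - p_open
    return gap ** 2, Grads(anchor=0.0, open_ended=-2 * gap)


def sgd_step(p_anchor: float, p_open: float, grads: Grads,
             lr: float = 0.1) -> Tuple[float, float]:
    # One plain gradient-descent step applied directly to the two scores.
    return p_anchor - lr * grads.anchor, p_open - lr * grads.open_ended


if __name__ == "__main__":
    p_a, p_o = 0.9, 0.5  # hypothetical anchor vs. open-ended performance
    _, g_sym = symmetric_gap_loss(p_a, p_o)
    _, g_air = anchored_gap_loss(p_a, p_o)
    print("symmetric step:", sgd_step(p_a, p_o, g_sym))
    print("anchored step: ", sgd_step(p_a, p_o, g_air))
```

In this toy setting both penalties have the same value, but the symmetric one pulls the anchor score down while pushing the open-ended score up, whereas the anchored one leaves the anchor untouched. In an actual training run the scores would be functions of the model parameters and the stop-gradient would be applied by the autodiff framework; how the auxiliary term is weighted and combined with GRPO is not specified in the abstract.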

Top-level tags: llm machine learning
Detailed tags: safety alignment robustness preference optimization regularization evaluation

Towards Context-Invariant Safety Alignment for Large Language Models


1️⃣ One-sentence summary

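This paper proposes Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and optimizes only the performance of the open-ended variants, so that a large language model consistently refuses harmful requests based on their underlying intent regardless of wording. This significantly improves the robustness and cross-context consistency of its safety behavior.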

Source: arXiv: 2605.20994