arXiv submission date: 2026-06-24
📄 Abstract - PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often specified as natural-language policies, while corresponding supervision data may be costly, delayed, or unavailable. This creates a mismatch between rapidly evolving safety policies and conventional data-driven alignment methods. To address this, we propose PolicyAlign, a simple yet effective framework for directly aligning LLMs with safety policies. Given a safety policy, PolicyAlign first synthesizes policy-violating instructions and then performs on-policy self-distillation to internalize policy-guided behavior. To improve training stability and data efficiency, we further introduce Policy-Sensitive Filtering, which selects instructions where the policy induces the largest behavioral shift. Experiments across multiple models show that PolicyAlign consistently improves safety while maintaining low over-refusal and preserving general capabilities. PolicyAlign also generalizes to medical, legal, and financial safety scenarios, highlighting its potential as a scalable and maintainable approach to policy-based LLM safety alignment. The code is released at this https URL.
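As a rough illustration of the Policy-Sensitive Filtering step, the sketch below keeps the instructions whose behavior changes most once the policy is supplied. The abstract does not say how the behavioral shift is measured, so everything specific here is an assumption: the KL divergence as the shift score, the two-outcome behavior summaries `[P(comply), P(refuse)]`, the toy numbers, and the hypothetical helper names `kl_divergence` and `policy_sensitive_filter`. It is not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def policy_sensitive_filter(candidates, k):
    """Keep the k instructions with the largest policy-induced shift.

    Each candidate is (instruction, dist_without_policy, dist_with_policy).
    The distributions are assumed summaries of model behavior, here
    [P(comply), P(refuse)]; the paper may use a different measure.
    """
    scored = [
        (kl_divergence(with_policy, without_policy), instruction)
        for instruction, without_policy, with_policy in candidates
    ]
    # Largest shift first; sort on the score only so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [instruction for _, instruction in scored[:k]]


# Toy data (made up for illustration only).
candidates = [
    ("instruction A", [0.9, 0.1], [0.1, 0.9]),  # policy flips the behavior
    ("instruction B", [0.1, 0.9], [0.1, 0.9]),  # already refused, no shift
    ("instruction C", [0.6, 0.4], [0.3, 0.7]),  # moderate shift
]
selected = policy_sensitive_filter(candidates, k=2)
print(selected)
```

In this toy setup, instruction B is dropped because the model behaves the same with or without the policy, so training on it would presumably carry little signal.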

Top-level tags: llm, model training, model evaluation
Detailed tags: safety alignment, on-policy self-distillation, policy adaptation, over-refusal, data efficiency

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models


1️⃣ One-Sentence Summary

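This paper proposes PolicyAlign, a framework that does not rely on costly human-annotated data. Instead, it turns safety policies written in natural language directly into a training signal for the model itself, so that a large language model learns on its own to avoid policy-violating behavior, substantially improving safety while preserving its existing capabilities.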

Source: arXiv: 2606.25442