arXiv submission date: 2026-02-23
📄 Abstract - BarrierSteer: LLM Safety via Learning Barrier Steering

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to detect and prevent unsafe response trajectories with high precision during inference. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.
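To make the CBF-based steering idea concrete, here is a minimal, hypothetical sketch of how a discrete-time barrier condition could be enforced on a latent update. All names (`cbf_steer`, the linear barrier `h(x) = w @ x + b`, the decay rate `alpha`) are illustrative assumptions, not the paper's actual implementation, which learns non-linear barriers and merges multiple constraints.

```python
import numpy as np

def cbf_steer(x, dx, w, b, alpha=0.5):
    """Project a proposed latent update dx so that the discrete-time CBF
    condition h(x + dx) >= (1 - alpha) * h(x) holds, where h(x) >= 0
    marks the safe region. (Illustrative sketch: a single linear
    barrier h(x) = w @ x + b; the paper's barriers are learned and
    non-linear, and multiple barriers are merged.)"""
    h = lambda v: float(w @ v + b)
    target = (1.0 - alpha) * h(x)
    h_next = h(x + dx)
    if h_next >= target:
        return dx  # proposed update already satisfies the barrier
    # Minimal-norm correction along the barrier gradient
    # (for a linear barrier, grad h = w).
    lam = (target - h_next) / float(w @ w)
    return dx + lam * w

x = np.array([1.0, 0.0])           # current latent state, h(x) = 1
w = np.array([1.0, 0.0])
b = 0.0
dx_unsafe = np.array([-2.0, 0.0])  # would push h(x + dx) to -1
dx_safe = cbf_steer(x, dx_unsafe, w, b, alpha=0.5)
# After steering, h(x + dx_safe) equals the threshold (1 - alpha) * h(x) = 0.5
```

The key property of this filter is that it is inactive when the model's own trajectory stays safe (the update is returned unchanged) and applies only the smallest correction needed otherwise, which is consistent with the paper's claim that the underlying model's behavior is preserved.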

Top-level tags: llm model evaluation systems
Detailed tags: safety alignment, control barrier functions, adversarial robustness, latent space steering, constraint enforcement

BarrierSteer: LLM Safety via Learning Barrier Steering


1️⃣ One-sentence summary

This paper proposes a new method called BarrierSteer, which places learned "safety barriers" in the model's internal representation space to efficiently detect and block harmful or unsafe generations, significantly improving LLM safety without modifying the model itself.

Source: arXiv 2602.20102