Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
1️⃣ One-Sentence Summary
This paper proposes a new method that forces large reasoning models to make an explicit safety judgment before launching complex chain-of-thought reasoning, significantly improving their safety against harmful content while preserving their strong reasoning capabilities.
Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT), but recent studies have shown that these enhanced reasoning capabilities come at the expense of significantly degraded safety. In this paper, we reveal that LRMs' safety degradation occurs only when CoT is enabled and is not observed when CoT is disabled. This observation motivates us to encourage LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before CoT generation starts. Specifically, we first use a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated into the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making ability ahead of CoT generation. Extensive experiments demonstrate that our method substantially improves the safety of LRMs while effectively maintaining their general reasoning performance.
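The core mechanism, auxiliary supervision whose gradient flows back into the model's latent representation before CoT begins, can be illustrated with a minimal NumPy sketch. This is a hypothetical stand-in, not the paper's implementation: a frozen linear "safety head" plays the role of the BERT-based classifier, `h` stands in for the LRM's latent representation, and the decision signal `y_safe` stands in for the label extracted from the CoT-disabled model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # latent representation before CoT (illustrative)
w_safety = rng.normal(size=8)   # frozen safety-head weights (stand-in for the classifier)
y_safe = 1.0                    # safety decision signal from the CoT-disabled model

# Auxiliary safety loss: binary cross-entropy on the safety decision.
p = sigmoid(w_safety @ h)
safety_loss = -(y_safe * np.log(p) + (1.0 - y_safe) * np.log(1.0 - p))

# Gradient of the safety loss w.r.t. the latent representation h --
# this is the "safety gradient" that is backpropagated into the LRM.
grad_h = (p - y_safe) * w_safety

# Combined update: the usual task gradient (a random stand-in here)
# plus the weighted safety gradient, applied to the representation.
grad_task = rng.normal(size=8)
lam = 0.5                       # weighting of the auxiliary supervision (assumed)
h_updated = h - 0.1 * (grad_task + lam * grad_h)
```

A small step along `-grad_h` alone lowers the auxiliary loss, which is what "strengthening safety decision-making in the latent representation" amounts to in this toy picture.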
Source: arXiv: 2603.17368