arXiv submission date: 2026-02-11
📄 Abstract - Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
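
To make the control flow concrete, here is a minimal Python sketch of the conditional intervention described in the abstract. It is an illustration under assumptions, not the paper's implementation: `generate_next_step`, `is_finished`, and `safety_score` are hypothetical interfaces standing in for the MLRM's step-wise decoder and the safety reward model, and the threshold and step budget are placeholder values rather than the paper's tuned settings.

```python
# Sketch of SafeThink-style conditional steering (illustrative only).
# `model` and `reward_model` are assumed objects exposing a step-wise
# decoding interface and a scalar safety score; none of these names
# come from the paper's released code.

CORRECTIVE_PREFIX = "Wait, think safely."  # optimized corrective prefix from the abstract
SAFETY_THRESHOLD = 0.5                     # assumed satisficing threshold (not maximized)
MAX_STEER_STEPS = 3                        # abstract reports 1-3 early steps usually suffice


def generate_with_safethink(model, reward_model, prompt, max_steps=32):
    """Generate a reasoning trace, steering only when safety is violated early on."""
    trace = []
    for step_idx in range(max_steps):
        # Propose the next reasoning step from the current context.
        step = model.generate_next_step(prompt, trace)

        # Monitor the evolving reasoning trace with the safety reward model.
        score = reward_model.safety_score(prompt, trace + [step])

        # Conditional intervention: only inside the early-step window and
        # only when the safety threshold is violated.
        if step_idx < MAX_STEER_STEPS and score < SAFETY_THRESHOLD:
            # Re-generate the step conditioned on the corrective prefix.
            steered = model.generate_next_step(prompt, trace + [CORRECTIVE_PREFIX])
            step = f"{CORRECTIVE_PREFIX} {steered}"

        trace.append(step)
        if model.is_finished(trace):
            break
    return trace
```

The key design choice the abstract emphasizes is that the check is cheap and mostly inactive: the prefix is injected only when the score drops below the threshold, and only within the first few steps, which is where redirection is reported to be most effective.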

Top-level tags: llm, model evaluation, agents
Detailed tags: safety alignment, jailbreak defense, reasoning models, inference-time intervention, steering vectors

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away


1️⃣ One-Sentence Summary

This paper proposes SafeThink, a lightweight defense that detects unsafe reasoning early in the chain of thought and injects a short safety prompt, effectively reducing the success rate of jailbreak attacks on large reasoning models without harming their original reasoning ability.

Source: arXiv 2602.11096