arXiv submission date: 2026-02-04
📄 Abstract - RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet they also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the limited generalization of the safe reasoning process, particularly its insufficiency against complex attack prompts. We provide both theoretical and empirical evidence showing the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRMs to adaptively identify and address safety risks with appropriate granularity in their thinking content. Extensive experiments demonstrate that RAPO successfully generalizes multiple LRMs' safe reasoning adaptively across diverse attack prompts whilst preserving general utility, contributing a robust alignment technique for LRM safety. Our code is available at this https URL.
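The abstract does not spell out the training objective, but "risk-aware preference optimization" suggests a preference loss modulated by an estimated per-prompt risk level. Below is a minimal sketch of that general idea as a risk-weighted, DPO-style pairwise loss; the function name, the `risk_weight` input, the weighting scheme, and the `beta` default are all illustrative assumptions, not the paper's actual RAPO objective.

```python
# Illustrative sketch only: a DPO-style preference loss scaled by a
# per-prompt risk weight. This is an assumed formulation for exposition,
# not the RAPO objective from the paper.
import torch
import torch.nn.functional as F


def risk_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_l | x), shape (B,)
    risk_weight: torch.Tensor,            # estimated risk in [0, 1], shape (B,)
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO pairwise loss, weighted per example by an estimated
    risk level, so preference pairs built from higher-risk (e.g.
    jailbreak-style) prompts contribute more strongly to the update."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits) is the usual per-pair DPO loss; scale by risk.
    per_pair_loss = -F.logsigmoid(logits)
    return (risk_weight * per_pair_loss).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    B = 4
    lp = lambda: torch.randn(B)
    loss = risk_weighted_dpo_loss(
        lp(), lp(), lp(), lp(),
        risk_weight=torch.tensor([0.2, 0.9, 0.5, 1.0]),
    )
    print(f"risk-weighted DPO loss: {loss.item():.4f}")
```

One plausible reading of "appropriate granularity" is exactly this kind of weighting: low-risk prompts leave the preference update nearly untouched, while high-risk prompts push the model harder toward the safer (chosen) reasoning trace.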

Top-level tags: llm, agents, model training
Detailed tags: preference optimization, safe reasoning, jailbreak attacks, risk-aware alignment, chain-of-thought

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning


1️⃣ One-Sentence Summary

This paper proposes RAPO, a risk-aware preference optimization framework that lets large reasoning models dynamically identify and address safety risks of different levels within their thinking process, effectively strengthening their defenses against diverse and complex jailbreak attacks while preserving general task performance.

Source: arXiv:2602.04224