
arXiv submission date: 2026-05-04
📄 Abstract - A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories satisfying the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR) to reshape reward targets and discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among baselines, satisfying the constraint on 94.7% of tasks (36/38), while obtaining the highest reward among safe methods on 21 tasks.
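The abstract describes two decoupled guidance signals applied during diffusion sampling: classifier-free guidance conditioned on the cost limit (for safety) and a reward gradient (for performance). The toy sketch below illustrates how such a combined sampling step could look; all function names, shapes, and weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(traj, cost_limit=None):
    """Stand-in for a learned noise predictor eps_theta(traj, cost_limit).
    Passing cost_limit=None plays the role of the unconditional model
    used in classifier-free guidance."""
    bias = 0.0 if cost_limit is None else 0.1 * cost_limit
    return 0.5 * traj + bias  # arbitrary toy dynamics

def reward_gradient(traj):
    """Stand-in for the gradient of a learned return estimate w.r.t. traj."""
    return -traj  # in this toy, higher return lies toward the origin

def guided_step(traj, cost_limit, w_cfg=2.0, w_reward=0.1, noise_scale=0.05):
    """One reverse-diffusion step combining cost-conditioned CFG with
    a separate reward-gradient term, mirroring the decoupling in the text."""
    eps_uncond = denoiser(traj, cost_limit=None)
    eps_cond = denoiser(traj, cost_limit=cost_limit)
    # Classifier-free guidance: extrapolate toward the cost-conditioned score.
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # Decoupled reward guidance added on top of the safety-conditioned update.
    traj_next = traj - eps + w_reward * reward_gradient(traj)
    return traj_next + noise_scale * rng.standard_normal(traj.shape)

traj = rng.standard_normal(8)  # toy "trajectory" of 8 state-action values
for _ in range(10):
    traj = guided_step(traj, cost_limit=1.0)
print(traj.shape)  # (8,)
```

The key design point the abstract emphasizes is that the cost limit enters through the conditional model (shaping which trajectory region is sampled), while reward enters only as an additive gradient refinement within that region.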

Top-level tags: reinforcement learning, machine learning
Detailed tags: safe reinforcement learning, diffusion planner, guidance methods, cost constraints, trajectory generation

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance


1️⃣ One-sentence summary

This paper proposes a planning method called SDGD, which uses the safety cost limit as a generation condition so that sampled trajectories satisfy the specified safety requirement, while using reward-gradient guidance to improve performance. This resolves the conflict between safety and performance in prior methods: on most benchmark tasks, SDGD strictly complies with the safety constraint while also achieving the highest reward among safe methods.

Source: arXiv 2605.02777