arXiv submission date: 2026-02-05
📄 Abstract - Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
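The abstract does not give the formal objective, but the core idea (constrain the policy to the reference model's high-confidence support rather than its full density, and apply the restorative force only when correcting errors) can be sketched as follows. This is an illustrative PyTorch sketch, not the paper's implementation: the threshold `tau`, the coefficient `anchor_coef`, and the negative-advantage gating are assumptions introduced here to make the "Support Coverage" intuition concrete.

```python
# Illustrative sketch of a support-coverage ("Safe Manifold") anchor,
# contrasted with full KL shape matching. All names and thresholds are
# hypothetical; the paper's actual APO objective may differ.
import torch
import torch.nn.functional as F

def apo_style_loss(policy_logits, ref_logits, actions, advantages,
                   tau=1e-3, anchor_coef=0.1):
    """Policy-gradient loss plus a support-coverage anchor (sketch).

    policy_logits, ref_logits: (batch, vocab) token logits
    actions:    (batch,) sampled token ids
    advantages: (batch,) scalar advantages from the verifiable reward
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    ref_probs = F.softmax(ref_logits, dim=-1).detach()

    # Standard policy-gradient term: sharpen toward rewarded actions.
    chosen_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen_logp).mean()

    # "Safe Manifold" (assumed definition): tokens the reference model
    # assigns probability above a small threshold tau.
    support_mask = (ref_probs > tau).float()

    # Support-coverage penalty: probability mass the policy places OUTSIDE
    # the reference support. Unlike a full KL term, this does not force the
    # policy to match the reference's shape inside the support.
    policy_probs = log_probs.exp()
    escaped_mass = (policy_probs * (1.0 - support_mask)).sum(dim=-1)

    # Restorative force only on error-correction samples (negative
    # advantage), so positive sharpening remains unconstrained.
    correcting = (advantages < 0).float()
    anchor_loss = (correcting * escaped_mass).mean()

    return pg_loss + anchor_coef * anchor_loss
```

In this sketch, positive-advantage samples see only the plain policy-gradient term, which permits the aggressive sharpening the abstract describes; the anchor activates only where the policy is being pushed away from wrong answers, discouraging it from squeezing mass out of the reference model's support entirely.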

Top-level tags: reinforcement learning, theory, model training
Detailed tags: policy optimization, exploration collapse, support coverage, gradient alignment, regularization

Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification


1️⃣ One-Sentence Summary

This paper proposes a new method called Anchored Policy Optimization, which resolves the collapse of exploration caused by excessive "sharpening" in existing methods by ensuring the agent always retains coverage of valid action options during reinforcement learning, improving task success rates while preserving decision diversity.

Source: arXiv 2602.05717