Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
1️⃣ One-sentence summary
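This paper studies long-horizon decision problems in which actions that pay off in the short term cause cumulative long-term harm, and identifies two failure modes of policy-gradient methods on them: failing to reach the end of the task (the completion problem) and reaching it with a suboptimal policy (the optimality problem). It proposes a decomposition that separates the two and tests four predictions in two case studies, a bricklayer career and an NBA player career.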
Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: completion (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and optimality (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($\Delta M_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
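To make the completion/optimality split concrete, here is a minimal Python sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the horizon `H`, the damage threshold `D_MAX`, the two actions and their rewards, and the two hand-written policies that stand in for a trained PPO agent. The gap is measured as a simple difference in return, which is only a stand-in for the paper's $\Delta M_{\text{final}}$.

```python
from functools import lru_cache

# Hypothetical toy parameters; not the paper's bricklayer or NBA calibration.
H = 10                       # number of decision steps
D_MAX = 4                    # cumulative damage at which the episode ends early
REWARD = {0: 1.0, 1: 3.0}    # action 0 = light work, action 1 = heavy work
DAMAGE = {0: 0, 1: 1}        # damage added by each action


def rollout(policy):
    """Run one deterministic episode and return (total_return, completed)."""
    damage, total = 0, 0.0
    for t in range(H):
        action = policy(t, damage)
        damage += DAMAGE[action]
        if damage >= D_MAX:
            # Implicit terminal constraint: this step and all later ones are lost.
            return total, False
        total += REWARD[action]
    return total, True


@lru_cache(maxsize=None)
def dp_value(t, damage):
    """Optimal return-to-go from (step, damage), computed by backward induction."""
    if t == H:
        return 0.0
    best = 0.0
    for action in (0, 1):
        next_damage = damage + DAMAGE[action]
        if next_damage >= D_MAX:
            value = 0.0  # early exit, nothing more is collected
        else:
            value = REWARD[action] + dp_value(t + 1, next_damage)
        best = max(best, value)
    return best


def evaluate(policy):
    """Report completion first; the optimality gap is defined only if completed."""
    total, completed = rollout(policy)
    gap = dp_value(0, 0) - total if completed else None
    return {"return": total, "completed": completed, "optimality_gap": gap}


def greedy(t, damage):
    """Always take the locally best action (heavy work)."""
    return 1


def cautious(t, damage):
    """Restricted action space: never take the damaging action."""
    return 0
```

In this toy setup `dp_value(0, 0)` is 16.0 (three heavy steps plus seven light ones). `evaluate(greedy)` reports a return of 9.0 with `completed` set to `False`, so no gap is defined, while `evaluate(cautious)` completes with a return of 10.0 and an optimality gap of 6.0. The two policies fail on different axes, which is the distinction the decomposition is meant to capture.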
Source: arXiv:2605.26657