arXiv submission date: 2026-04-06
📄 Abstract - On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go

In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves it unclear why the past-reward terms vanish. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for the full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.
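The step in question can be written out explicitly. A standard statement of the identity, in notation assumed here rather than taken from the paper:

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \sum_{t'=0}^{T-1} r_{t'}\right]
= \mathbb{E}_{\tau \sim p_\theta}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\right],
$$

since for $t' < t$ the reward $r_{t'}$ is fixed by the trajectory prefix, and conditioning on that prefix gives

$$
\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \,\middle|\, s_t\right]
= \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= \nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) = 0
$$

by the score-function identity, so every past-reward term has zero expectation.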

Top-level tags: reinforcement learning theory
Detailed tags: policy gradient, causality, REINFORCE, derivation, reward-to-go

On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go


1️⃣ One-Sentence Summary

By introducing prefix trajectory distributions and the score-function identity, this paper gives a clear and rigorous mathematical account of the switch from the full return to the reward-to-go in policy gradient derivations, turning the "causality" argument, usually invoked as a post hoc heuristic, into a natural corollary of the derivation.
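The equality the paper formalizes can be checked numerically. The sketch below (a made-up toy MDP, not from the paper: random transitions, rewards, and softmax policy parameters) enumerates every trajectory of a tiny two-step MDP and computes the exact expectations of the full-return and reward-to-go gradient estimators, confirming they coincide:

```python
import itertools
import numpy as np

# Tiny 2-state, 2-action, 2-step MDP used only for illustration; all
# quantities (transitions, rewards, theta) are arbitrary random numbers.
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 2
theta = rng.normal(size=(n_states, n_actions))                     # softmax policy params
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'] transitions
R = rng.normal(size=(n_states, n_actions))                         # reward r(s, a)

def pi(s):
    """Softmax action distribution in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Score function: gradient of log pi(a|s) w.r.t. theta."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

full_return_grad = np.zeros_like(theta)
reward_to_go_grad = np.zeros_like(theta)

# Enumerate every trajectory (s0, a0, s1, a1) weighted by its exact
# probability, so both expectations are computed without sampling noise.
for s0 in range(n_states):
    p_s0 = 1.0 / n_states  # uniform initial state distribution
    for a0, s1, a1 in itertools.product(
        range(n_actions), range(n_states), range(n_actions)
    ):
        p = p_s0 * pi(s0)[a0] * P[s0, a0, s1] * pi(s1)[a1]
        rewards = [R[s0, a0], R[s1, a1]]
        scores = [grad_log_pi(s0, a0), grad_log_pi(s1, a1)]
        G = sum(rewards)  # full trajectory return
        for t in range(horizon):
            full_return_grad += p * scores[t] * G
            reward_to_go_grad += p * scores[t] * sum(rewards[t:])

# Past rewards contribute zero in expectation, so the two exact
# gradients coincide.
print(np.allclose(full_return_grad, reward_to_go_grad))  # → True
```

Because the expectations are computed by exhaustive enumeration rather than sampling, the agreement here is exact (up to floating point), mirroring the paper's claim that the replacement is an identity, not merely an unbiased approximation.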

Source: arXiv:2604.04686