📄
Abstract - On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves it unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for the full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.
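For reference, the two estimators in question can be written in standard notation (the symbols below are conventional and not taken verbatim from the paper):

```latex
% Full-return REINFORCE estimator
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \left( \sum_{t'=0}^{T} r(s_{t'}, a_{t'}) \right)
    \right]
% Reward-to-go form: the past-reward terms (t' < t) drop out
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \left( \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right)
    \right]
```

The paper's point is precisely about how the second line follows from the first: the terms with t' < t vanish because the score function has zero expectation under the prefix trajectory distribution.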
On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
1️⃣ One-Sentence Summary
By introducing prefix trajectory distributions and the score-function identity, this paper gives a clear and rigorous mathematical explanation of the transition from the "full return" to the "reward-to-go" in policy gradient derivations, thereby turning the "causality" argument, usually treated as a post hoc heuristic principle, into a natural corollary of the derivation.
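As a concrete companion to the summary above, here is a minimal sketch of the reward-to-go quantity itself (the function name and list-based interface are my own, not from the paper): for a reward sequence r_0, …, r_T, the reward-to-go at step t is the suffix sum of rewards from t onward.

```python
def reward_to_go(rewards):
    """Suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[T].

    This is the quantity that replaces the full return in the
    reward-to-go form of the policy gradient estimator.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example: rewards [1, 2, 3] -> reward-to-go [6, 5, 3]
print(reward_to_go([1.0, 2.0, 3.0]))
```

In a REINFORCE implementation, each log-probability gradient at step t would be weighted by `rtg[t]` instead of the full-episode return.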