📄
Abstract - On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
In introductory presentations of policy gradients, one often derives the REINFORCE estimator using the full trajectory return and then states, by "causality," that the full return may be replaced by the reward-to-go. Although this statement is correct, it is frequently presented at a level of rigor that leaves it unclear where the past-reward terms disappear. This short paper isolates that step and gives a mathematically explicit derivation based on prefix trajectory distributions and the score-function identity. The resulting account does not change the estimator. Its contribution is conceptual: instead of presenting reward-to-go as a post hoc unbiased replacement for the full return, it shows that reward-to-go arises directly once the objective is decomposed over prefix trajectories. In this formulation, the usual causality argument is recovered as a corollary of the derivation rather than as an additional heuristic principle.
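For reference, the two estimators in question can be written in standard notation (the symbols below are conventional and not taken verbatim from the paper):

```latex
% Full-return REINFORCE estimator
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \left( \sum_{t'=0}^{T} r(s_{t'}, a_{t'}) \right)
    \right]
% Reward-to-go form: the past-reward terms (t' < t) drop out
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \left( \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \right)
    \right]
```

The paper's point is precisely about how the second line follows from the first: the terms with t' < t vanish because the score function has zero expectation under the prefix trajectory distribution.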
On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
1️⃣ One-Sentence Summary
By introducing prefix trajectory distributions and the score-function identity, this paper gives a clear and rigorous mathematical explanation of the transition from the "full return" to the "reward-to-go" in policy gradient derivations, thereby turning the "causality" argument, usually treated as a post hoc heuristic principle, into a natural corollary of the derivation.
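As a concrete companion to the summary above, here is a minimal sketch of the reward-to-go quantity itself (the function name and list-based interface are my own, not from the paper): for a reward sequence r_0, …, r_T, the reward-to-go at step t is the suffix sum of rewards from t onward.

```python
def reward_to_go(rewards):
    """Suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[T].

    This is the quantity that replaces the full return in the
    reward-to-go form of the policy gradient estimator.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example: rewards [1, 2, 3] -> reward-to-go [6, 5, 3]
print(reward_to_go([1.0, 2.0, 3.0]))
```

In a REINFORCE implementation, each log-probability gradient at step t would be weighted by `rtg[t]` instead of the full-episode return.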