arXiv submission date: 2025-12-15
📄 Abstract - Differentiable Evolutionary Reinforcement Learning

The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
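
The bilevel loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the task (a five-action bandit), the three primitives (`goal_reached`, `near_goal`, `distractor`), the Bernoulli inclusion mask, and every function name and hyperparameter below are assumptions chosen only to make the idea runnable. The outer loop samples a population of candidate reward compositions, trains an inner policy on each, and applies a REINFORCE-style update to the meta-optimizer using validation success as the signal.

```python
import math
import random

NUM_ACTIONS = 5
TARGET = 3  # hypothetical "correct" action for this toy task

# Hypothetical atomic primitives: each maps an action to a scalar signal.
PRIMITIVES = {
    "goal_reached": lambda a: 1.0 if a == TARGET else 0.0,
    "near_goal": lambda a: 1.0 - abs(a - TARGET) / (NUM_ACTIONS - 1),
    "distractor": lambda a: 1.0 if a == 0 else 0.0,  # misleading signal
}
NAMES = list(PRIMITIVES)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_inner_policy(mask, steps=50, lr=1.0):
    """Inner loop: train a softmax policy on the composed reward.

    Returns validation success, i.e. the probability of the target action.
    """
    # Compose the meta-reward as the sum of the selected primitives.
    rewards = [
        sum(PRIMITIVES[n](a) for n, on in zip(NAMES, mask) if on)
        for a in range(NUM_ACTIONS)
    ]
    theta = [0.0] * NUM_ACTIONS
    for _ in range(steps):
        p = softmax(theta)
        expected = sum(pi * r for pi, r in zip(p, rewards))
        # Exact policy gradient of expected reward for a softmax bandit.
        theta = [t + lr * pi * (r - expected)
                 for t, pi, r in zip(theta, p, rewards)]
    # Validation uses the true task outcome, not the composed reward.
    return softmax(theta)[TARGET]


def run_meta_optimizer(generations=40, population=8, meta_lr=1.0, seed=0):
    """Outer loop: REINFORCE over which primitives to include."""
    rng = random.Random(seed)
    meta_logits = [0.0] * len(NAMES)
    for _ in range(generations):
        probs = [sigmoid(z) for z in meta_logits]
        # Sample a population of candidate reward compositions.
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(population)]
        scores = [train_inner_policy(m) for m in masks]
        baseline = sum(scores) / len(scores)  # population-mean baseline
        for i in range(len(NAMES)):
            # d/dz log Bernoulli(m; sigmoid(z)) = m - sigmoid(z)
            grad = sum((s - baseline) * (m[i] - probs[i])
                       for m, s in zip(masks, scores))
            meta_logits[i] += meta_lr * grad / population
    return {n: sigmoid(z) for n, z in zip(NAMES, meta_logits)}


if __name__ == "__main__":
    for name, p in run_meta_optimizer().items():
        print(f"{name}: inclusion probability {p:.2f}")
```

In this sketch the inclusion probability of the useful primitive should rise and that of the misleading one should fall, because the update weights each sampled composition by how well the policy trained on it performs on the true task. A derivative-free evolutionary search would only rank or select those compositions, whereas here the validation score feeds a gradient estimate for the generator itself.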

Top-level tags: reinforcement learning, model training, agents
Detailed tags: reward shaping, meta-learning, evolutionary algorithms, bilevel optimization, autonomous agents

Differentiable Evolutionary Reinforcement Learning


1️⃣ One-sentence summary

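This paper proposes a new method called DERL that automatically learns and optimizes the reward function itself, much as one would train an agent, so that AI agents learn more efficiently to set better objectives for themselves in complex reasoning tasks.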


Source: arXiv: 2512.13399