arXiv submission date: 2026-01-27
📄 Abstract - APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when they are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding the performance degradation caused by adhering to them too strictly.

Top-level tags: reinforcement learning, agents, model training
Detailed tags: hierarchical rl, imitation learning, normalizing flow, policy composition, exploration

APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition


1️⃣ One-sentence summary

This paper proposes Adaptive Policy Composition (APC), a hierarchical reinforcement learning method that accelerates learning by intelligently exploiting demonstration data that may be imperfect or only partially aligned with the target task: it leverages and refines the demonstrations when they are useful and flexibly sidesteps them when they are misaligned, yielding robust and efficient learning across a range of data quality.
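
The composition idea can be made concrete with a short sketch. This is not the authors' implementation: the NF priors are replaced by simple Gaussian stand-ins, and all names here (`BehaviorPrior`, `Composer`, `gate`, `task_policy`) are hypothetical. The idea illustrated is the one described in the abstract: a gating network estimates per-state applicability weights over the data-driven priors plus one freely learned task policy, and actions are drawn from the selected component, so useful priors guide exploration while misaligned ones can be down-weighted and effectively sidestepped. In APC these components would then be optimized for downstream reward (training loop omitted here).

```python
import torch
import torch.nn as nn


class BehaviorPrior(nn.Module):
    """Stand-in for one data-driven prior (the paper uses Normalizing Flows)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())


class Composer(nn.Module):
    """Composes K priors plus one freely learned task policy.

    A gating network outputs per-state applicability weights; misaligned
    priors can be down-weighted in favor of the task policy.
    """
    def __init__(self, state_dim, action_dim, num_priors):
        super().__init__()
        self.priors = nn.ModuleList(
            [BehaviorPrior(state_dim, action_dim) for _ in range(num_priors)])
        self.task_policy = BehaviorPrior(state_dim, action_dim)
        self.gate = nn.Linear(state_dim, num_priors + 1)  # +1 for the task policy

    def act(self, state):
        weights = torch.softmax(self.gate(state), dim=-1)            # applicability estimates
        k = torch.distributions.Categorical(weights).sample().item()  # pick a component
        components = list(self.priors) + [self.task_policy]
        dist = components[k].dist(state)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), weights


# Usage: draw one composed action for a random 8-dimensional state.
composer = Composer(state_dim=8, action_dim=2, num_priors=3)
action, log_prob, weights = composer.act(torch.randn(8))
print(action, weights)  # low weight on a prior ~ the composer sidesteps it
```
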

Source: arXiv: 2601.19452