Efficient Federated RLHF via Zeroth-Order Policy Optimization
1️⃣ One-Sentence Summary
This paper proposes an efficient federated learning algorithm, Par-S^2ZPO, that lets resource-constrained devices (such as phones and sensors) collaborate on reinforcement learning from human feedback; it preserves learning quality while sharply cutting communication and computation overhead, converging faster and performing better than existing methods.
This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
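To make the mechanics concrete, here is a minimal Python sketch of the two ingredients the abstract names: a zeroth-order gradient estimate built from binary (+/-1) perturbations, and sign-based communication between agents and a server. This is not the paper's actual Par-S$^2$ZPO; the function names `zo_gradient_estimate` and `federated_sign_round`, the majority-vote aggregation, the hyperparameters, and the omission of the "Partitioned" component are all illustrative assumptions.

```python
import numpy as np

def zo_gradient_estimate(reward_fn, theta, mu=0.01, num_samples=8, rng=None):
    """Zeroth-order gradient estimate of reward_fn at theta using binary
    (Rademacher, i.e. +/-1) perturbations. Each sample needs only two
    reward evaluations -- no backpropagation, no stored activations."""
    if rng is None:
        rng = np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=theta.shape)   # binary perturbation
        diff = (reward_fn(theta + mu * z) - reward_fn(theta - mu * z)) / (2 * mu)
        grad += diff * z
    return grad / num_samples

def federated_sign_round(reward_fns, theta, lr=0.05, mu=0.01):
    """One communication round: each agent uploads only the sign of its
    local zeroth-order gradient (one bit per coordinate); the server
    aggregates by majority vote and takes a sign-based ascent step."""
    local_signs = [np.sign(zo_gradient_estimate(f, theta, mu)) for f in reward_fns]
    vote = np.sign(np.sum(local_signs, axis=0))          # majority vote
    return theta + lr * vote                             # maximize reward

# Toy usage: two "agents" whose stand-in human-feedback rewards disagree slightly.
f1 = lambda th: -np.sum((th - 1.0) ** 2)
f2 = lambda th: -np.sum((th - 1.2) ** 2)
theta = np.zeros(4)
for _ in range(200):
    theta = federated_sign_round([f1, f2], theta)
print(theta)  # settles near ~1.1, between the two agents' optima
```

Transmitting only signs compresses each agent's upload to one bit per coordinate, and the two-point reward evaluations avoid backpropagation entirely, which is consistent with the low communication, computation, and memory complexity the abstract claims by design.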
Source: arXiv: 2604.17747