Heterogeneous Agent Collaborative Reinforcement Learning
1️⃣ One-Sentence Summary
This paper proposes a new learning paradigm called HACRL, in which different types of AI agents share experience and improve together during training while still operating independently at deployment, significantly improving the performance of all participating agents without sacrificing efficiency.
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
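The abstract's core idea of reusing another agent's verified rollouts while correcting for policy distribution shift can be illustrated with standard importance-sampling reweighting. The sketch below is a minimal, hypothetical illustration: the names (`Rollout`, `corrected_advantages`) and the single-ratio correction are assumptions for exposition, not the paper's four actual mechanisms.

```python
# Hypothetical sketch of cross-agent rollout sharing with an
# importance-sampling correction. This illustrates the general principle
# (off-policy reweighting keeps the advantage estimate unbiased in
# expectation); HACPO's actual mechanisms are not reproduced here.
import math
from dataclasses import dataclass

@dataclass
class Rollout:
    reward: float
    logp_behavior: float  # log-prob under the agent that generated the rollout
    verified: bool        # only verified rollouts (e.g. answer-checked) are shared

def corrected_advantages(rollouts, logp_target, baseline):
    """Advantage estimates for a target agent reusing shared rollouts.

    Each shared sample is reweighted by the importance ratio
    pi_target / pi_behavior, so the estimate matches on-policy
    expectation despite the behavior/target policy mismatch.
    """
    advs = []
    for r, lp_t in zip(rollouts, logp_target):
        if not r.verified:          # discard unverified rollouts
            continue
        ratio = math.exp(lp_t - r.logp_behavior)  # importance weight
        advs.append(ratio * (r.reward - baseline))
    return advs
```

When behavior and target policies agree on a sample, the ratio is 1 and the advantage reduces to the familiar on-policy `reward - baseline`; samples the target policy finds unlikely are down-weighted rather than discarded, which is what lets shared rollouts raise sample utilization.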
Source: arXiv: 2603.02604