arXiv submission date: 2026-02-05
📄 Abstract - Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
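The abstract's two ingredients, worst-task-focused dynamic weighting and a ratio-preserving sampler, can be illustrated with a minimal sketch. This is not the paper's implementation; the softmax-over-negative-accuracy weighting, the `temperature` parameter, and the rounding scheme are all illustrative assumptions:

```python
import math

def worst_task_weights(task_accuracies, temperature=0.5):
    # Hypothetical dynamic weighting: lower-accuracy tasks receive
    # larger weights via a softmax over negative accuracy, so the
    # worst task dominates the objective as temperature shrinks.
    logits = [-a / temperature for a in task_accuracies]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ratio_preserving_sample_counts(weights, batch_size):
    # Allocate per-task prompt counts proportional to the adapted
    # weights, handing out rounding remainders by largest fractional
    # part so the realized ratios track the intended ones.
    raw = [w * batch_size for w in weights]
    counts = [int(r) for r in raw]
    remainder = batch_size - sum(counts)
    order = sorted(range(len(raw)),
                   key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts
```

Under this sketch, a task stuck at low accuracy is both upweighted in the loss and oversampled in each batch, which is one plausible way to keep its effective gradient contribution from collapsing when many of its prompts yield zero advantage.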

Top-level tags: llm model training agents
Detailed tags: reinforcement learning post-training multi-task learning policy optimization reasoning

Multi-Task GRPO: Reliable LLM Reasoning Across Tasks


1️⃣ One-sentence summary

This work proposes MT-GRPO, a new algorithm that dynamically adapts task weights and introduces a ratio-preserving sampler to address the performance imbalance common in multi-task reinforcement learning training, significantly improving the model's worst-task performance while also improving training efficiency.

Source: arXiv 2602.05547