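Learning to Route Languages for Multilingual Policy Optimization
1️⃣ One-Sentence Summary
This paper proposes language-routed policy optimization (LRPO), a method that treats language as a selectable variable and uses a multi-armed bandit to decide dynamically which languages to explore during reinforcement learning, so that multilingual data is used more effectively under a limited compute budget to improve the cross-lingual performance of large language models.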
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at this https URL.
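To make the bandit framing concrete, the following is a minimal sketch of a language router, not the authors' implementation. The abstract does not state which bandit algorithm or reward LRPO uses, so this sketch swaps in the standard UCB1 rule and assumes the reward for a language is a scalar in [0, 1] that summarizes how informative its rollouts were. The class name `UCBLanguageRouter`, the exploration constant `c`, the language codes, and the simulated reward probabilities are all hypothetical.

```python
import math
import random


class UCBLanguageRouter:
    """Illustrative UCB1 router over languages (an assumption, not LRPO itself)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration strength (hypothetical hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.values = {lang: 0.0 for lang in self.languages}  # running mean reward
        self.total = 0  # total number of updates across all languages

    def _score(self, lang):
        # Untried languages get an infinite score so each is explored once.
        if self.counts[lang] == 0:
            return float("inf")
        bonus = self.c * math.sqrt(2.0 * math.log(self.total) / self.counts[lang])
        return self.values[lang] + bonus

    def select(self, k):
        """Return the k languages with the highest upper confidence bound."""
        ranked = sorted(self.languages, key=self._score, reverse=True)
        return ranked[:k]

    def update(self, lang, reward):
        """Fold one observed reward into the running mean for `lang`."""
        self.counts[lang] += 1
        self.total += 1
        self.values[lang] += (reward - self.values[lang]) / self.counts[lang]


def simulate(steps=500, k=2, seed=0):
    """Toy loop: each step spends a fixed rollout budget of k languages."""
    rng = random.Random(seed)
    # Made-up probabilities that a rollout in each language is informative.
    true_signal = {"en": 0.3, "zh": 0.7, "sw": 0.5, "de": 0.2}
    router = UCBLanguageRouter(true_signal, c=0.5)
    for _ in range(steps):
        for lang in router.select(k):
            reward = 1.0 if rng.random() < true_signal[lang] else 0.0
            router.update(lang, reward)
    return router


if __name__ == "__main__":
    router = simulate()
    for lang in router.languages:
        print(lang, router.counts[lang], round(router.values[lang], 3))
```

In this sketch the confidence bonus shrinks as a language is sampled more often, so rarely chosen languages are revisited occasionally while languages with higher observed reward receive most of the budget. A real training loop would replace the simulated reward with a signal derived from the actual rollouts.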
Source: arXiv: 2605.25360