ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection

📄 Abstract

Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in the expectation that it yields the strongest performance, despite the demonstrated potential advantage of multilingual thinking and the need of global users for thinking traces in their native languages. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL by using multiple languages. The results show that our method steadily outperforms English-only training under the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space through diversified language preference and improves the RL exploitation outcome by leveraging the non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.

1️⃣ One-Sentence Summary

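This paper proposes a new method called ExpLang, which lets a large language model choose for itself which language to use for its internal thinking during reinforcement learning training, thereby leveraging multilingual advantages to improve the model's reasoning ability and final performance.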
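To make the idea of "thinking language selection as an action" more concrete, the toy sketch below treats the choice of language as a categorical action with a softmax policy and updates it with a plain REINFORCE step. This is not the paper's pipeline: the language list, the made-up success rates in `toy_reward`, the learning rate, and all function names are assumptions introduced purely for illustration. In ExpLang the choice would be made by the LLM itself as part of its rollout rather than by a separate bandit.

```python
import math
import random

# Hypothetical candidate thinking languages (assumption, not from the paper).
LANGUAGES = ["en", "zh", "fr"]


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_language(logits, rng):
    """Sample a language index on-policy from the current softmax policy."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


def reinforce_update(logits, idx, probs, reward, baseline, lr=0.1):
    """One REINFORCE step: logits += lr * advantage * grad log pi(idx)."""
    advantage = reward - baseline
    # Gradient of log-softmax w.r.t. logit j is 1[j == idx] - probs[j].
    return [
        logit + lr * advantage * ((1.0 if j == idx else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


def toy_reward(lang, rng):
    """Stand-in for a verifiable task reward; success rates are made up."""
    success_rate = {"en": 0.5, "zh": 0.7, "fr": 0.4}[lang]
    return 1.0 if rng.random() < success_rate else 0.0


def train(steps=2000, seed=0):
    """Run the toy loop and return the final language preference."""
    rng = random.Random(seed)
    logits = [0.0] * len(LANGUAGES)
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        idx, probs = sample_language(logits, rng)
        reward = toy_reward(LANGUAGES[idx], rng)
        logits = reinforce_update(logits, idx, probs, reward, baseline)
        baseline += 0.05 * (reward - baseline)
    return dict(zip(LANGUAGES, softmax(logits)))


if __name__ == "__main__":
    print(train())
```

In this toy setting, sampling from the policy keeps every language in play early on (exploration), while the reward-weighted update gradually shifts probability toward whichever language happens to pay off (exploitation). The same loop could in principle sit under other policy-gradient variants, which is one way to read the abstract's claim that the method is orthogonal to most RL algorithms.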
