Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

📄 Abstract

Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
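To make the optimization signal concrete, here is a minimal sketch of how an embedding-level reward could feed group-relative advantages in a GRPO-style update. The abstract does not specify the embedding model, the similarity measure, or the normalization, so everything here is an assumption for illustration: cosine similarity as the reward, population standard deviation for normalization, the hypothetical helper names `semantic_rewards` and `group_relative_advantages`, and the toy three-dimensional vectors standing in for real sentence embeddings.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no semantic signal
    return dot / (norm_u * norm_v)


def semantic_rewards(candidate_embeddings, reference_embedding):
    """Score each sampled output by embedding similarity to the reference.

    Assumption: the reward is plain cosine similarity; the paper may use a
    different semantic scoring function.
    """
    return [cosine_similarity(c, reference_embedding) for c in candidate_embeddings]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit scale).

    Each output is judged relative to its siblings for the same prompt,
    so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled outputs for one prompt (hypothetical embeddings).
reference = [1.0, 0.0, 0.0]
candidates = [
    [1.0, 0.0, 0.0],  # same meaning as the reference
    [1.0, 1.0, 0.0],  # partially overlapping meaning
    [0.0, 1.0, 0.0],  # unrelated meaning
]

rewards = semantic_rewards(candidates, reference)
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Because the reward depends on meaning rather than exact token overlap, two differently worded outputs with similar embeddings receive similar advantages, which is the flexibility the abstract contrasts with token-level imitation under SFT.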

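1️⃣ One-Sentence Summary

This study proposes a reinforcement learning method that replaces conventional text matching with semantic rewards, so that a large model being extended to a low-resource language can learn the new language's tasks without forgetting its existing general knowledge. This addresses a common failure of standard fine-tuning, where learning a new language costs the model its original capabilities.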
