arXiv submission date: 2026-02-03
📄 Abstract - On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process is yet to be thoroughly investigated. In this paper, we establish a theoretical framework for analyzing the entropy dynamics during the RFT process, which begins with a discriminant expression that quantifies entropy change under a single logit update. This foundation enables the derivation of a first-order expression for entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods, and also offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence to support the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
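The abstract does not spell out the formulas, but for a softmax policy the kind of first-order entropy-change expression it refers to can be sketched as follows (a standard softmax identity; the paper's exact discriminant and its GRPO extension may differ):

$$
\pi_a = \frac{e^{z_a}}{\sum_b e^{z_b}}, \qquad
H(\pi) = -\sum_a \pi_a \log \pi_a,
$$

$$
\Delta H \;\approx\; \sum_a \frac{\partial H}{\partial z_a}\,\Delta z_a
\;=\; -\sum_a \pi_a\bigl(\log \pi_a + H(\pi)\bigr)\,\Delta z_a
\;=\; -\operatorname{Cov}_{a\sim\pi}\!\bigl(\log \pi_a,\ \Delta z_a\bigr).
$$

Read as a discriminant, the sign of this covariance indicates whether a single logit update raises or lowers entropy: updates positively correlated with $\log \pi_a$ (boosting already-likely tokens) reduce entropy, while updates favoring unlikely tokens increase it.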

Top-level tags: llm model training theory
Detailed tags: reinforcement learning fine-tuning entropy dynamics exploration-exploitation policy optimization

On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models


1️⃣ One-Sentence Summary

This paper establishes a theoretical framework for analyzing how the diversity of a large language model's outputs evolves during reinforcement fine-tuning, and builds on it to propose entropy control methods that help the model better balance exploring new answers and exploiting existing knowledge during fine-tuning.
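As a rough illustration only (a generic sketch assuming PyTorch, not the paper's released code; the function name entropy_change_discriminant is hypothetical), the quantities involved can be computed per token from the model's logits:

```python
# Illustrative sketch (assumed PyTorch; not the paper's implementation).
# Computes the policy entropy H(pi) and a first-order estimate of the entropy
# change  dH ≈ -Cov_{a~pi}(log pi_a, delta_z_a)  under a hypothetical logit
# update delta_z, which could serve as an entropy discriminator for clipping.
import torch
import torch.nn.functional as F

def entropy_change_discriminant(logits: torch.Tensor, delta_z: torch.Tensor):
    """logits, delta_z: [..., vocab_size] tensors. Returns (entropy, dH_estimate)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)              # H(pi)
    mean_dz = (probs * delta_z).sum(dim=-1, keepdim=True)   # E_pi[delta_z]
    # -Cov_pi(log pi, delta_z): negative when the update concentrates probability
    # on already-likely tokens, i.e. when entropy is predicted to drop.
    d_entropy = -(probs * log_probs * (delta_z - mean_dz)).sum(dim=-1)
    return entropy, d_entropy
```

A clipping rule in the spirit of the abstract's "entropy-discriminator clipping" might mask or down-weight token updates whose predicted entropy change falls below a threshold, though the paper's actual criterion should be taken from the source.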

Source: arXiv:2602.03392