Muon+: Towards Better Muon via One Additional Normalization Step
1️⃣ One-sentence summary
This paper proposes an improved optimizer named Muon+, which adds a normalization step after the gradient-orthogonalization step of the original Muon optimizer, yielding consistent gains in training and validation performance when pre-training large language models across a range of scales and architectures.
The Muon optimizer has demonstrated promising performance in pre-training large language models through gradient (or momentum) orthogonalization. In this work, we propose a simple yet effective enhancement to Muon, namely Muon+, which introduces an additional normalization step after orthogonalization. We demonstrate the effectiveness of Muon+ through extensive pre-training experiments across a wide range of model scales and architectures. Our evaluation includes GPT-style models ranging from 130M to 774M parameters and LLaMA-style models ranging from 60M to 1B parameters. We comprehensively evaluate the effectiveness of Muon+ in the compute-optimal training regime and further extend the token-to-parameter (T2P) ratio to an industrial level of $\approx 200$. Experimental results show that Muon+ provides a consistent boost on training and validation perplexity over Muon. We provide our code here: this https URL.
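The abstract describes Muon+ as Muon with one extra normalization applied after orthogonalization. The sketch below illustrates that idea: a momentum update, a Newton-Schulz orthogonalization (the iteration Muon uses; the quintic coefficients shown are from the public Muon implementation), and then an additional normalization. The choice of unit Frobenius norm for that final step, and the function names and hyperparameters, are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via a quintic Newton-Schulz
    iteration, as used by Muon (coefficients from the public
    Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale so the singular values lie in the iteration's convergence region.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:              # work with the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_plus_update(W, momentum, grad, lr=0.02, beta=0.95, eps=1e-7):
    """One Muon+-style step: momentum -> orthogonalize -> one extra
    normalization. Normalizing to unit Frobenius norm here is an
    assumption; the paper's exact normalization may differ."""
    momentum = beta * momentum + grad
    O = newton_schulz_orthogonalize(momentum)
    O = O / (np.linalg.norm(O) + eps)   # the additional normalization step
    W = W - lr * O
    return W, momentum
```

With the unit-norm assumption, every update has the same Frobenius magnitude `lr`, so the step size is decoupled from the scale of the orthogonalized momentum.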
Source: arXiv:2602.21545