arXiv submission date: 2026-05-05
📄 Abstract - Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs); however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs, like Muon, or allowing radial jitter that compromises stability, like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.

Top-level tags: llm model training
Detailed tags: optimizer matrix preconditioning scale invariance llm training computational efficiency

Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer


1️⃣ One-Sentence Summary

This paper proposes Nora, a new matrix optimizer that stabilizes training by projecting row-wise momentum onto the orthogonal complement of the weights, and that exploits the block-diagonal dominance of the Transformer Hessian to approximate structured preconditioning efficiently, thereby simultaneously meeting the three requirements of accelerated optimization, scale invariance, and low computational overhead.
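
Based only on the description above, the following is a minimal PyTorch-style sketch of the row-wise projection step the summary refers to. The function name, argument names, the epsilon guard, and the final row normalization are assumptions for illustration; the paper's actual two-line implementation may differ.

```python
# Sketch of a Nora-like update, assuming the abstract's description:
# project each momentum row onto the orthogonal complement of the
# corresponding weight row (removing the radial, norm-changing component),
# then normalize rows so they move with comparable angular velocity.
import torch

def nora_like_update(weight: torch.Tensor, momentum: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
    """weight, momentum: (m, n) matrices. Cost is O(mn), matching the
    complexity claimed in the abstract. Names are hypothetical."""
    # m_i <- m_i - (<m_i, w_i> / ||w_i||^2) * w_i  (row-wise projection)
    radial = (momentum * weight).sum(dim=-1, keepdim=True)
    radial = radial / (weight * weight).sum(dim=-1, keepdim=True).clamp_min(eps)
    tangential = momentum - radial * weight

    # Row-wise normalization of the remaining tangential component.
    return tangential / tangential.norm(dim=-1, keepdim=True).clamp_min(eps)

# Hypothetical usage inside a training step (learning rate lr assumed):
# weight.add_(nora_like_update(weight, momentum), alpha=-lr)
```

Because the update only removes the component of the momentum parallel to each weight row, the weight norms stay approximately fixed, which is one way to respect the scale-invariance the abstract emphasizes.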

Source: arXiv:2605.03769