Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
1️⃣ One-Sentence Summary
This paper derives an explicit mathematical formula for Mixture-of-Experts models that, much like tuning a recipe, determines the optimal split of compute between the expert and attention modules as a function of total compute budget and model sparsity, enabling the design of the best-performing model under a fixed compute budget.
This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
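To make the ratio $r$ concrete, here is a minimal sketch of the standard back-of-envelope FLOP accounting it rests on. This is not the paper's exact accounting: the function names are hypothetical, attention FLOPs use the common estimate of 8·d² for the QKV/output projections plus 4·seq·d for score and value computation, and each routed expert is assumed to be a plain two-matrix FFN costing 4·d·d_ff FLOPs per token.

```python
def attention_flops_per_token(d_model: int, seq_len: int) -> int:
    # Q, K, V, and output projections: 4 weight matrices of size
    # d_model x d_model, 2 FLOPs per multiply-accumulate -> 8 * d_model^2
    proj = 8 * d_model ** 2
    # QK^T scores and value aggregation: ~2 * seq_len * d_model FLOPs each
    attn = 4 * seq_len * d_model
    return proj + attn

def expert_flops_per_token(d_model: int, d_ff: int, top_k: int) -> int:
    # Each of the top_k routed experts is a 2-matrix FFN:
    # up-projection (d_model x d_ff) + down-projection (d_ff x d_model)
    return top_k * 4 * d_model * d_ff

def expert_attention_ratio(d_model: int, d_ff: int,
                           top_k: int, seq_len: int) -> float:
    # r = fraction of per-token FLOPs spent in the expert sub-layers
    e = expert_flops_per_token(d_model, d_ff, top_k)
    a = attention_flops_per_token(d_model, seq_len)
    return e / (e + a)

# Example: a small MoE configuration (illustrative numbers only)
r = expert_attention_ratio(d_model=1024, d_ff=4096, top_k=2, seq_len=2048)
print(f"expert-attention compute ratio r = {r:.3f}")
```

Under this accounting, raising `top_k` (i.e., activating more experts per token) pushes $r$ toward 1, while longer sequences push it toward 0; the paper's result is that the loss-optimal value $r^*$ of this quantity follows a power law in total compute and sparsity.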
Source: arXiv: 2603.10379