EMO: Frustratingly Easy Progressive Training of Extendable MoE
1️⃣ One-sentence summary
This paper proposes EMO, a simple and effective progressive training framework that gradually increases the number of experts over the course of training instead of allocating the full expert pool from the start, significantly reducing the training time and GPU cost of Mixture-of-Experts models while preserving model performance.
Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on the k active experts rather than the total pool of E experts. Yet this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in the scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
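To make the idea of "growing the expert pool at stage boundaries" concrete, here is a minimal PyTorch sketch of progressive expert expansion. It is not the paper's implementation: the `MoELayer`, the `grow_experts` helper, its copy-based initialization of new experts, and the example stage schedule are all illustrative assumptions; EMO's actual expansion rule and compute-optimal token budgets come from its sparsity-aware scaling law.

```python
import copy
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy top-k MoE layer, used only to illustrate progressive expert growth."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); per-token FLOPs depend on top_k, not on len(self.experts)
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

def grow_experts(layer: MoELayer, new_num_experts: int) -> None:
    """Hypothetical helper: expand the expert pool in place at a stage boundary.

    New experts are initialized as copies of existing ones, and the router gains
    matching output columns, so routing to the original experts is preserved.
    """
    old_e = len(layer.experts)
    assert new_num_experts > old_e
    for i in range(new_num_experts - old_e):
        layer.experts.append(copy.deepcopy(layer.experts[i % old_e]))
    old_router = layer.router
    new_router = nn.Linear(old_router.in_features, new_num_experts, bias=False)
    with torch.no_grad():
        new_router.weight[:old_e] = old_router.weight
        # Small init for new columns so new experts start with low routing probability.
        new_router.weight[old_e:].normal_(std=1e-3)
    layer.router = new_router

# Stage-wise schedule (illustrative numbers, not the paper's compute-optimal budgets):
# train on a per-stage token budget, then grow the expert pool before the next stage.
layer = MoELayer(d_model=64, d_ff=256, num_experts=4)
for stage_experts, stage_tokens in [(4, 1_000), (8, 2_000), (16, 4_000)]:
    if stage_experts > len(layer.experts):
        grow_experts(layer, stage_experts)
    # ... train on `stage_tokens` tokens with the current expert pool ...
```

The point of the sketch is the asymmetry the abstract describes: the forward cost is governed by `top_k`, while memory and communication scale with the pool size, so deferring expansion to later stages saves resources early on without changing per-token compute.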
From arXiv: 2605.13247