arXiv submission date: 2025-12-29
📄 Abstract - Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
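The two constraints above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the abstract does not give the exact formula, so the hinge form, the margin factor `alpha`, the Gaussian perturbation, the L2 norm, and the toy one-layer ReLU expert are all assumptions, and the names `activation_norm_matrix` and `erc_loss` are hypothetical.

```python
import math
import random


def activation_norm_matrix(router_embeddings, expert_weights, noise_scale=0.0, rng=None):
    """Return M where M[i][j] is the norm of expert j's activation on proxy token i."""
    rng = rng or random.Random(0)
    n = len(router_embeddings)
    M = [[0.0] * n for _ in range(n)]
    for i, emb in enumerate(router_embeddings):
        # Assumption: the proxy token is the router embedding plus Gaussian noise.
        proxy = [x + noise_scale * rng.gauss(0.0, 1.0) for x in emb]
        for j, W in enumerate(expert_weights):
            # Assumption: a toy expert, one linear layer (hidden x dim) followed by ReLU.
            hidden = [max(0.0, sum(w * x for w, x in zip(row, proxy))) for row in W]
            M[i][j] = math.sqrt(sum(h * h for h in hidden))
    return M


def erc_loss(M, alpha=1.0):
    """Hinge-style penalty on the n x n matrix M (assumed form, alpha is hypothetical)."""
    n = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Constraint (2): proxy token i should activate expert i more than expert j.
            total += max(0.0, M[i][j] - alpha * M[i][i])
            # Constraint (1): expert i should respond more to proxy i than to proxy j.
            total += max(0.0, M[j][i] - alpha * M[i][i])
    return total / (n * n)


# Toy usage: two experts, each tuned to its own router embedding.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(erc_loss(activation_norm_matrix(embeddings, experts)))
```

Under these assumptions the loss is zero whenever every diagonal entry of `M` dominates its row and its column, and it only ever touches the n x n entries of `M`, regardless of how many tokens are in the batch.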

Top-level tags: model training, machine learning theory
Detailed tags: mixture of experts, auxiliary loss, router alignment, model specialization, efficient training

Expert-Router Coupling Loss: Strengthening Router-Expert Alignment in Mixture-of-Experts Models / Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss


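1️⃣ One-sentence summary

This paper proposes the expert-router coupling (ERC) loss, a new lightweight auxiliary loss that constrains the activation-norm matrix of proxy tokens to address the lack of explicit constraints between router decisions and expert capabilities in conventional MoE models, significantly improving model performance at negligible overhead.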


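2️⃣ Key contributions

1. Expert-router coupling loss (ERC loss)

2. An ERC-based framework for analyzing expert specialization

3. An efficient and scalable training method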
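The efficiency claim comes down to simple counting: the ERC loss looks at n^2 activations for n experts, while the abstract describes prior coupling methods as scaling with the number of tokens in a batch. The sketch below only illustrates that arithmetic. The function name `coupling_cost_table` is hypothetical, and the expert count and batch sizes are illustrative values, not figures from the paper.

```python
def coupling_cost_table(num_experts, batch_sizes):
    """Compare a fixed n^2 activation count with a count that grows with batch size."""
    rows = []
    for tokens in batch_sizes:
        rows.append({
            "tokens_per_batch": tokens,
            # ERC loss: one activation per (proxy token, expert) pair, independent of batch.
            "erc_activations": num_experts * num_experts,
            # Token-scaled method: at least one unit of work per token (illustrative lower bound).
            "token_scaled_units": tokens,
        })
    return rows


# Illustrative values only: 64 experts, batches of 1M and 4M tokens.
for row in coupling_cost_table(64, [1_000_000, 4_000_000]):
    print(row)
```

With these illustrative inputs the ERC column stays at 64 * 64 = 4096 in every row, while the token-scaled column grows in proportion to the batch.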


3️⃣ Main results and value

Result highlights

Practical value


4️⃣ Glossary

Source: arXiv: 2512.23447