MiMo-V2-Flash Technical Report
1️⃣ One-Sentence Summary
This paper introduces MiMo-V2-Flash, an efficient large language model that combines an innovative Mixture-of-Experts architecture with new training methods to match top-tier open-weight models in reasoning and agentic capabilities while using far fewer parameters, and to decode faster at inference time.
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), using a native 32k context length that is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to fully absorb each teacher's expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves an acceptance length of up to 3.6 and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
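A minimal sketch of the hybrid attention layout described above, assuming the 5:1 hybrid ratio means five SWA layers for every global-attention layer; the helper names and layer ordering are illustrative assumptions, not the released implementation:

```python
import numpy as np

WINDOW = 128          # sliding window size in tokens, per the abstract
SWA_PER_GLOBAL = 5    # assumed reading of the 5:1 hybrid ratio

def layer_kind(layer_idx: int) -> str:
    """Return 'swa' or 'global' for a layer index under a repeating 5:1 cycle."""
    return "global" if layer_idx % (SWA_PER_GLOBAL + 1) == SWA_PER_GLOBAL else "swa"

def attention_mask(seq_len: int, kind: str) -> np.ndarray:
    """Boolean mask (True = may attend): causal for 'global', causal + windowed for 'swa'."""
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q
    if kind == "global":
        return causal
    # SWA: each query only sees the most recent WINDOW tokens (including itself)
    return causal & (q - k < WINDOW)

if __name__ == "__main__":
    print([layer_kind(i) for i in range(12)])   # five 'swa' layers, then one 'global', repeated
    print(attention_mask(6, "swa").astype(int)) # small windowed-causal mask for inspection
```

This only illustrates the masking pattern; the actual model additionally uses MoE feed-forward layers and MTP heads that are not shown here.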
Source: arXiv: 2601.02780