📄 Paper Summary
Motif-2-12.7B Technical Report
1️⃣ One-Sentence Summary
This paper introduces Motif-2-12.7B, a new, efficient open-weight large language model that combines a novel Grouped Differential Attention architecture with system-level optimization to deliver strong language understanding and instruction-following capabilities comparable to much larger models under constrained compute budgets.
We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
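For readers unfamiliar with the differential-attention idea the abstract alludes to, the sketch below illustrates the general mechanism: two softmax attention maps are computed, and a scaled "noise-control" map is subtracted from the "signal" map before attending to the values, with the grouped variant allocating unequal numbers of heads to the two pathways. This is a minimal illustration under those assumptions, not the paper's exact GDA formulation; the tensor layout, the head-broadcast scheme, and the fixed `lam` weight are illustrative choices.

```python
import torch
import torch.nn.functional as F

def grouped_differential_attention(q_sig, k_sig, q_noise, k_noise, v, lam=0.5):
    # q_sig, k_sig, v:  (batch, n_sig_heads, seq, head_dim)   -- signal pathway
    # q_noise, k_noise: (batch, n_noise_heads, seq, head_dim) -- noise-control pathway
    # Assumes n_sig_heads is a multiple of n_noise_heads.
    d = q_sig.size(-1)

    # Standard scaled-dot-product attention map for the signal heads.
    attn_sig = F.softmax(q_sig @ k_sig.transpose(-2, -1) / d ** 0.5, dim=-1)

    # Attention map for the smaller noise-control group, broadcast so each
    # signal head has a matching noise-control map to subtract.
    attn_noise = F.softmax(q_noise @ k_noise.transpose(-2, -1) / d ** 0.5, dim=-1)
    attn_noise = attn_noise.repeat_interleave(q_sig.size(1) // q_noise.size(1), dim=1)

    # Differential map: subtracting the noise-control scores is intended to
    # cancel attention mass placed on irrelevant context.
    return (attn_sig - lam * attn_noise) @ v


# Toy usage: 8 signal heads sharing 2 noise-control heads (hypothetical sizes).
if __name__ == "__main__":
    b, s, d = 1, 16, 32
    q_s, k_s, v = (torch.randn(b, 8, s, d) for _ in range(3))
    q_n, k_n = (torch.randn(b, 2, s, d) for _ in range(2))
    out = grouped_differential_attention(q_s, k_s, q_n, k_n, v)
    print(out.shape)  # torch.Size([1, 8, 16, 32])
```

The asymmetric head allocation is the point of the "grouped" variant: spending fewer heads on the noise-control pathway frees capacity for the signal pathway, which is how the report describes improving representational efficiency.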