Abstract - Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
1️⃣ One-sentence summary
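This paper finds that when training Transformer models such as GPT-2, applying different geometric constraints to the attention and MLP modules (Stiefel manifold for attention, DGram manifold for MLP) gives the best performance among the tested configurations, whereas the inverted assignment and the all-DGram configuration become unstable under the shared hyperparameter setting. The cause is that the DGram constraint lets the singular values of the attention weights grow, which amplifies the attention logits and saturates the softmax.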
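The failure mechanism named in the abstract (larger singular values, then larger attention logits, then softmax saturation) can be illustrated with a toy calculation. The sketch below is not the paper's method and implements neither Manifold Muon nor the DGram constraint. It assumes, purely for illustration, that the query and key projections are both `scale * I`, so every singular value equals `scale`; `scale = 1` corresponds to an orthonormal matrix, as on the Stiefel manifold. The names `QUERY`, `KEYS`, `attention_row`, and `saturation_sweep` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; low entropy means a saturated softmax."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_row(query, keys, scale):
    """One row of attention weights with W_Q = W_K = scale * I (assumption).

    All singular values of both projections equal `scale`, so the logits
    grow as scale**2 relative to the orthonormal case scale = 1.
    """
    d = len(query)
    q = [scale * x for x in query]
    logits = []
    for key in keys:
        k = [scale * x for x in key]
        logits.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    return softmax(logits)


# Toy inputs, chosen only for illustration.
QUERY = [1.0, 0.5, -0.5, 0.0]
KEYS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]


def saturation_sweep(scales=(1.0, 2.0, 4.0, 8.0)):
    """Return (scale, entropy, max attention weight) for each scale."""
    rows = []
    for s in scales:
        probs = attention_row(QUERY, KEYS, s)
        rows.append((s, entropy(probs), max(probs)))
    return rows


if __name__ == "__main__":
    for s, h, p_max in saturation_sweep():
        print(f"scale={s:4.1f}  entropy={h:.4f}  max_weight={p_max:.4f}")
```

As `scale` grows, the entropy of the attention row falls and the largest weight approaches 1, which is the saturation regime the abstract associates with unstable training. Real projections are not scaled identities, so this shows only the direction of the effect, not its size in GPT-2.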