Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks
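1️⃣ One-sentence summary
This paper proposes DDC, an optimizer augmentation that makes the optimizer respect the intrinsic symmetries of a neural network's parameters (such as rescaling and rotation), which keeps the optimization trajectory from drifting along symmetry directions. This markedly improves training on language and vision models and allows precise measurement of a model's "dead directions" (ineffective parameter dimensions).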
A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer's state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar\Theta = \Theta/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), proves exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network's gauge symmetry sharpens the minimum it finds and turns that minimum's geometry into something the trajectory can measure.
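As a small illustration of two of the gauges named above (the cross-entropy logit shift and the ReLU rescaling), the following NumPy sketch checks numerically that the loss does not change along either symmetry orbit. It is not the paper's DDC implementation: the two-layer network, the function names `relu_mlp` and `cross_entropy`, the shapes, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np

def relu_mlp(x, W1, W2):
    """Hypothetical two-layer ReLU network: logits = W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, numerically stabilised."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

# Illustrative sizes only: 4 inputs, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
label = 1

base_loss = cross_entropy(relu_mlp(x, W1, W2), label)

# ReLU rescaling gauge: multiply hidden unit i's incoming weights by s_i > 0
# and divide its outgoing weights by s_i. Since relu(s * a) = s * relu(a)
# for s > 0, the logits (and therefore the loss) are unchanged.
s = rng.uniform(0.5, 2.0, size=8)
rescaled_loss = cross_entropy(relu_mlp(x, s[:, None] * W1, W2 / s[None, :]), label)

# Logit shift gauge: adding the same constant to every logit leaves the
# softmax probabilities, and hence the cross-entropy, unchanged.
shifted_loss = cross_entropy(relu_mlp(x, W1, W2) + 3.7, label)

print(abs(rescaled_loss - base_loss), abs(shifted_loss - base_loss))
```

Both printed differences should be at the level of floating-point round-off. The parameters `(W1, W2)` and `(s[:, None] * W1, W2 / s[None, :])` are distinct points in parameter space that represent the same function, which is the sense in which the optimization is said to live on the quotient $\bar\Theta = \Theta/G$.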
Source: arXiv: 2606.29176