arXiv submission date: 2026-04-28
📄 Abstract - Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

We show that replacing the rolling SVD of AdamW updates with a rolling SVD of loss gradients changes the diagnostic by one to two orders of magnitude. Performing SVD on the loss gradient instead of the AdamW update increases the measured perturbative coupling between SED directions and Linear Centroid Hypothesis (LCH) features from $\bar{R}_k \approx 3$--$9\times$ to $100$--$330\times$ across four single-task modular arithmetic operations, eliminating the apparent operation dependence in the original measurement. On a multitask transformer with a shared encoder, update-based SED gives $\bar{R}_k \leq 1$ (an apparent failure of the diagnostic), while per-operation gradient-based SED recovers $\bar{R}_k = 20$--$45\times$ across all four operations. Gradient aggregation across competing tasks is the main obstruction; performing SVD on per-task gradients resolves it. A causal intervention shows that constraining attention updates to any rank-3 subspace (whether SED-derived or random) accelerates grokking by approximately $2.3\times$ across random seeds and operations, while removing the rank-3 component has negligible effect under proper gradient-projection methodology. The SED-LCH coupling is therefore a strong diagnostic of where feature formation concentrates in parameter space, but it is not a unique causal pathway: the natural full-rank AdamW attention update is highly rank-redundant under our hyperparameters.
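The core diagnostic contrasts two inputs to a rolling SVD, the AdamW update and the per-task loss gradient, and compares the resulting top-$k$ directions against LCH feature directions. The abstract does not give the exact formula for $\bar{R}_k$, so the sketch below assumes a simple form: the mean energy of unit-norm LCH feature directions captured by the rank-$k$ SED subspace, normalized by the same energy under a random rank-$k$ subspace. All arrays (logged updates, per-task gradients, LCH features) are random placeholders.

```python
import numpy as np

def topk_directions(history, k=3):
    """Top-k directions in parameter space from a rolling window.

    `history` stacks flattened per-step vectors (AdamW updates or
    per-task loss gradients) as rows: shape (window, n_params).
    The dominant right singular vectors span the directions the
    window concentrates on.
    """
    _, _, vt = np.linalg.svd(history, full_matrices=False)
    return vt[:k]                                   # (k, n_params), orthonormal rows

def coupling_ratio(sed_dirs, lch_feats, rng):
    """Hypothetical R-bar_k: mean energy of unit-norm LCH feature
    directions inside the rank-k SED subspace, normalized by the same
    energy under a random rank-k subspace. (This definition is an
    assumption, not the paper's exact formula.)"""
    def subspace_energy(basis):                     # basis: (k, n_params), orthonormal rows
        proj = lch_feats @ basis.T                  # (n_feats, k)
        return np.mean(np.sum(proj ** 2, axis=1))

    k, n_params = sed_dirs.shape
    rand_basis, _ = np.linalg.qr(rng.standard_normal((n_params, k)))
    return subspace_energy(sed_dirs) / subspace_energy(rand_basis.T)

# --- toy usage: every array below is a random placeholder ------------
rng = np.random.default_rng(0)
window, n_params, k = 64, 512, 3

adamw_updates  = rng.standard_normal((window, n_params))   # logged optimizer steps
task_gradients = rng.standard_normal((window, n_params))   # logged per-task loss gradients

lch = rng.standard_normal((8, n_params))                   # placeholder LCH feature directions
lch /= np.linalg.norm(lch, axis=1, keepdims=True)

print("update-based   R_k:", coupling_ratio(topk_directions(adamw_updates, k), lch, rng))
print("gradient-based R_k:", coupling_ratio(topk_directions(task_gradients, k), lch, rng))
```

With these random placeholders both ratios come out near 1; the abstract's claim is that on logged training data the gradient-based variant separates from the update-based one by one to two orders of magnitude.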

Top-level tags: machine learning, model training, theory
Detailed tags: singular value decomposition, linear centroid hypothesis, gradient analysis, grokking, attention mechanisms

Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories


1️⃣ One-sentence summary

This paper finds that performing singular value decomposition directly on the loss gradient, rather than on the optimizer update, reveals far more clearly the strong coupling between specific directions in the network and linear-centroid features. That coupling is an important indicator of feature formation, but it is not a unique causal pathway, and the model's attention updates are highly rank-redundant.
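The causal intervention behind the rank-redundancy claim constrains each attention-weight update to a rank-3 subspace. The minimal sketch below reads "rank-3" as a 3-dimensional subspace of the flattened attention-weight space (an assumption; matrix rank 3 is another possible reading), and uses a random basis, mirroring the finding that a random subspace works about as well as an SED-derived one.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, k = 64, 3
n_params = d_model * d_model

# Orthonormal basis for a 3-dimensional subspace of the flattened
# attention-weight space. In the paper this could be SED-derived or,
# notably, random; here it is random.
basis, _ = np.linalg.qr(rng.standard_normal((n_params, k)))    # (n_params, k)

def constrain_update(update, basis):
    """Keep only the component of a proposed weight update that lies
    in the span of `basis` (the rank-k constrained step); subtracting
    this component from `update` instead would give the removal
    intervention."""
    coeffs = basis.T @ update.reshape(-1)          # coordinates in the subspace
    return (basis @ coeffs).reshape(update.shape)

# Toy step: the optimizer proposes a full-rank update, but only its
# rank-3 component is applied to the attention weight matrix.
W_attn = rng.standard_normal((d_model, d_model))
proposed = 1e-3 * rng.standard_normal((d_model, d_model))
W_attn += constrain_update(proposed, basis)
```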

Source: arXiv:2604.25143