
arXiv submission date: 2026-03-26
📄 Abstract - A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) **depth has a non-monotonic effect**, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) **the apparent gap between Transformers and MLPs largely disappears** (1.11× delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) **activation function effects are regime-dependent**, with GELU up to 4.3× faster than ReLU only when regularization permits memorization; and (4) **weight decay is the dominant control parameter**, exhibiting a narrow "Goldilocks" regime in which grokking occurs, while too little or too much prevents generalization. Across 3–5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.
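The task studied in the abstract, modular addition (mod 97), is typically set up as a supervised dataset of all input pairs with a random train/test split, and grokking is observed as test accuracy lagging far behind train accuracy. A minimal sketch of that dataset construction is below; the function name, split fraction, and seed are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (not the authors' code): the modular-addition task
# enumerates all p*p pairs (a, b) labeled with (a + b) mod p, then
# splits them randomly into train and test sets.
import itertools
import random

def make_mod_add_dataset(p=97, train_frac=0.5, seed=0):
    """Return (train, test) lists of ((a, b), label) pairs for addition mod p."""
    pairs = [((a, b), (a + b) % p)
             for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]

train, test = make_mod_add_dataset()
print(len(train), len(test))  # 4704 4705 for p=97, train_frac=0.5
```

Because the full input space is only 97² = 9409 examples, a model can memorize the training half outright, which is what makes the delayed jump in test accuracy (the "grokking" transition) visible.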

Top-level tags: theory model training machine learning
Detailed tags: grokking generalization optimization regularization neural networks

A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization


1️⃣ One-sentence summary

Through a series of carefully controlled experiments, this paper finds that the "grokking" phenomenon in neural network training (a model's sudden shift from rote memorization to genuinely learning the underlying rule) is driven not primarily by network architecture, but by the subtle interplay between the stability of the optimization process and the strength of regularization.

Source: arXiv: 2603.25009