
arXiv submission date: 2026-03-26
📄 Abstract - A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) **depth has a non-monotonic effect**, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) **the apparent gap between Transformers and MLPs largely disappears** (1.11× delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) **activation function effects are regime-dependent**, with GELU up to 4.3× faster than ReLU only when regularization permits memorization; and (4) **weight decay is the dominant control parameter**, exhibiting a narrow "Goldilocks" regime in which grokking occurs, while too little or too much prevents generalization. Across 3–5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.
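The task studied in the abstract, modular addition (mod 97), is typically set up as a supervised dataset of all input pairs with a random train/test split, and grokking is observed as test accuracy lagging far behind train accuracy. A minimal sketch of that dataset construction is below; the function name, split fraction, and seed are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (not the authors' code): the modular-addition task
# enumerates all p*p pairs (a, b) labeled with (a + b) mod p, then
# splits them randomly into train and test sets.
import itertools
import random

def make_mod_add_dataset(p=97, train_frac=0.5, seed=0):
    """Return (train, test) lists of ((a, b), label) pairs for addition mod p."""
    pairs = [((a, b), (a + b) % p)
             for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]

train, test = make_mod_add_dataset()
print(len(train), len(test))  # 4704 4705 for p=97, train_frac=0.5
```

Because the full input space is only 97² = 9409 examples, a model can memorize the training half outright, which is what makes the delayed jump in test accuracy (the "grokking" transition) visible.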

Top-level tags: theory model training machine learning
Detailed tags: grokking generalization optimization regularization neural networks

A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization


1️⃣ One-sentence summary

Through a series of carefully controlled experiments, this paper finds that the "grokking" phenomenon in neural network training (a model's sudden shift from rote memorization to genuinely learning the underlying rule) is driven not primarily by network architecture, but by the subtle interplay between the stability of the optimization process and the strength of regularization.

Source: arXiv: 2603.25009