arXiv submission date: 2026-06-11
📄 Abstract - Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
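
The reported sign-test value can be checked by hand: if all nine leads are positive, a two-sided exact sign test gives 2 × (1/2)^9 ≈ 0.0039, which matches the quoted p~0.004. The sketch below assumes the test is two-sided and exact; the abstract does not state which variant the authors used, and the function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(n_positive: int, n_total: int) -> float:
    """Two-sided exact sign test p-value under H0: P(positive) = 0.5.

    Assumption: two-sided and exact (binomial); the abstract does not say.
    """
    # Take the larger of the two sign counts as the observed extreme.
    k = max(n_positive, n_total - n_positive)
    # Upper-tail probability of seeing k or more of one sign out of n_total.
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


p_value = sign_test_p(9, 9)
print(round(p_value, 4))  # 0.0039
```

A one-sided test on the same data would give half this value (about 0.002), so the quoted figure is consistent with the two-sided reading.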

Top-level tags: machine learning llm theory
Detailed tags: grokking fourier circuit synchronization weight decay causal analysis

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers


1️⃣ One-sentence summary

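This paper finds that, during training, synchronization among the components of a transformer's internal "Fourier circuit" (measured by a new metric, FSD) occurs hundreds to thousands of steps before the model's generalization ability suddenly jumps (the grokking phenomenon), and that controlling weight decay makes it possible to predict and manipulate this time gap, revealing an early circuit-level precursor of the generalization leap and the regularization-driven mechanism behind it.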
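
One way to test a relation of the form Delta_t ~ C/lambda between weight decay and the time gap is a one-parameter least-squares fit through the origin in the variable 1/lambda. The sketch below is not the authors' procedure: the function name `fit_inverse_law` is hypothetical, and the numbers are synthetic values made up for illustration, not results from the paper.

```python
def fit_inverse_law(lambdas, delays):
    """Least-squares estimate of C in delays ≈ C / lambda, plus R^2.

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [1.0 / lam for lam in lambdas]
    # Closed-form slope for a regression through the origin: C = Σxy / Σx².
    c = sum(x * y for x, y in zip(xs, delays)) / sum(x * x for x in xs)
    mean_y = sum(delays) / len(delays)
    ss_res = sum((y - c * x) ** 2 for x, y in zip(xs, delays))
    ss_tot = sum((y - mean_y) ** 2 for y in delays)
    return c, 1.0 - ss_res / ss_tot


# Synthetic data generated from an exact C / lambda law (NOT paper results).
lambdas = [0.5, 1.0, 2.0, 4.0]
delays = [2000.0 / lam for lam in lambdas]

c, r2 = fit_inverse_law(lambdas, delays)
print(c, r2)
```

Because the synthetic delays follow the law exactly, the fit recovers the constant used to generate them and an R^2 of 1; measured gaps would scatter around the fitted curve.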

Source: arXiv: 2606.12966