Abstract - Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which in the large-token limit is described by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance between the two distributions scales like $\sqrt{{\log(\beta+1)}/{\beta}}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and the inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that on time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
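The token dynamics underlying the mean-field equation can be illustrated with a minimal numerical sketch. The following is an assumption-laden toy version, not the paper's actual experiment: it evolves $n$ tokens on the unit circle under softmax self-attention with inverse temperature $\beta$, taking the key, query, and value matrices to be the identity for simplicity (the paper treats general matrices).

```python
import numpy as np

def attention_step(X, Q, K, V, beta, dt):
    """One explicit Euler step of the self-attention particle dynamics,
    followed by projection back onto the unit sphere."""
    # logits[i, j] = beta * <Q x_i, K x_j>
    logits = beta * (X @ Q.T) @ (X @ K.T).T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
    X_new = X + dt * (A @ X @ V.T)                # move toward attention-weighted means
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n = 2, 200
Q = K = V = np.eye(d)   # illustrative choice only
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # tokens on the unit circle

Y = X.copy()
for _ in range(500):
    Y = attention_step(Y, Q, K, V, beta=100.0, dt=0.05)
```

Running the same loop at small versus large `beta` lets one compare how quickly the empirical token distribution concentrates; in this toy setting, larger `beta` typically produces faster and sharper clustering, consistent with the low-temperature regime studied above.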
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
1️⃣ One-Sentence Summary
This work proves mathematically that, during inference in deep Transformer models, the distribution of a large number of tokens rapidly concentrates onto a specific distribution determined by the projection induced by the attention mechanism, and that this concentrated state remains stable for a moderate period of time; the rate of concentration depends on the temperature parameter and the inference time, and numerical experiments confirm the theoretical predictions.