Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics
1️⃣ One-Sentence Summary
Starting from the physical principle of least action, this paper is the first to treat the attention mechanism in Transformer models as a system governed by the laws of thermodynamics and information dynamics, thereby providing a unified theoretical framework for the underlying mechanisms of artificial intelligence and explaining the emergent phenomena observed during model training.
Although the Transformer architecture has revolutionized artificial intelligence, its underlying mechanisms remain largely heuristic and lack a unified physical theory. In this work, we propose a first-principles framework for information dynamics, treating the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold equipped with the Fisher information metric, we derive the intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state that minimizes the Helmholtz free energy of the information gas. In addition, we identify the query-key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. This theory establishes the first law of information thermodynamics, unifying inference (mechanical work) and learning (chemical evolution). It also explains emergent phenomena such as scaling laws and grokking as phase transitions characterized by a divergence of the specific heat. Finally, we discuss how rotational symmetry breaking in the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary positional embeddings (RoPE). Our work connects statistical physics and deep learning, laying the groundwork for a general theory of physics-based intelligence.
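To make the softmax claim concrete, here is the standard variational derivation from statistical mechanics, a minimal sketch assuming the natural identification of attention logits with negative energies, $E_i = -q \cdot k_i$, and of the scaling factor with temperature, $T = \sqrt{d_k}$; the paper's exact notation may differ. Over distributions $p$ on the probability simplex, the Helmholtz free energy of the "information gas" is

$$
F[p] = \underbrace{\sum_i p_i E_i}_{\text{internal energy } U} - T \underbrace{\Big(-\sum_i p_i \log p_i\Big)}_{\text{entropy } S}.
$$

Introducing a Lagrange multiplier $\lambda$ for the normalization constraint $\sum_i p_i = 1$ and setting the derivative to zero gives

$$
\frac{\partial}{\partial p_i}\Big( F[p] + \lambda \big(\textstyle\sum_j p_j - 1\big) \Big) = E_i + T(\log p_i + 1) + \lambda = 0
\;\;\Longrightarrow\;\;
p_i = \frac{e^{-E_i/T}}{Z}, \qquad Z = \sum_j e^{-E_j/T}.
$$

Since the entropy is strictly concave, $F$ is strictly convex in $p$, so this Gibbs distribution is the unique minimizer. Substituting $E_i = -q \cdot k_i$ and $T = \sqrt{d_k}$ recovers $p_i = \mathrm{softmax}\big(q \cdot k_i / \sqrt{d_k}\big)$, exactly scaled dot-product attention.

For the RoPE remark at the end of the abstract, recall what RoPE itself does (independently of the paper's field-theoretic reading): it rotates each 2-D subspace of the query and key by a position-dependent angle, so that the attention score depends only on relative position,

$$
R_\theta(m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix},
\qquad
\langle R_\theta(m)\, q,\; R_\theta(n)\, k \rangle = \langle q,\; R_\theta(n-m)\, k \rangle,
$$

which follows from the orthogonality of rotations, $R_\theta(m)^\top R_\theta(n) = R_\theta(n-m)$. The abstract's claim is that this rotational structure is the signature of a broken continuous symmetry.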
Source: arXiv:2602.08216