arXiv submission date: 2026-04-16
📄 Abstract - Gating Enables Curvature: A Geometric Expressivity Gap in Attention

Multiplicative gating is widely used in neural architectures and has recently been applied to attention layers to improve performance and training stability in large language models. Despite this success, the mathematical implications of gated attention mechanisms remain poorly understood. We study attention through the geometry of its representations by modeling outputs as mean parameters of Gaussian distributions and analyzing the induced Fisher--Rao geometry. We show that the ungated attention operator is restricted to intrinsically flat statistical manifolds due to its affine structure, while multiplicative gating enables non-flat geometries, including positively curved manifolds that are unattainable in the ungated setting. These results establish a geometric expressivity gap between ungated and gated attention. Empirically, we show that gated models exhibit higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, whereas they provide no consistent advantage on tasks with linear decision boundaries. Furthermore, we identify a structured regime in which curvature accumulates under composition, yielding a systematic depth-amplification effect.
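The contrast the abstract draws can be made concrete with a minimal sketch. Below, ungated scaled dot-product attention produces an output that is affine in the values, while a multiplicative sigmoid gate (here applied elementwise to the attention output via a hypothetical gate projection `Wg`; the paper's exact gate placement may differ) introduces the nonlinearity that, per the paper's analysis, permits curved representation geometries.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Ungated scaled dot-product attention: output is affine in the values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

def gated_attention(X, Wq, Wk, Wv, Wg):
    """Multiplicative gating: an input-dependent sigmoid gate scales the
    attention output elementwise, breaking its affine structure.
    (Wg and the sigmoid gate here are illustrative assumptions, not
    necessarily the paper's exact parameterization.)"""
    out = attention(X, Wq, Wk, Wv)
    gate = 1.0 / (1.0 + np.exp(-(X @ Wg)))  # sigmoid in (0, 1)
    return gate * out

# Tiny demo: gating changes the output of the same attention layer.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wg = (0.1 * rng.standard_normal((d, d)) for _ in range(4))
y_ungated = attention(X, Wq, Wk, Wv)
y_gated = gated_attention(X, Wq, Wk, Wv, Wg)
```

The elementwise product `gate * out` is the "multiplicative gating" of the title: it makes the layer's output a nonlinear function of the input even conditional on the attention weights, which is what opens up the non-flat geometries the paper studies.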

Top-level tags: theory, natural language processing, model training
Detailed tags: attention mechanisms, geometric deep learning, fisher-rao geometry, gated attention, representation curvature

Gating Enables Curvature: A Geometric Expressivity Gap in Attention


1️⃣ One-sentence summary

This paper explains, from a geometric perspective, why adding multiplicative gating to attention mechanisms improves model performance: gating allows the model to learn more complex nonlinear decision boundaries, whereas ungated attention can only express simpler, flat (linear) structures.

Source: arXiv:2604.14702