Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths
1️⃣ One-sentence summary
This paper proposes a new neural network architecture called Gecko, whose improved design lets it process extremely long text sequences more efficiently than current mainstream models and accurately retrieve information from very long contexts without any additional techniques.
Designing a unified neural network that efficiently and inherently processes sequential data of arbitrary length is a central and challenging problem in sequence modeling. The design choices in Transformers, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces several technical components to improve its ability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and approaching Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: this https URL
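To make the "sliding chunk attention" idea named in the abstract concrete, here is a minimal PyTorch sketch in which each token attends within its own chunk plus the immediately preceding one, keeping cost linear in sequence length. The function name, chunk layout, and the omission of causal masking are assumptions for illustration only, not the paper's actual mechanism.

```python
import torch
import torch.nn.functional as F

def sliding_chunk_attention(q, k, v, chunk_size=4):
    """Attention restricted to each chunk plus the previous chunk.

    q, k, v: (batch, seq_len, dim); seq_len must be divisible by chunk_size.
    Cost is O(seq_len * chunk_size) rather than O(seq_len ** 2).
    Note: causal masking within and across chunks is omitted for brevity.
    """
    b, n, d = q.shape
    c = chunk_size
    qc = q.view(b, n // c, c, d)

    def with_prev_chunk(x):
        # Shift the sequence right by one chunk so that chunk i of the
        # shifted copy holds chunk i-1 of the original (zeros for chunk 0),
        # then concatenate it in front of each chunk's own keys/values.
        prev = F.pad(x, (0, 0, c, 0))[:, :-c].view(b, n // c, c, d)
        return torch.cat([prev, x.view(b, n // c, c, d)], dim=2)

    kc, vc = with_prev_chunk(k), with_prev_chunk(v)
    scores = torch.einsum('bhcd,bhmd->bhcm', qc, kc) / d ** 0.5
    out = torch.einsum('bhcm,bhmd->bhcd', scores.softmax(dim=-1), vc)
    return out.reshape(b, n, d)

# Tiny usage example with random tensors:
x = torch.randn(1, 16, 8)
y = sliding_chunk_attention(x, x, x, chunk_size=4)
print(y.shape)  # torch.Size([1, 16, 8])
```

Because each chunk only looks one chunk back, a single layer has a bounded receptive field; stacking layers (and, in the paper's case, combining with the EMA/gating path) is what lets information propagate across much longer distances.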
Source: arXiv: 2601.06463