arXiv submission date: 2026-01-10
📄 Abstract - Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Designing a unified neural network to efficiently and inherently process sequential data of arbitrary length is a central and challenging problem in sequence modeling. The design choices in Transformers, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention) and further introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to $4\times$ longer than its attention window. Code: this https URL
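The abstract names an exponential moving average (EMA) with gated attention as the backbone Gecko inherits from Mega and Megalodon. As a rough illustration of that recurrent component, here is a minimal sketch of a per-dimension damped EMA of the kind those architectures use; the function name, parameterization, and values below are assumptions for illustration only, not Gecko's actual implementation.

```python
import numpy as np

def damped_ema(x, alpha, delta):
    """Damped exponential moving average over a sequence.

    Minimal sketch of the EMA component in Mega/Megalodon-style
    architectures (assumed parameterization, not taken from the paper).

    x:     (seq_len, dim) input sequence
    alpha: (dim,) gate in [0, 1], how much of the new input enters the state
    delta: (dim,) damping factor in [0, 1], how fast past state decays
    """
    y = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}
        y = alpha * x[t] + (1.0 - alpha * delta) * y
        out[t] = y
    return out

# Hypothetical usage: a length-8 sequence of 4-dimensional features.
x = np.random.randn(8, 4)
alpha = np.full(4, 0.5)
delta = np.full(4, 0.9)
print(damped_ema(x, alpha, delta).shape)  # (8, 4)
```

Because the state carries a decaying summary of the whole history, this recurrence runs in linear time in sequence length, which is the property the gated-attention designs above exploit for long contexts.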

Top-level tags: natural language processing, model training, systems
Detailed tags: sequence modeling, long-context processing, neural architecture, attention mechanism, efficiency

Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths


1️⃣ One-sentence summary

This paper proposes a new neural network architecture called Gecko, whose improved design lets it process extremely long text sequences more efficiently than current mainstream models and accurately retrieve information from ultra-long contexts without any additional context-extension techniques.

Source: arXiv: 2601.06463