End-to-End Test-Time Training for Long Context
1️⃣ One-Sentence Summary
This paper proposes a new method that treats long-context language modeling as a continual learning problem: the model keeps learning from the given context at test time via self-supervised next-token prediction, and its initialization for this test-time learning is optimized through meta-learning at training time, achieving scaling with context length comparable to a standard full-attention Transformer while preserving fast, constant-latency inference.
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
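To make the test-time learning loop concrete, below is a minimal sketch in PyTorch of the core idea: before predicting, the model takes a few gradient steps of next-token prediction on the given context, so that the context is compressed into its weights. The toy `TinyCausalLM`, the number of steps, and the learning rate are illustrative assumptions; the paper's actual model is a sliding-window-attention Transformer whose initialization is meta-learned at training time, which this sketch does not reproduce.

```python
# Minimal sketch of the test-time training (TTT) idea: adapt the model to the
# given context via next-token prediction before generating. Toy model and
# hyperparameters are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 256, 64

class TinyCausalLM(nn.Module):
    """A stand-in causal language model (embedding -> GRU -> logits)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                    # logits: (batch, seq, vocab)

def test_time_train(model, context, steps=4, lr=1e-3):
    """Take a few gradient steps of next-token prediction on `context`,
    compressing the context into the model's weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    inputs, targets = context[:, :-1], context[:, 1:]
    for _ in range(steps):
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    model = TinyCausalLM()
    context = torch.randint(0, VOCAB, (1, 512))  # stands in for a long context
    test_time_train(model, context)              # weights now encode the context
    with torch.no_grad():
        next_token = model(context)[0, -1].argmax()
    print(int(next_token))
```

Because the context is absorbed into the weights rather than kept in an ever-growing attention cache, per-token inference cost stays constant with context length, which is the source of the latency advantage the abstract reports over full attention at 128K context.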
Source: arXiv: 2512.23675