arXiv submission date: 2026-01-24
📄 Abstract - Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.
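To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of how a lightweight per-head router could switch between full and sparse attention at test time. The class name `AttentionRouter`, the mean-pooled routing signal, the sliding-window sparse pattern, and all shapes are illustrative assumptions; the dense masks are for clarity only, since an actual deployment would need kernel-level sparse attention to realize any speedup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionRouter(nn.Module):
    """Scores each head's preference for full vs. sparse attention (assumed design)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # One logit pair per head: [full, sparse].
        self.proj = nn.Linear(hidden_size, num_heads * 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        pooled = hidden_states.mean(dim=1)                    # (batch, hidden)
        logits = self.proj(pooled).view(-1, self.num_heads, 2)
        # Hard per-head decision at test time; training would need a
        # differentiable relaxation (e.g. Gumbel-softmax), omitted here.
        return logits.argmax(dim=-1)                          # 0 = full, 1 = sparse


def sliding_window_mask(seq_len: int, window: int, device) -> torch.Tensor:
    """Causal mask restricted to a local window (one simple sparse pattern)."""
    idx = torch.arange(seq_len, device=device)
    causal = idx[None, :] <= idx[:, None]
    local = (idx[:, None] - idx[None, :]) < window
    return causal & local


def elastic_attention(q, k, v, head_modes, window: int = 128):
    """q, k, v: (batch, heads, seq, head_dim); head_modes: (batch, heads)."""
    b, h, n, d = q.shape
    full_mask = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    sparse_mask = sliding_window_mask(n, window, q.device)
    # Pick a mask per head according to the router's decision (broadcasts to b, h, n, n).
    mask = torch.where(head_modes.bool()[..., None, None], sparse_mask, full_mask)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Illustrative usage with toy shapes.
b, n, hidden, heads, head_dim = 1, 256, 512, 8, 64
x = torch.randn(b, n, hidden)
router = AttentionRouter(hidden, heads)
modes = router(x)                                   # (1, 8) per-head decisions
q = k = v = torch.randn(b, heads, n, head_dim)
out = elastic_attention(q, k, v, modes)             # (1, 8, 256, 64)
```

The key design point illustrated here is that the routing decision is input-dependent and made per head, so the model's overall sparsity ratio varies with the prompt rather than being fixed ahead of time.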

Top-level tags: llm, model training, systems
Detailed tags: efficient transformers, adaptive sparsity, long-context, attention mechanisms, inference optimization

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers


1️⃣ One-sentence summary

This paper proposes a method called "Elastic Attention" that lets large language models dynamically adjust their attention computation based on the input when processing long texts, achieving more efficient inference while maintaining strong performance.

Source: arXiv:2601.17367