
arXiv submission date: 2026-04-09
📄 Abstract - AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
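The two-level selection described in the abstract can be sketched in a few lines: a coarse pass scores whole KV blocks (here via mean-pooled key centroids, one plausible choice) and keeps the top few, then a fine pass scores individual tokens only inside the surviving blocks. This is a minimal illustrative sketch, not the paper's implementation; the function name, pooling choice, and parameters are assumptions.

```python
import numpy as np

def two_level_sparse_select(q, K, block_size=4, top_blocks=2, top_tokens=4):
    """Hypothetical sketch of two-level sparse attention selection.

    Stage 1 (coarse): score each KV block by the query's dot product with
    the block's mean-pooled key, and keep only the top-scoring blocks.
    Stage 2 (fine): score individual tokens within the surviving blocks
    and return the indices of the top-scoring tokens.
    """
    n, d = K.shape
    n_blocks = n // block_size
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Stage 1: block-level filtering via mean-pooled key centroids.
    centroids = blocks.mean(axis=1)                # (n_blocks, d)
    block_scores = centroids @ q                   # (n_blocks,)
    keep = np.argsort(block_scores)[-top_blocks:]  # kept block indices

    # Stage 2: token-level selection restricted to the kept blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep]
    )
    token_scores = K[cand] @ q
    sel = cand[np.argsort(token_scores)[-top_tokens:]]
    return np.sort(sel)

# Toy usage: 16 keys in 4 blocks, query taken from the cache itself.
rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
q = K[13]
idx = two_level_sparse_select(q, K)
```

Because token scoring never touches tokens outside the kept blocks, the fine stage examines only `top_blocks * block_size` candidates instead of all `n`, which is where the indexing savings over pure token-level selection come from.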

Top-level tags: llm systems model training
Detailed tags: sparse attention kv cache long-context inference efficient inference asynchronous offloading

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention


1️⃣ One-sentence summary

This paper proposes a system called AsyncTLS that reduces computation by combining coarse-grained block filtering with fine-grained token selection, and uses asynchronous offloading so that data transfer and computation proceed in parallel, substantially improving the inference speed and efficiency of large language models on very long texts while maintaining high accuracy.
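The asynchronous-offloading idea in the summary, overlapping KV cache transfers with computation, can be illustrated with a simple double-buffered producer/consumer pipeline: a background thread "fetches" the next chunk while the main thread computes on the current one. This is a toy sketch of the overlap pattern only; the function names and the stand-in `fetch`/`compute` callables are hypothetical, and a real engine would use device streams rather than Python threads.

```python
import queue
import threading

def overlapped_pipeline(chunks, fetch, compute):
    """Hypothetical sketch of overlapping transfer with compute.

    A producer thread prefetches each chunk (simulating a host-to-device
    KV cache copy) into a one-slot queue while the consumer computes on
    the previously fetched chunk, so transfer and compute overlap.
    """
    buf = queue.Queue(maxsize=1)  # one chunk in flight (double buffering)

    def producer():
        for c in chunks:
            buf.put(fetch(c))     # simulated transfer of the next chunk
        buf.put(None)             # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := buf.get()) is not None:
        results.append(compute(item))  # runs while the next fetch proceeds
    return results

# Toy usage with stand-in transfer and compute steps.
out = overlapped_pipeline([1, 2, 3],
                          fetch=lambda c: c * 10,
                          compute=lambda x: x + 1)
# out == [11, 21, 31]
```

The one-slot queue bounds memory to a single in-flight chunk; when compute time and transfer time are comparable, the transfer cost is largely hidden behind computation.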

Source: arXiv:2604.07815