
arXiv submission date: 2026-04-09
📄 Abstract - AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
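The two-level selection described in the abstract can be sketched in a few lines: a coarse pass scores whole KV blocks (here via mean-pooled key centroids, one plausible choice) and keeps the top few, then a fine pass scores individual tokens only inside the surviving blocks. This is a minimal illustrative sketch, not the paper's implementation; the function name, pooling choice, and parameters are assumptions.

```python
import numpy as np

def two_level_sparse_select(q, K, block_size=4, top_blocks=2, top_tokens=4):
    """Hypothetical sketch of two-level sparse attention selection.

    Stage 1 (coarse): score each KV block by the query's dot product with
    the block's mean-pooled key, and keep only the top-scoring blocks.
    Stage 2 (fine): score individual tokens within the surviving blocks
    and return the indices of the top-scoring tokens.
    """
    n, d = K.shape
    n_blocks = n // block_size
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Stage 1: block-level filtering via mean-pooled key centroids.
    centroids = blocks.mean(axis=1)                # (n_blocks, d)
    block_scores = centroids @ q                   # (n_blocks,)
    keep = np.argsort(block_scores)[-top_blocks:]  # kept block indices

    # Stage 2: token-level selection restricted to the kept blocks.
    cand = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep]
    )
    token_scores = K[cand] @ q
    sel = cand[np.argsort(token_scores)[-top_tokens:]]
    return np.sort(sel)

# Toy usage: 16 keys in 4 blocks, query taken from the cache itself.
rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
q = K[13]
idx = two_level_sparse_select(q, K)
```

Because token scoring never touches tokens outside the kept blocks, the fine stage examines only `top_blocks * block_size` candidates instead of all `n`, which is where the indexing savings over pure token-level selection come from.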

Top-level tags: llm systems model training
Detailed tags: sparse attention kv cache long-context inference efficient inference asynchronous offloading

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention


1️⃣ One-sentence summary

This paper proposes a system called AsyncTLS that reduces computation by combining coarse-grained block filtering with fine-grained token selection, and uses asynchronous offloading so that data transfer and computation proceed in parallel, substantially improving the inference speed and efficiency of large language models on very long texts while maintaining high accuracy.
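The asynchronous-offloading idea in the summary, overlapping KV cache transfers with computation, can be illustrated with a simple double-buffered producer/consumer pipeline: a background thread "fetches" the next chunk while the main thread computes on the current one. This is a toy sketch of the overlap pattern only; the function names and the stand-in `fetch`/`compute` callables are hypothetical, and a real engine would use device streams rather than Python threads.

```python
import queue
import threading

def overlapped_pipeline(chunks, fetch, compute):
    """Hypothetical sketch of overlapping transfer with compute.

    A producer thread prefetches each chunk (simulating a host-to-device
    KV cache copy) into a one-slot queue while the consumer computes on
    the previously fetched chunk, so transfer and compute overlap.
    """
    buf = queue.Queue(maxsize=1)  # one chunk in flight (double buffering)

    def producer():
        for c in chunks:
            buf.put(fetch(c))     # simulated transfer of the next chunk
        buf.put(None)             # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := buf.get()) is not None:
        results.append(compute(item))  # runs while the next fetch proceeds
    return results

# Toy usage with stand-in transfer and compute steps.
out = overlapped_pipeline([1, 2, 3],
                          fetch=lambda c: c * 10,
                          compute=lambda x: x + 1)
# out == [11, 21, 31]
```

The one-slot queue bounds memory to a single in-flight chunk; when compute time and transfer time are comparable, the transfer cost is largely hidden behind computation.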

Source: arXiv:2604.07815