Abstract - Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling them to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att), which uses a block-wise pre-downsampling operation to identify informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between the pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates the full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation and maintains near-full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
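1️⃣ One-Sentence Summary
This paper proposes BA-Att, a block approximate sparse attention framework that identifies informative regions in a compressed, low-resolution space rather than relying on fixed positional patterns, making diffusion language models efficient on ultra-long text: attention computation runs up to 6.95x faster than FlashAttention while performance stays close to full attention.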
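To make the central idea concrete, here is a minimal sketch of block selection in a downsampled space, followed by attention restricted to the selected blocks. It is not the paper's implementation. Mean pooling as the downsampling operator, top-k selection per query block, and the names `mean_pool`, `select_blocks`, and `block_sparse_attention` are all assumptions made for illustration. The norm-sorting module and the covariance-compensated correction are left out because the abstract does not specify them. The code is pure Python for readability, not speed.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(X, block):
    """Downsample rows of X by averaging each group of `block` consecutive rows.

    Mean pooling is an assumed choice of downsampling operator.
    """
    d = len(X[0])
    pooled = []
    for start in range(0, len(X), block):
        rows = X[start:start + block]
        pooled.append([sum(r[j] for r in rows) / len(rows) for j in range(d)])
    return pooled


def select_blocks(Q, K, block, keep_ratio):
    """For each query block, keep the key blocks with the highest pooled QK score."""
    Qp, Kp = mean_pool(Q, block), mean_pool(K, block)
    n_keep = max(1, math.ceil(keep_ratio * len(Kp)))
    selected = []
    for q in Qp:
        scores = [dot(q, k) for k in Kp]
        order = sorted(range(len(Kp)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(order[:n_keep]))
    return selected


def block_sparse_attention(Q, K, V, block=2, keep_ratio=0.5):
    """Bidirectional softmax attention computed only over the selected key blocks."""
    selected = select_blocks(Q, K, block, keep_ratio)
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        # Token indices covered by the key blocks chosen for this query's block.
        cols = [j for b in selected[i // block]
                for j in range(b * block, min((b + 1) * block, len(K)))]
        logits = [dot(q, K[j]) * scale for j in cols]
        m = max(logits)  # subtract the max for numerical stability
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([sum(wi * V[j][c] for wi, j in zip(w, cols)) / z
                    for c in range(len(V[0]))])
    return out


# Toy input: two blocks of two tokens each, with orthogonal directions per block.
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
K = [row[:] for row in Q]
V = [[1.0], [1.0], [2.0], [2.0]]

print(select_blocks(Q, K, block=2, keep_ratio=0.5))  # [[0], [1]]
sparse = block_sparse_attention(Q, K, V, block=2, keep_ratio=0.5)
full = block_sparse_attention(Q, K, V, block=2, keep_ratio=1.0)
```

With `keep_ratio=1.0` every key block is kept, so the function reduces to ordinary full attention, which gives a simple baseline for comparison. At `keep_ratio=0.5` (50% block sparsity), each query block attends only to the half of the key blocks that score highest in the pooled space.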