Abstract - SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to mitigate this cost. Training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72x inference speedup while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
1️⃣ One-Sentence Summary
This paper proposes a new method called SALAD, which pairs sparse attention in a video generation model with a lightweight linear attention branch and uses an input-dependent gating mechanism to dynamically balance the two. With almost no loss in generation quality, it substantially improves computational efficiency, reaching 90% attention sparsity and a 1.72x inference speedup, while requiring very little training data and compute.
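The core idea, two attention branches blended by a per-token gate, can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the feature map in the linear branch, the block mask for the sparse branch, and the gating projection `wg` are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, mask):
    # standard attention with pruned key positions set to -inf-like scores
    # (mask is a hypothetical boolean keep-mask; the paper's sparsity
    # pattern may differ)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def linear_attention(q, k, v):
    # kernelized linear attention: O(n*d^2) instead of O(n^2*d)
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # simple positive feature map
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                   # (d, d_v) summary of keys and values
    z = qf @ kf.sum(axis=0)         # per-query normalizer, shape (n,)
    return (qf @ kv) / z[:, None]

def salad_block(x, wq, wk, wv, wg, mask):
    # input-dependent gate g in (0, 1) blends the two branches per token
    q, k, v = x @ wq, x @ wk, x @ wv
    g = 1.0 / (1.0 + np.exp(-(x @ wg)))   # sigmoid gate, shape (n, 1)
    return g * sparse_attention(q, k, v, mask) + (1.0 - g) * linear_attention(q, k, v)
```

With a highly sparse mask, the sparse branch captures the strongest local interactions cheaply, while the linear branch supplies a coarse global summary; the gate decides per token how much of each to trust.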