arXiv submission date: 2026-03-15
📄 Abstract - ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.
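The abstract describes two stages: (1) correcting raw attention scores for the "attention shift" before selecting tokens, and (2) folding dropped tokens into similar kept ones via weighted merging. The sketch below is a hypothetical illustration of that pipeline, not the paper's actual algorithm; the positional soft-mask form, the `merge_thresh` parameter, and the similarity-based merge rule are all assumptions.

```python
import numpy as np

def asap_style_prune(tokens, attn, keep_ratio=0.2, merge_thresh=0.8):
    """Illustrative sketch of attention-shift-aware pruning + soft merging.

    tokens: (N, D) visual token features
    attn:   (N,) attention mass each visual token receives
    Returns the merged kept tokens and their original indices.
    """
    n = tokens.shape[0]
    # 1) Soft mask to counter attention shift: assume later tokens absorb
    #    inflated attention, so down-weight scores by position (illustrative).
    pos = np.arange(n) / max(n - 1, 1)
    soft_mask = 1.0 - 0.5 * pos
    scores = attn * soft_mask

    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]           # indices of kept tokens
    drop = np.setdiff1d(np.arange(n), keep)  # everything else

    # 2) Weighted soft merging: fold each dropped token into its most
    #    similar kept token, weighted by its corrected attention score.
    kept = tokens[keep].copy()
    weights = np.ones(k)
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    for d in drop:
        sims = normed[keep] @ normed[d]      # cosine similarity to kept set
        j = int(np.argmax(sims))
        if sims[j] >= merge_thresh:          # only merge clearly redundant tokens
            w = scores[d]
            kept[j] = (weights[j] * kept[j] + w * tokens[d]) / (weights[j] + w)
            weights[j] += w
    return kept, keep
```

With `keep_ratio=0.2`, the visual context shrinks to 20% of its tokens, which is the order of reduction behind the ~80% FLOPs saving the abstract reports; dropped-but-similar tokens still contribute through the merge rather than being discarded outright.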

Top-level tags: llm, multi-modal, model training
Detailed tags: vision-language models, efficient inference, token pruning, attention mechanism, kv-cache optimization

ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference


1️⃣ One-Sentence Summary

This paper proposes a new method called ASAP, which dynamically adjusts attention and merges similar information blocks to cut the computation a large vision-language model spends on images by about 80%, with almost no loss in performance and no additional training required.

Source: arXiv:2603.14549