When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics
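1️⃣ One-Sentence Summary
This paper proposes STS, a two-stage visual token pruning method that first preserves spatial and structural diversity through a repulsion mechanism and then selects the tokens relevant to the instruction's semantics. This avoids the loss of key details that conventional methods suffer when attention concentrates on a few regions, and improves the inference efficiency and task alignment of vision-language models.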
Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.
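To make the two-stage idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the repulsion mechanism or how cross-attention scores are aggregated, so the sketch assumes greedy farthest-point sampling in feature space as a stand-in for repulsion-based sampling, and assumes instruction relevance is the softmax cross-attention weight from text tokens to visual tokens, averaged over text tokens. All function names and parameters (`farthest_point_sampling`, `instruction_relevance`, `two_stage_prune`, `k_structure`, `k_final`) are hypothetical.

```python
import numpy as np

def farthest_point_sampling(features, k, start=0):
    """Assumed stand-in for repulsion-based sampling: greedily pick the token
    farthest (in feature space) from all tokens selected so far."""
    k = min(k, features.shape[0])
    selected = [start]
    # Distance from every token to its nearest already-selected token.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return np.array(selected)

def instruction_relevance(visual, text):
    """Assumed relevance score: scaled dot-product attention from text queries
    to visual keys, averaged over the text tokens."""
    d = visual.shape[1]
    logits = text @ visual.T / np.sqrt(d)          # shape (T, N)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over visual tokens
    return weights.mean(axis=0)                    # shape (N,)

def two_stage_prune(visual, text, k_structure, k_final):
    """Stage 1 keeps k_structure diverse tokens; stage 2 keeps the k_final of
    those that score highest against the instruction. Returns sorted indices."""
    stage1 = farthest_point_sampling(visual, k_structure)
    scores = instruction_relevance(visual[stage1], text)
    top = np.argsort(scores)[::-1][:k_final]
    return np.sort(stage1[top])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(64, 16))  # 64 visual tokens, dim 16
    text_tokens = rng.normal(size=(5, 16))     # 5 instruction tokens, dim 16
    kept = two_stage_prune(visual_tokens, text_tokens, k_structure=32, k_final=8)
    print(kept)
```

Because the diversity stage runs first, the instruction-aware stage can only choose among tokens that already cover the feature space, which is one plausible reading of "first ensuring geometric coverage and then refining the retained tokens according to semantic relevance". A version closer to the abstract's "spatial" diversity might also include each token's grid position in the distance computation; that is left out here for brevity.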
Source: arXiv: 2606.03569