arXiv submission date: 2026-05-15
📄 Abstract - LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
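The two steps described in the abstract (estimate a dominant subspace with PCA, then score each token by its projection residual) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `prune_by_pca_residual`, the `rank` and `keep` parameters, and the use of a plain SVD on mean-centered features are all hypothetical choices, and the abstract does not say at which layer the tokens are taken or how the rank is chosen.

```python
import numpy as np


def prune_by_pca_residual(tokens, keep, rank=8):
    """Keep the `keep` tokens that a rank-`rank` PCA subspace explains worst.

    tokens: (N, D) array of visual token features.
    keep:   number of tokens to retain (assumed hyperparameter).
    rank:   number of principal components (assumed hyperparameter).
    Returns (sorted indices of kept tokens, per-token residual norms).
    """
    X = np.asarray(tokens, dtype=np.float64)
    if not 0 < keep <= X.shape[0]:
        raise ValueError("keep must be between 1 and the number of tokens")

    # Center the tokens so the SVD yields principal directions.
    Xc = X - X.mean(axis=0, keepdims=True)

    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:rank]  # shape (rank, D)

    # Project onto the dominant subspace and measure what is left over.
    projection = Xc @ basis.T @ basis
    residual = np.linalg.norm(Xc - projection, axis=1)

    # Tokens with the largest residual are the ones the low-rank
    # background explains worst, so those are retained.
    kept = np.sort(np.argsort(-residual)[:keep])
    return kept, residual
```

Calling `prune_by_pca_residual(features, keep=64, rank=8)` on an `(N, D)` feature matrix returns the retained indices in their original order, so `features[kept]` preserves the token sequence order. The values 64 and 8 are placeholders, not settings from the paper.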

Top-level tag: multi-modal llm
Detailed tags: visual token pruning, low-rank compressibility, efficiency, attention-free

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs


1️⃣ One-sentence summary

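This paper finds that the image tokens (visual tokens) in vision-language models have an inherently low-rank structure, and builds on that finding to propose a pruning method that requires no retraining: it first uses PCA to identify the dominant low-dimensional subspace of the image tokens, then selects the more valuable tokens according to how far each one deviates from that subspace. This substantially reduces the model's computation without a noticeable drop in performance; for example, it can prune nearly 90% of the tokens while retaining 94.7% of image-understanding performance.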
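The low-rank observation itself can be probed on any `(N, D)` matrix of token features. The sketch below is one hedged way to do so, not the paper's measurement protocol: it reports the share of variance captured by the top `rank` principal components, and compares the dominant subspace before and after randomly dropping tokens. The overlap metric (mean squared cosine of the principal angles) and all function names and parameters are assumptions introduced here for illustration.

```python
import numpy as np


def explained_variance_ratio(tokens, rank):
    """Fraction of total variance captured by the top `rank` components."""
    Xc = tokens - tokens.mean(axis=0, keepdims=True)
    energy = np.linalg.svd(Xc, compute_uv=False) ** 2
    return float(energy[:rank].sum() / energy.sum())


def dominant_basis(tokens, rank):
    """Top `rank` principal directions as orthonormal rows, shape (rank, D)."""
    Xc = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:rank]


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Both inputs must have orthonormal rows; 1.0 means identical subspaces.
    """
    cosines = np.linalg.svd(basis_a @ basis_b.T, compute_uv=False)
    return float(np.mean(cosines ** 2))


def stability_after_random_drop(tokens, rank, drop_fraction, seed=0):
    """Overlap between the full-set subspace and a random-subset subspace."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    # Keep at least rank + 1 tokens so the subset can span a rank-dim subspace.
    n_keep = min(n, max(rank + 1, int(round(n * (1.0 - drop_fraction)))))
    idx = rng.choice(n, size=n_keep, replace=False)
    return subspace_overlap(dominant_basis(tokens, rank),
                            dominant_basis(tokens[idx], rank))
```

A high `explained_variance_ratio` at small `rank`, together with a `stability_after_random_drop` value close to 1 at large `drop_fraction`, would be consistent with the stable low-rank structure the paper reports.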

Source: arXiv: 2605.15621