arXiv submission date: 2026-03-04
📄 Abstract - EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME dataset, EvoPrune achieves 2$\times$ inference speedup with less than 1% performance degradation, demonstrating its potential for latency-sensitive MLLM deployment.
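To make the layer-wise selection concrete, below is a minimal, illustrative sketch of pruning visual tokens by combining an attention-based importance score with a redundancy penalty (maximum cosine similarity to tokens already kept). The function name `prune_tokens`, the greedy selection loop, and the `alpha` weighting are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def prune_tokens(tokens, attn_scores, keep_ratio=0.5, alpha=0.5):
    """Greedily keep the top keep_ratio fraction of visual tokens.

    Each candidate is scored as:
        alpha * attention importance - (1 - alpha) * redundancy,
    where redundancy is the max cosine similarity to any token kept so far.
    This balances importance against diversity; the exact scoring rule is
    a hypothetical stand-in for EvoPrune's criteria.
    """
    n, d = tokens.shape
    # Cosine-normalize token features for similarity computation.
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    k = max(1, int(n * keep_ratio))
    kept = [int(np.argmax(attn_scores))]  # seed with the most-attended token
    candidates = set(range(n)) - set(kept)
    while len(kept) < k:
        kept_mat = normed[kept]           # (|kept|, d)
        best, best_score = None, -np.inf
        for i in candidates:
            # Penalize tokens too similar to what is already retained.
            redundancy = float(np.max(kept_mat @ normed[i]))
            score = alpha * attn_scores[i] - (1 - alpha) * redundancy
            if score > best_score:
                best, best_score = i, score
        kept.append(best)
        candidates.remove(best)
    kept.sort()
    return tokens[kept], kept
```

In an encoder, such a step would run at a few selected layers so that later layers (and the LLM decoder) only process the surviving tokens, which is where the inference savings come from.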

Top-level tag: multi-modal model training systems
Detailed tags: token pruning, efficient inference, multimodal LLMs, visual encoding, computational efficiency

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs


1️⃣ One-Sentence Summary

This paper proposes EvoPrune, a method that intelligently filters out unimportant visual tokens at the early stage of image or video processing in multimodal large models, substantially speeding up inference without noticeably degrading model performance.

Source: arXiv:2603.03681