
arXiv submission date: 2026-02-04
📄 Abstract - When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, particularly the mask-based object tokens, at test time, allowing the number of tokens to be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
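To make the token-composition idea concrete, here is a minimal PyTorch sketch of how the three levels of visual tokens could be concatenated into a single visual prefix, with an optional cap on the number of mask-based object tokens at inference time. The function name, token counts, hidden size, and truncation-based drop strategy are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of multi-level token composition as described in the abstract.
# All names, shapes, and the drop strategy are assumptions for illustration only.
from typing import Optional

import torch


def compose_visual_tokens(global_tokens: torch.Tensor,
                          patch_tokens: torch.Tensor,
                          object_tokens: torch.Tensor,
                          max_object_tokens: Optional[int] = None) -> torch.Tensor:
    """Concatenate multi-level visual features into one visual prefix for the LLM.

    global_tokens: (B, G, D)  image-level summary tokens
    patch_tokens:  (B, P, D)  local patch tokens
    object_tokens: (B, O, D)  mask-pooled object tokens
    max_object_tokens: if set, keep only this many object tokens (e.g. at test
        time); during training all tokens are kept.
    """
    if max_object_tokens is not None:
        object_tokens = object_tokens[:, :max_object_tokens, :]
    return torch.cat([global_tokens, patch_tokens, object_tokens], dim=1)


if __name__ == "__main__":
    B, D = 2, 4096                          # batch size, LLM hidden size (assumed)
    g = torch.randn(B, 1, D)                # 1 global token
    p = torch.randn(B, 36, D)               # e.g. 6x6 pooled patch tokens
    o = torch.randn(B, 32, D)               # up to 32 object tokens from masks
    train_prefix = compose_visual_tokens(g, p, o)                      # (2, 69, D)
    test_prefix = compose_visual_tokens(g, p, o, max_object_tokens=8)  # (2, 45, D)
    print(train_prefix.shape, test_prefix.shape)
```

Because the cap is applied only when composing the prefix, the same trained model can be queried with different token budgets at inference time, which is the adaptivity the abstract describes.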

Top-level tags: llm, multi-modal, model training
Detailed tags: vision-language models, token efficiency, object representation, adaptive inference, mask-based features

When LLaVA Meets Objects: Token Composition for Vision-Language-Models


1️⃣ One-sentence summary

This paper proposes a method called Mask-LLaVA that combines visual features from different levels to greatly reduce the number of tokens needed to represent an image in a vision-language model, significantly improving inference efficiency while maintaining performance.

Source: arXiv: 2602.04864