
arXiv submission date: 2026-02-04
📄 Abstract - When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, particularly the mask-based object tokens, at test time, allowing the number of tokens to be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
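To make the token-composition idea concrete, here is a minimal PyTorch sketch of how the three levels of visual tokens could be concatenated into a single visual prefix, with an optional cap on the number of mask-based object tokens at inference time. The function name, token counts, hidden size, and truncation-based drop strategy are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of multi-level token composition as described in the abstract.
# All names, shapes, and the drop strategy are assumptions for illustration only.
from typing import Optional

import torch


def compose_visual_tokens(global_tokens: torch.Tensor,
                          patch_tokens: torch.Tensor,
                          object_tokens: torch.Tensor,
                          max_object_tokens: Optional[int] = None) -> torch.Tensor:
    """Concatenate multi-level visual features into one visual prefix for the LLM.

    global_tokens: (B, G, D)  image-level summary tokens
    patch_tokens:  (B, P, D)  local patch tokens
    object_tokens: (B, O, D)  mask-pooled object tokens
    max_object_tokens: if set, keep only this many object tokens (e.g. at test
        time); during training all tokens are kept.
    """
    if max_object_tokens is not None:
        object_tokens = object_tokens[:, :max_object_tokens, :]
    return torch.cat([global_tokens, patch_tokens, object_tokens], dim=1)


if __name__ == "__main__":
    B, D = 2, 4096                          # batch size, LLM hidden size (assumed)
    g = torch.randn(B, 1, D)                # 1 global token
    p = torch.randn(B, 36, D)               # e.g. 6x6 pooled patch tokens
    o = torch.randn(B, 32, D)               # up to 32 object tokens from masks
    train_prefix = compose_visual_tokens(g, p, o)                      # (2, 69, D)
    test_prefix = compose_visual_tokens(g, p, o, max_object_tokens=8)  # (2, 45, D)
    print(train_prefix.shape, test_prefix.shape)
```

Because the cap is applied only when composing the prefix, the same trained model can be queried with different token budgets at inference time, which is the adaptivity the abstract describes.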

Top-level tags: llm, multi-modal, model training
Detailed tags: vision-language models, token efficiency, object representation, adaptive inference, mask-based features

When LLaVA Meets Objects: Token Composition for Vision-Language-Models


1️⃣ One-sentence summary

This paper proposes a method called Mask-LLaVA that combines visual features from different levels to greatly reduce the number of tokens needed to represent an image in a vision-language model, significantly improving inference efficiency while maintaining performance.

Source: arXiv: 2602.04864