arXiv submission date: 2026-03-10
📄 Abstract - Ego: Embedding-Guided Personalization of Vision-Language Models

AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
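The core idea above can be sketched in a few lines: score each visual token by how much attention it receives, then keep the top-scoring token embeddings as a "memory" of the concept. This is a minimal illustrative sketch, not the paper's implementation; the function name, the summed-attention scoring rule, and the shapes are all assumptions.

```python
import numpy as np

def select_concept_tokens(attn, vis_embeds, k=4):
    """Hypothetical sketch: rank visual tokens by the attention mass
    they receive and keep the top-k embeddings as a concept 'memory'.

    attn:       (num_query_tokens, num_visual_tokens) attention weights
    vis_embeds: (num_visual_tokens, d) visual token embeddings
    """
    scores = attn.sum(axis=0)           # total attention per visual token
    top = np.argsort(scores)[::-1][:k]  # indices of the k most-attended tokens
    return vis_embeds[top]

# Toy usage with random data standing in for real model activations.
rng = np.random.default_rng(0)
attn = rng.random((8, 16))    # 8 query tokens attending over 16 visual tokens
vis = rng.random((16, 64))    # 16 visual tokens of dimension 64
memory = select_concept_tokens(attn, vis, k=4)
print(memory.shape)           # → (4, 64)
```

At test time, such a memory could be compared (e.g., by cosine similarity) against the visual tokens of a new image to decide whether the personalized concept is present.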

Top tags: multi-modal, model training, computer vision
Detailed tags: vision-language models, personalization, attention mechanisms, concept extraction, efficient fine-tuning

Ego: Embedding-Guided Personalization of Vision-Language Models


1️⃣ One-sentence summary

This paper proposes an efficient method that lets a generic vision-language model remember and recognize a specific person or object without additional training: the model's internal attention mechanisms are used to extract key visual features that serve as a "memory" of the concept, enabling fast personalized recognition and description in subsequent tasks.

Source: arXiv:2603.09771