arXiv submission date: 2026-04-08
📄 Abstract - From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region that should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawn boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
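The abstract's key mechanism is encoding user interactions directly into the example input-output pairs, so that cues travel through the same visual channel as the rest of the prompt. The paper does not spell out the encoding, but a minimal sketch of the idea — here assumed to be painting the cue (scribble, click, or box mask) onto the input images before assembling the in-context sequence, with hypothetical helper names `encode_interaction` and `build_prompt` — might look like:

```python
import numpy as np

def encode_interaction(image, cue_mask, cue_color=(255, 0, 0)):
    """Overlay a user cue (scribble/click/box mask) onto an image.

    image:    H x W x 3 uint8 array.
    cue_mask: H x W boolean array marking the user's strokes.
    Returns a copy of the image with cue pixels painted cue_color,
    so the interaction is carried inside the visual prompt itself.
    """
    out = image.copy()
    out[cue_mask] = cue_color
    return out

def build_prompt(examples, query, query_cue):
    """Assemble an in-context prompt [(x1*, y1), ..., (xq*, None)].

    examples:  list of (input_image, cue_mask, target_image) triples.
    query:     input image to predict for.
    query_cue: user cue mask for the query.
    Cues are baked into the inputs, so the model is conditioned on
    interactions exactly like on any other visual content -- no
    architecture change or fine-tuning is needed.
    """
    pairs = [(encode_interaction(x, m), y) for x, m, y in examples]
    pairs.append((encode_interaction(query, query_cue), None))
    return pairs
```

Because the cue is just pixels in the prompt, a pretrained in-context learner can in principle be steered with interaction types it never saw during training, which matches the "unseen interactions without fine-tuning" claim above.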

Top-level tags: computer vision, multi-modal, model training
Detailed tags: in-context learning, interactive segmentation, user guidance, visual prompting, model adaptation

From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks


1️⃣ One-sentence summary

This paper proposes a simple yet effective method that transforms static visual in-context learning models, which previously could only passively consume examples, into systems that users can steer and control in real time through natural interactions such as scribbles, clicks, or drawn boxes, yielding significant gains in interactive performance on image segmentation, super-resolution, and object removal.

Source: arXiv 2604.06748