arXiv submission date: 2026-05-25
📄 Abstract - CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning

Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
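The text-space task routing idea in the abstract can be illustrated with a minimal sketch. It assumes each task is represented by a list of frozen text prototype vectors (one per class) and that an image is routed to the task whose best-matching prototype has the highest cosine similarity. The max-over-classes aggregation, the function names, and the toy 3-dimensional vectors are all illustrative assumptions; the abstract only states that routing uses cosine similarity to frozen CLIP text prototypes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def route_task(image_feature, task_text_prototypes):
    """Pick the task whose closest text prototype best matches the image.

    task_text_prototypes maps a task id to a list of prototype vectors.
    Scoring a task by its best class prototype is an assumption made
    for this sketch, not a detail taken from the paper.
    """
    best_task, best_score = None, -math.inf
    for task_id, prototypes in task_text_prototypes.items():
        score = max(cosine(image_feature, p) for p in prototypes)
        if score > best_score:
            best_task, best_score = task_id, score
    return best_task, best_score


# Toy stand-ins for frozen text embeddings (hypothetical values).
prototypes = {
    "task_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "task_b": [[0.0, 1.0, 0.0], [0.0, 0.6, 0.8]],
}
image = [0.1, 0.9, 0.3]

task, score = route_task(image, prototypes)
print(task, round(score, 3))
```

Because the prototypes are fixed vectors and each task is scored independently, the routing decision in this sketch does not depend on the order in which tasks were added, and it introduces no trainable parameters, which matches the properties the abstract claims for the routing step.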

Top-level tags: machine learning, multi-modal
Detailed tags: task-incremental learning, vision-language models, cross-modal prompting, continual learning, CLIP

CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning


1️⃣ One-sentence summary

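The paper proposes CMAP, a method that exploits both the text and image information in CLIP so that a model learning new visual tasks retains earlier knowledge while automatically identifying the current task, and that substantially improves task-incremental learning accuracy across 11 datasets with very few trainable parameters.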

Source: arXiv: 2605.25708