arXiv submission date: 2026-02-23
📄 Abstract - StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: this https URL.
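The combined objective described above can be sketched as a sum of CLIP-style InfoNCE terms: the standard image-text alignment plus the three structure-centric losses (i)–(iii). The sketch below is a minimal NumPy illustration, not the authors' implementation; the function names, the uniform loss weights, and the assumption that each auxiliary term is a symmetric InfoNCE over precomputed embeddings are all hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE (CLIP-style) contrastive loss between two batches
    of embeddings whose rows with equal index form the positive pairs."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Log-softmax over each row; positives sit on the diagonal.
    lp_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (np.mean(np.diag(lp_ab)) + np.mean(np.diag(lp_ba)))

def structxlip_loss(img_emb, txt_emb,
                    edge_emb, struct_txt_emb,
                    region_emb, chunk_emb,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Total objective: standard alignment plus the three structure-centric
    terms from the abstract. Weights are hypothetical placeholders."""
    w0, w1, w2, w3 = weights
    total = w0 * info_nce(img_emb, txt_emb)            # standard CLIP alignment
    total += w1 * info_nce(edge_emb, struct_txt_emb)   # (i) edge map <-> structural text
    total += w2 * info_nce(region_emb, chunk_emb)      # (ii) local edge regions <-> text chunks
    total += w3 * info_nce(edge_emb, img_emb)          # (iii) edge map <-> color image (anti-drift)
    return total
```

In practice the edge maps feeding the edge encoder would come from a classical detector such as Canny, and term (ii) would operate on region-level crops of the edge map paired with chunked caption text rather than whole-image embeddings.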

Top-level tags: multi-modal, model training, computer vision
Detailed tags: vision-language alignment, cross-modal retrieval, structural representation, fine-tuning, edge maps

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues


1️⃣ One-sentence summary

This paper proposes a fine-tuning method called StructXLIP that extracts image edge maps and structural information from text and aligns them explicitly, markedly improving vision-language models on detail-rich cross-modal retrieval tasks by guiding them toward more robust, semantically stable representations.

Source: arXiv 2602.20089