arXiv submission date: 2026-03-24
📄 Abstract - Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitation of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancies across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends the key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended to, since query features are produced through interactions among the inner-window patches and thus lack semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating the tokens most similar to a given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale, dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at this https URL.
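The cross-window mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact formulation: the `top_k` aggregation rule for the proxy anchor and all function names here are assumptions, and the dynamic normalization scheme is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(q_inner, kv_windows, top_k=4):
    """Attend from inner-window queries to tokens of ALL windows.

    q_inner:    (Nq, d) query tokens of the current window
    kv_windows: list of (Ni, d) token arrays, one per window
                (including the current one)
    top_k:      tokens aggregated into the proxy anchor per query
                (hypothetical choice, not from the paper)
    """
    d = q_inner.shape[1]
    # Extend key-value tokens to every window, not just the local one.
    kv_all = np.concatenate(kv_windows, axis=0)           # (Ntot, d)
    # Proxy anchor: average of each query's top-k most similar tokens
    # gathered across all windows.
    sim = q_inner @ kv_all.T                              # (Nq, Ntot)
    idx = np.argsort(-sim, axis=1)[:, :top_k]             # (Nq, k)
    anchor = kv_all[idx].mean(axis=1)                     # (Nq, d)
    # Score tokens against the anchor instead of the raw query, so
    # inner- and outer-window patches share one semantic reference
    # and the window bias toward local tokens is reduced.
    logits = anchor @ kv_all.T / np.sqrt(d)               # (Nq, Ntot)
    attn = softmax(logits, axis=-1)
    return attn @ kv_all                                  # (Nq, d)

# Example: three 5-token windows with 8-dim features.
rng = np.random.default_rng(0)
windows = [rng.standard_normal((5, 8)) for _ in range(3)]
out = global_local_attention(windows[0], windows)
```

The key design point is that only the keys and values are globalized; queries stay local, which is what creates the window bias the proxy anchor is meant to correct.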

Top-level tags: computer vision, model evaluation, natural language processing
Detailed tags: open-vocabulary segmentation, clip, semantic segmentation, training-free, attention mechanism

Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation


1️⃣ One-sentence summary

This work proposes a new framework called GLA-CLIP that lets different image windows exchange information and introduces a unified semantic reference point, resolving the semantic inconsistency caused by independent window processing in existing training-free image segmentation methods and thereby substantially improving segmentation accuracy.

Source: arXiv 2603.23030