arXiv submission date: 2026-05-27
📄 Abstract - SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.
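To make the "1.72% trainable parameters relative to the VFM backbone" figure concrete, here is a minimal sketch of the arithmetic behind such a ratio. The backbone size of 100 million parameters and the helper names `trainable_ratio` and `adapter_budget` are assumptions for illustration only; the abstract does not state parameter counts.

```python
def trainable_ratio(adapter_params: int, backbone_params: int) -> float:
    """Trainable adapter parameters as a percentage of the frozen backbone."""
    if backbone_params <= 0:
        raise ValueError("backbone_params must be positive")
    return 100.0 * adapter_params / backbone_params


def adapter_budget(backbone_params: int, ratio_percent: float) -> int:
    """Trainable parameter count that corresponds to a given percentage."""
    return round(backbone_params * ratio_percent / 100.0)


# Hypothetical backbone size (an assumption, not a number from the paper).
BACKBONE_PARAMS = 100_000_000

# Parameter budget implied by the 1.72% figure for this assumed backbone.
budget = adapter_budget(BACKBONE_PARAMS, 1.72)
print(f"adapter budget: {budget:,} parameters")
print(f"ratio: {trainable_ratio(budget, BACKBONE_PARAMS):.2f}%")
```

Under this assumption, the adapter would hold roughly 1.7 million trainable parameters while the backbone weights stay frozen. The real count depends on which backbone is used.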

Top-level tags: computer vision, model training, model evaluation
Detailed tags: parameter-efficient fine-tuning, vision foundation model, dense prediction, scale-adaptive fusion, semantic modulation

SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation


1️⃣ One-sentence summary

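This paper proposes SIGMA, a lightweight parameter-efficient fine-tuning method that introduces two modules, scale-adaptive fusion and semantic modulation, to address the structural and distributional mismatches, respectively, that vision foundation models face in dense prediction tasks. With only 1.72% trainable parameters, it outperforms existing mainstream methods.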

Source: arXiv: 2605.27893