MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
1️⃣ One-Sentence Summary
This paper proposes a general method called MuRF that requires no additional training: at inference time, the same image is fed into a frozen vision foundation model at multiple resolutions and the resulting features are fused. This markedly improves performance across a wide range of vision tasks because it combines the global semantic understanding of low-resolution views with the fine-grained detail recognition of high-resolution views.
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, but also demonstrating successful generalization to contrastive models like SigLIP2.
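The core idea described above — run a frozen model on the same image at several resolutions and fuse the resulting features — can be sketched in a few lines. This is a minimal, dependency-free illustration only: the `dummy_vfm`, `resize`, and `murf_features` names and the average-fusion step are illustrative assumptions, not the authors' actual implementation or fusion rule.

```python
# Minimal sketch of the Multi-Resolution Fusion (MuRF) idea.
# All helper names here are hypothetical stand-ins, not the paper's code.

def resize(image, size):
    """Nearest-neighbor resize of a square 2D image (list of lists)."""
    h = len(image)
    w = len(image[0])
    return [
        [image[int(r * h / size)][int(c * w / size)] for c in range(size)]
        for r in range(size)
    ]

def dummy_vfm(image):
    """Stand-in for a frozen VFM: mean intensity of each quadrant,
    so the 'feature vector' has a fixed length at any input resolution."""
    n = len(image)
    half = n // 2
    quads = [
        [row[:half] for row in image[:half]],   # top-left
        [row[half:] for row in image[:half]],   # top-right
        [row[:half] for row in image[half:]],   # bottom-left
        [row[half:] for row in image[half:]],   # bottom-right
    ]
    return [sum(sum(r) for r in q) / (half * half) for q in quads]

def murf_features(image, resolutions=(4, 8, 16)):
    """Run the frozen model at several resolutions and fuse by averaging.
    No parameters are updated anywhere: the model stays frozen."""
    feats = [dummy_vfm(resize(image, s)) for s in resolutions]
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
```

In a real setting, `dummy_vfm` would be a frozen DINOv2 or SigLIP2 forward pass and the fusion could be any training-free combination of the per-resolution feature maps; the key property the sketch preserves is that only inference-time resizing and fusing are involved.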
Source: arXiv: 2603.25744