arXiv submission date: 2026-02-12
📄 Abstract - Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in on and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA items spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global--regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at this https URL.
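The abstract's data pipeline can be sketched roughly as follows. This is a minimal, hypothetical illustration based only on the abstract: a teacher model is shown a zoomed-in crop and produces a VQA pair, which is then attached to the full image for student training. The function names, the stubbed teacher, and the crop representation are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Region-to-Image Distillation data generation.
# A teacher sees a micro-cropped region and writes a VQA pair; the pair
# is then paired with the FULL image, so the student learns to answer
# region-level questions in a single glance. Teacher is a stub here.

from dataclasses import dataclass


@dataclass
class VQAExample:
    image: str          # id/path of the full image the student will see
    question: str
    answer: str


def crop(image: str, box: tuple) -> str:
    # Placeholder: in practice this would return a micro-cropped sub-image.
    x0, y0, x1, y1 = box
    return f"{image}#crop({x0},{y0},{x1},{y1})"


def teacher_vqa(region_image: str) -> tuple:
    # Stub for a strong teacher MLLM prompted on the zoomed-in region;
    # the real system would call a model here.
    return (f"What is shown in {region_image}?", "<teacher answer>")


def build_distillation_set(image: str, boxes: list) -> list:
    examples = []
    for box in boxes:
        q, a = teacher_vqa(crop(image, box))      # supervision from the region...
        examples.append(VQAExample(image, q, a))  # ...distilled back to the full image
    return examples


dataset = build_distillation_set("scene.jpg", [(10, 10, 60, 40)])
```

The key design point the abstract emphasizes is that the zooming happens only at data-construction time; the resulting examples pair region-grounded questions with the untouched full image, so no tool calls occur at inference.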

Top-level tags: multi-modal model training, model evaluation
Detailed tags: multimodal llms, fine-grained perception, knowledge distillation, visual question answering, benchmark

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception


1️⃣ One-sentence summary

This paper proposes a new training method called "Region-to-Image Distillation" that gives multimodal large language models strong fine-grained visual perception within a single forward pass, avoiding the high latency of conventional approaches that repeatedly zoom into image regions at inference time.

Source: arXiv:2602.11858