arXiv submission date: 2026-04-08
📄 Abstract - Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at this https URL.
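The coarse-to-fine routing described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the scalar `gate_score` stands in for the output of the Dynamic Gating Network, the 2-D `relevance` grid stands in for the SD-RPN's output, and every function name, threshold, and the bounding-box rule are assumptions made here for illustration only.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def needs_zoom(gate_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical gating rule: request high-resolution processing
    only when the gate score reaches the threshold."""
    return gate_score >= threshold


def roi_from_relevance(relevance: List[List[float]],
                       rel_threshold: float = 0.5) -> Optional[Box]:
    """Return the bounding box (in grid cells) around every cell whose
    relevance is at least rel_threshold times the peak relevance.
    This simple rule is an assumption, not the SD-RPN itself."""
    peak = max(max(row) for row in relevance)
    if peak <= 0:
        return None  # nothing looks relevant, so there is no region to zoom into
    cutoff = rel_threshold * peak
    rows = [r for r, row in enumerate(relevance) if any(v >= cutoff for v in row)]
    cols = [c for row in relevance for c, v in enumerate(row) if v >= cutoff]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def grid_box_to_pixels(box: Box, grid_size: Tuple[int, int],
                       image_size: Tuple[int, int]) -> Box:
    """Scale a grid-cell box to pixel coordinates.
    grid_size is (cols, rows); image_size is (width, height)."""
    left, top, right, bottom = box
    cols, rows = grid_size
    width, height = image_size
    return (left * width // cols, top * height // rows,
            right * width // cols, bottom * height // rows)


def plan_perception(gate_score: float, relevance: List[List[float]],
                    image_size: Tuple[int, int],
                    gate_threshold: float = 0.5) -> Optional[Box]:
    """Return None when the coarse global view is assumed to suffice,
    otherwise the pixel box to re-encode at high resolution."""
    if not needs_zoom(gate_score, gate_threshold):
        return None
    grid_box = roi_from_relevance(relevance)
    if grid_box is None:
        return None
    grid_size = (len(relevance[0]), len(relevance))
    return grid_box_to_pixels(grid_box, grid_size, image_size)
```

In this sketch, a low gate score skips the zoom step entirely, while a high score yields a single crop box that a caller could re-encode at high resolution and combine with the coarse global view.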

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: adaptive perception, efficient inference, query-aware, visual tokens, sparse attention

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models


1️⃣ One-sentence summary

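This paper proposes Q-Zoom, a method that lets multimodal large language models processing high-resolution images selectively zoom in on the regions relevant to the question at hand, much as a person would, which substantially increases processing speed while maintaining or even improving accuracy.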

Source: arXiv: 2604.06912