HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
1️⃣ One-sentence summary
This paper introduces HyperVL, a new efficient multimodal large model. Through a novel Visual Resolution Compressor and Dual Consistency Learning, it sharply reduces compute and memory overhead while preserving strong image-understanding ability, making it possible to deploy complex multimodal AI applications on resource-constrained edge devices such as smartphones.
Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution images. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
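The abstract does not detail how the VRC chooses a resolution. As a minimal sketch of the general idea (a predictor maps each image to the cheapest encoding resolution that preserves enough detail), the heuristic, function names, and candidate resolutions below are illustrative assumptions, not HyperVL's actual method:

```python
# Hypothetical VRC-style resolution selection. All names and the
# detail heuristic are assumptions for illustration; the paper's
# compressor is a learned predictor, not this hand-written rule.

def detail_score(pixels):
    """Crude detail proxy: mean absolute difference between
    neighboring pixel values (flattened grayscale)."""
    diffs = [abs(a - b) for a, b in zip(pixels, pixels[1:])]
    return sum(diffs) / max(len(diffs), 1)

def select_resolution(pixels, candidates=(224, 448, 896), threshold=8.0):
    """Pick the smallest candidate resolution whose detail budget
    covers the image's score; low-detail images get compressed more."""
    score = detail_score(pixels)
    for i, res in enumerate(candidates):  # candidates sorted ascending
        if score <= threshold * (i + 1):
            return res
    return candidates[-1]

flat = [10, 10, 10, 10, 10]        # uniform content -> low detail
busy = [0, 255, 0, 255, 0, 255]    # alternating content -> high detail
print(select_resolution(flat))     # small resolution suffices
print(select_resolution(busy))     # large resolution needed
```

A learned VRC would replace `detail_score` with a lightweight network trained to predict the smallest resolution at which downstream answers stay correct, which is what allows it to skip redundant ViT computation.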
Source: arXiv:2512.14052