arXiv submission date: 2025-12-16
📄 Abstract - HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
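The control flow the abstract describes — tile the image to bound peak memory, let the VRC pick a cheap-enough encoding resolution per input, and route to the matching visual branch — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`Tile`, `predict_resolution`, `tile_image`, `encode`), the candidate resolutions, and the detail heuristic are assumptions for exposition.

```python
# Hypothetical sketch of HyperVL-style adaptive visual encoding.
# Names, resolutions, and the detail score are illustrative assumptions,
# not the paper's actual API or values.

from dataclasses import dataclass

@dataclass
class Tile:
    width: int
    height: int
    detail: float  # assumed per-tile complexity score in [0, 1]

# Assumed candidate encoding resolutions for the multi-scale ViT branches.
RESOLUTIONS = (224, 448)

def predict_resolution(tile: Tile) -> int:
    """Stand-in for the Visual Resolution Compressor (VRC): choose the
    cheapest candidate resolution sufficient for the tile's detail."""
    return RESOLUTIONS[1] if tile.detail > 0.5 else RESOLUTIONS[0]

def tile_image(width: int, height: int, tile_size: int = 448) -> list[Tile]:
    """Image-tiling step: split the input into fixed-size tiles so peak
    memory is bounded by one tile's encoding cost, not the full image's."""
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            w = min(tile_size, width - x)
            h = min(tile_size, height - y)
            # A real system would score detail with a lightweight predictor;
            # here we fake it from relative tile area for illustration.
            tiles.append(Tile(w, h, detail=(w * h) / (tile_size ** 2)))
    return tiles

def encode(tiles: list[Tile]) -> list[tuple[Tile, int]]:
    """Route each tile to the visual branch matching its predicted
    resolution; in the paper, DCL training is what lets both branches
    feed a single shared LLM interchangeably."""
    return [(tile, predict_resolution(tile)) for tile in tiles]

# Example: a wide image splits into two full tiles encoded at high
# resolution, while a small image falls back to the cheap branch.
wide = encode(tile_image(896, 448))
small = encode(tile_image(300, 300))
```

The point of the sketch is the decision structure, not the heuristic: because resolution is chosen per tile before encoding, redundant high-resolution computation is skipped for low-detail regions, and the shared-LLM design means the branch choice is invisible downstream.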

Top-level tags: multi-modal llm systems
Detailed tags: edge computing, efficient inference, vision-language model, model compression, dynamic resolution

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices


1️⃣ One-sentence summary

This paper proposes HyperVL, a new efficient multimodal large model that, through a novel Visual Resolution Compressor and Dual Consistency Learning, preserves strong image-understanding capability while sharply reducing compute and memory overhead, successfully bringing complex multimodal AI applications to resource-constrained edge devices such as mobile phones.


Source: arXiv 2512.14052