arXiv submission date: 2026-03-17
📄 Abstract - Empirical Recipes for Efficient and Compact Vision-Language Models

Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
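The abstract's headline metric, time to first token (TTFT), is the delay between issuing a request and receiving the first generated token, which for VLMs is dominated by vision encoding and prompt prefill. A minimal sketch of how TTFT can be measured against a streaming decoder is below; `fake_token_stream` and `measure_ttft` are hypothetical stand-ins, not APIs from the paper or any serving framework.

```python
import time
from typing import Iterator, Tuple

def fake_token_stream() -> Iterator[str]:
    """Stand-in for a streaming VLM: simulate prefill, then emit tokens."""
    time.sleep(0.05)  # simulated prefill (vision encoding + prompt processing)
    for tok in ["A", " cat", " on", " a", " mat"]:
        time.sleep(0.005)  # simulated per-token decode step
        yield tok

def measure_ttft(stream: Iterator[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    first = next(stream)  # blocks through prefill until the first token
    return time.perf_counter() - start, first

ttft, first_tok = measure_ttft(fake_token_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, first token: {first_tok!r}")
```

In a real benchmark the same pattern applies: start a wall-clock timer at request submission and stop it when the first streamed token is received, averaging over many requests.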

Top-level tags: multi-modal model training systems
Detailed tags: vision-language models, inference efficiency, latency optimization, compact models, model serving

Empirical Recipes for Efficient and Compact Vision-Language Models


1️⃣ One-sentence summary

Through systematic profiling, this paper finds that the real-world inference speed of compact vision-language models falls well short of what their parameter counts suggest. Based on this analysis, it proposes a set of practical optimization recipes that substantially reduce response latency without sacrificing accuracy, and it further shows how to extend such compact models with structured visual perception capabilities.

Source: arXiv: 2603.16987