arXiv submission date: 2026-04-06
📄 Abstract - Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, degrading their performance on fine-grained reasoning tasks and limiting their effectiveness in real-world settings. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space model's hidden states via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: this https URL
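The abstract does not spell out how the Token-Grid Correlation Module works internally, but one plausible reading is: score each text token against every image patch, pool patch features per token, and use the pooled features to produce FiLM parameters (a per-channel scale and shift) that modulate the decoder's hidden states. The NumPy sketch below illustrates that reading; all function names, the residual `1 + gamma` form, and the projection matrices `W_gamma`/`W_beta` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_grid_correlation(text, patches):
    """Lightweight scaled dot-product correlation.

    text:    (L, d) text token embeddings
    patches: (N, d) image patch embeddings
    returns: (L, N) row-stochastic correlation map
    """
    d = text.shape[-1]
    scores = text @ patches.T / np.sqrt(d)  # (L, N)
    return softmax(scores, axis=-1)

def film_modulate(hidden, corr, patches, W_gamma, W_beta):
    """FiLM-style conditioning of hidden states on pooled visual features.

    hidden:           (L, d) decoder hidden states (one per text token)
    corr:             (L, N) correlation map from token_grid_correlation
    patches:          (N, d) image patch embeddings
    W_gamma, W_beta:  (d, d) hypothetical projections to FiLM parameters
    """
    visual = corr @ patches            # (L, d) per-token pooled visual signal
    gamma = visual @ W_gamma           # per-channel scale
    beta = visual @ W_beta             # per-channel shift
    # residual form: identity modulation when gamma and beta are zero
    return (1.0 + gamma) * hidden + beta
```

Both steps are plain matrix products over a fixed patch grid, so the cost is linear in the text length, consistent with the linear-time inference claim; the softmax is over patches per token, not token-to-token attention.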

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: vision-language model, efficient inference, cross-modality, state-space model, fine-grained reasoning

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation


1️⃣ One-Sentence Summary

This paper proposes Firebolt-VL, an efficient vision-language model whose novel cross-modality modulation mechanism lets it attend more precisely to image details relevant to the text while keeping linear computational complexity, enabling fast and accurate vision-language understanding on resource-constrained devices.

Source: arXiv:2604.04579