SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models

📄 Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal understanding, yet their enormous parameter scale and cross-modal computation incur substantial memory and latency overhead, severely limiting real-world deployment on resource-constrained devices. Binarization offers an attractive solution by drastically reducing storage and computational costs. However, existing binarization methods neglect the varying importance of weights across different layers and modalities. As a result, parameters irrelevant to downstream tasks are unnecessarily retained, while modality-critical weights may not be adequately optimized, leading to significant performance degradation. To address these challenges, we develop Significance-Aware Binarization for Large Vision-Language Models (SAB-LVLM). Specifically, after constructing Hessian matrices for textual and visual inputs, we propose a spatial significance map to distinguish full-precision weights activated under a single modality from those activated across modalities. We then devise a modality-guided integration strategy to obtain the significance-aware binarization map, which measures weight significance across layers and modalities. Subsequently, this binarization map is incorporated into the binarization objective as an error reweighting term, and binarization fitting is performed through an alternating significance-weighted update scheme. Extensive experiments show that SAB-LVLM outperforms existing binary post-training quantization (PTQ) methods under an approximately 1-bit compression constraint. Our code is accessible at this https URL.
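
The reweighting idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; the abstract does not give the exact formulas, so everything here is an assumption made for illustration: the proxy Hessian is taken as the diagonal of X^T X over calibration inputs, the two modalities are blended with a hypothetical mixing weight `lam`, and each weight row is fitted as `alpha * B + mu` by alternating closed-form updates that minimize the significance-weighted squared error. All function names (`hessian_diag`, `significance_map`, `weighted_binarize`, `weighted_error`) are hypothetical.

```python
import numpy as np


def hessian_diag(X):
    """Diagonal of the proxy Hessian X^T X / n for inputs X of shape (n, d_in)."""
    return np.einsum("nd,nd->d", X, X) / X.shape[0]


def significance_map(X_text, X_vis, lam=0.5, eps=1e-8):
    """Blend per-modality Hessian diagonals into one positive weight per input column.

    Assumption: a simple convex combination stands in for the paper's
    modality-guided integration strategy, which the abstract does not specify.
    """
    h_text = hessian_diag(X_text)
    h_vis = hessian_diag(X_vis)
    # Normalize each modality so neither dominates purely because of its scale.
    h_text = h_text / (h_text.sum() + eps)
    h_vis = h_vis / (h_vis.sum() + eps)
    return lam * h_text + (1.0 - lam) * h_vis + eps


def weighted_binarize(W, s, n_iters=10):
    """Fit W ~ alpha * B + mu row by row under column weights s.

    W: (d_out, d_in) full-precision weights; s: (d_in,) positive significance.
    Returns B in {-1, +1}, and per-row alpha and mu of shape (d_out, 1).
    """
    S = np.broadcast_to(s, W.shape)
    s_sum = S.sum(axis=1, keepdims=True)
    mu = (S * W).sum(axis=1, keepdims=True) / s_sum  # weighted row mean
    for _ in range(n_iters):
        R = W - mu
        # Given mu: the sign pattern and weighted mean magnitude are jointly optimal.
        B = np.where(R >= 0, 1.0, -1.0)
        alpha = (S * np.abs(R)).sum(axis=1, keepdims=True) / s_sum
        # Given alpha and B: the optimal shift is the weighted mean residual.
        mu = (S * (W - alpha * B)).sum(axis=1, keepdims=True) / s_sum
    return B, alpha, mu


def weighted_error(W, W_hat, s):
    """Significance-weighted squared reconstruction error."""
    return float((s * (W - W_hat) ** 2).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 32))
    s = significance_map(rng.normal(size=(64, 32)), rng.normal(size=(64, 32)))
    B, alpha, mu = weighted_binarize(W, s)
    print("weighted error:", weighted_error(W, alpha * B + mu, s))
```

Because each step is an exact minimizer of the weighted error with the other variables held fixed, the error cannot increase from one iteration to the next; columns with larger significance pull `alpha` and `mu` toward values that reconstruct them more accurately.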


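1️⃣ One-Sentence Summary

This paper proposes SAB-LVLM, a method that analyzes how important weights are across layers and modalities under visual and textual inputs and uses that information to selectively compress large vision-language models to a 1-bit binary representation, substantially reducing storage and compute costs while preserving model performance and making the models suitable for devices with limited memory and compute.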
