CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
1️⃣ One-Sentence Summary
This paper proposes a novel idea: converting source code into images so that vision language models can understand it. This compresses the computation needed to process code by up to 8x while maintaining high accuracy, offering a new direction for addressing the efficiency bottleneck large models face on large-scale code.
Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and the associated computational cost. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suited to compression: by adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and the current limitations of MLLMs in code understanding, pointing toward image-modality code representation as a pathway to more efficient inference.
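The intuition behind the token savings can be sketched with back-of-the-envelope arithmetic. The specific numbers below (characters per text token, pixels per character cell, vision-encoder patch size) are illustrative assumptions, not figures from the paper: text token count grows linearly with code length, while the token count of a rendered image falls quadratically as the image is downscaled.

```python
# Illustrative sketch of why rendered-code images compress well.
# Assumed (not from the paper): ~4 chars per text token, a vision
# encoder emitting one token per 28x28 pixel patch, and a monospace
# rendering at roughly 8x16 pixels per character cell.

def text_tokens(code: str, chars_per_token: float = 4.0) -> int:
    """Rough text-modality token count."""
    return max(1, round(len(code) / chars_per_token))

def image_tokens(code: str, downscale: float = 1.0,
                 char_w: int = 8, char_h: int = 16, patch: int = 28) -> int:
    """Rough image-modality token count after downscaling the render."""
    lines = code.splitlines() or [""]
    width = max(len(line) for line in lines) * char_w / downscale
    height = len(lines) * char_h / downscale
    cols = max(1, -(-int(width) // patch))   # ceiling division
    rows = max(1, -(-int(height) // patch))
    return cols * rows

# A synthetic code snippet: 40 tiny functions, 80 lines in total.
snippet = "\n".join(f"def f{i}(x):\n    return x + {i}" for i in range(40))

t = text_tokens(snippet)
for scale in (1.0, 2.0, 3.0):
    v = image_tokens(snippet, downscale=scale)
    print(f"downscale {scale:.0f}x: {t} text tokens vs {v} image tokens "
          f"({t / v:.1f}x compression)")
```

Under these assumptions, halving the rendered resolution roughly quarters the image token count, which is why moderate downscaling can reach the multi-fold compression ratios the paper reports while the code stays legible to a vision encoder.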
Source: arXiv: 2602.01785