arXiv submission date: 2026-03-11
📄 Abstract - GLM-OCR Technical Report

GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
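The Multi-Token Prediction (MTP) idea mentioned above can be sketched schematically. Everything in the snippet below is hypothetical (the stand-in decoder, the function names); it only illustrates why emitting k tokens per forward pass cuts the number of decode steps roughly by a factor of k while producing the same sequence.

```python
def fake_next_tokens(context, k):
    """Hypothetical stand-in for a decoder forward pass: returns the
    next k token ids after `context` (here just a counter sequence,
    standing in for deterministic OCR output)."""
    start = context[-1] + 1 if context else 0
    return list(range(start, start + k))

def decode(prompt, target_len, tokens_per_step):
    """Generate until `target_len` tokens, emitting `tokens_per_step`
    tokens per forward pass. Returns (sequence, number_of_steps)."""
    seq = list(prompt)
    steps = 0
    while len(seq) < target_len:
        preds = fake_next_tokens(seq, tokens_per_step)
        seq.extend(preds[: target_len - len(seq)])  # trim any overshoot
        steps += 1
    return seq, steps

seq_ar, steps_ar = decode([0], 17, tokens_per_step=1)    # autoregressive baseline
seq_mtp, steps_mtp = decode([0], 17, tokens_per_step=4)  # MTP-style, k = 4
assert seq_ar == seq_mtp   # identical output sequence
assert steps_ar == 16      # one token per forward pass
assert steps_mtp == 4      # ceil(16 / 4) forward passes
```

In the real model the k prediction heads share parameters with the main decoder, which is how the throughput gain comes with little extra memory; this toy version only captures the step-count arithmetic.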

Top-level tags: multi-modal natural language processing, computer vision
Detailed tags: document understanding, optical character recognition, multimodal model, layout analysis, structured generation

GLM-OCR Technical Report


1️⃣ One-sentence summary

This paper introduces GLM-OCR, an efficient lightweight multimodal model that pairs a visual encoder with a language decoder and adds a novel multi-token prediction mechanism, achieving strong recognition and understanding of text, formulas, and tables in documents at low computational cost, making it well suited for real-world deployment.

Source: arXiv 2603.10910