arXiv submission date: 2026-03-25
📄 Abstract - Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
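The Realistic Scene Synthesis idea described above (composing layout templates with document elements to produce full-page images paired with structured end-to-end targets) can be illustrated with a toy sketch. Everything below is a minimal assumption-laden mock-up: the element pools, template format, and tagged-markup target are hypothetical stand-ins, not the paper's actual pipeline.

```python
import random

# Hypothetical content pools; the paper's real element sources are not
# specified in the abstract.
ELEMENT_POOLS = {
    "heading": ["Introduction", "Method", "Results"],
    "paragraph": ["Lorem ipsum dolor sit amet.", "Sed do eiusmod tempor."],
    "table": ["| a | b |\n| 1 | 2 |"],
}

# A layout template here is just an ordered list of element types,
# standing in for one full-page layout.
LAYOUT_TEMPLATES = [
    ["heading", "paragraph", "table", "paragraph"],
    ["heading", "paragraph", "paragraph"],
]

def synthesize_page(rng: random.Random) -> dict:
    """Compose one full-page sample: the region list (image side) plus a
    structured end-to-end target, rendered here as simple tagged markup."""
    template = rng.choice(LAYOUT_TEMPLATES)
    regions, target_parts = [], []
    for kind in template:
        content = rng.choice(ELEMENT_POOLS[kind])
        regions.append({"type": kind, "content": content})
        target_parts.append(f"<{kind}>{content}</{kind}>")
    return {"regions": regions, "target": "\n".join(target_parts)}

sample = synthesize_page(random.Random(0))
print(sample["target"])
```

In a real pipeline the region list would drive a renderer that produces the page image, while the tagged target serves as the model's decoding supervision; this sketch only shows the composition step.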

Top-level tags: multi-modal model training data
Detailed tags: document parsing scene synthesis benchmark multimodal llm end-to-end training

Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training


1️⃣ One-Sentence Summary

This paper proposes a new approach that combines large-scale synthetic data with targeted training strategies, effectively addressing the structural confusion and content errors that existing models exhibit when parsing complex real-world documents, and significantly improving the accuracy and robustness of document parsing.

Source: arXiv 2603.23885