Abstract - Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents
Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmark multiple open-source layout detection models and evaluate both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at this https URL and the source code is available at this https URL.
Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents
1️⃣ One-Sentence Summary
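This paper builds a benchmark dataset of institutional documents, including humanitarian reports and World Bank policy documents, and systematically evaluates multiple open-source layout detection models on extracting reusable analytical information from figures and tables (that is, "data snapshots"). The models perform well on conventional academic documents, but on real-world institutional documents they tend to confuse analytical with non-analytical content, split composite figures and tables, and omit necessary context, revealing a significant gap between generic document layout analysis and practical data extraction.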
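
The abstract does not say which metrics the benchmark uses for "spatial extraction quality". As a minimal sketch only, the snippet below shows two common box-overlap measures that could capture the failure modes described: intersection over union (IoU), and the fraction of the annotated region that a prediction covers. The function names, the `(x0, y0, x1, y1)` box convention, and the example coordinates are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of a predicted and an annotated box."""
    inter = intersection(pred, truth)
    union = area(pred) + area(truth) - inter
    return inter / union if union > 0 else 0.0


def coverage(pred: Box, truth: Box) -> float:
    """Fraction of the annotated region captured by the prediction."""
    truth_area = area(truth)
    return intersection(pred, truth) / truth_area if truth_area > 0 else 0.0


# Illustrative values: the annotation spans a chart plus its caption,
# while the prediction captures only the chart (the top half).
annotated = (0.0, 0.0, 100.0, 100.0)
predicted = (0.0, 0.0, 100.0, 50.0)
print(iou(predicted, annotated), coverage(predicted, annotated))
```

A prediction that drops the caption or legend scores low on coverage even when it sits tightly around the chart itself, which is one way the "incomplete extraction of contextual information" failure mode could be quantified.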