arXiv submission date: 2026-01-16

📄 Abstract - PubMed-OCR: PMC Open Access OCR Annotations

PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
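To make the idea of word-level boxes and coordinate-grounded lookup concrete, here is a minimal Python sketch. The abstract does not spell out the released schema, so the field names (`page`, `words`, `text`, `bbox`) and the `[x0, y0, x1, y1]` pixel layout are assumptions for illustration only, as are the helper functions `words_in_region` and `union_box`.

```python
import json

# Hypothetical page record. The keys and the [x0, y0, x1, y1] box layout are
# assumptions for illustration, not the schema released with PubMed-OCR.
PAGE_JSON = """
{
  "page": 1,
  "words": [
    {"text": "PubMed-OCR", "bbox": [10, 10, 90, 24]},
    {"text": "corpus", "bbox": [95, 10, 140, 24]},
    {"text": "Abstract", "bbox": [10, 40, 70, 54]}
  ]
}
"""


def words_in_region(page, region):
    """Return the words whose boxes lie fully inside the query region."""
    rx0, ry0, rx1, ry1 = region
    return [
        w for w in page["words"]
        if w["bbox"][0] >= rx0 and w["bbox"][1] >= ry0
        and w["bbox"][2] <= rx1 and w["bbox"][3] <= ry1
    ]


def union_box(boxes):
    """Return the smallest box that covers every input box."""
    return [
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    ]


page = json.loads(PAGE_JSON)

# Coordinate-grounded lookup: which words fall in the top strip of the page?
top_words = words_in_region(page, [0, 0, 200, 30])
top_text = [w["text"] for w in top_words]

# A line-level box can be derived as the union of its word boxes.
line_box = union_box([w["bbox"] for w in top_words])
```

Deriving a line box as the union of its word boxes is only one simple heuristic; the abstract notes that the corpus relies on heuristic line reconstruction but does not say which heuristic is used.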

Top-level tags: data, natural language processing, medical

PubMed-OCR: An OCR Annotation Dataset of Scientific Articles from PubMed Central Open Access PDFs / PubMed-OCR: PMC Open Access OCR Annotations

1️⃣ One-sentence summary

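This paper releases PubMed-OCR, a large-scale dataset that uses automatic annotation to extract text and its precise on-page position from more than 200,000 open-access scientific article PDFs, with the aim of supporting research on and applications of AI models that need to understand document layout.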


Source: arXiv: 2601.11425
