Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis

📄 Abstract

Sinhala is a morphologically rich language, written in an abugida script and spoken by roughly 16 million people in Sri Lanka. To date, there are no publicly available real-world datasets for page-level Sinhala OCR, and all previous studies assessing Sinhala OCR models have used artificially generated data. To bridge this gap, we introduce sinhala-ocr-lk-acts-1010, an annotated dataset of 1,010 page-level images and their transcriptions collected from Sri Lankan Legislative Acts published in 1981-1989 and 2000-2019, split into 707 training, 101 validation, and 202 test examples. Three deep learning-based vision-language models, namely DeepSeek-OCR V1, DeepSeek-OCR V2, and LightOnOCR-2-1B, are fine-tuned using QLoRA in 8 experiments conducted on consumer and cloud GPUs. LightOnOCR-2-1B is the top performer, achieving a character error rate (CER) of 1.05% across all test examples and outperforming state-of-the-art open-source OCR models such as Surya-OCR (8.84%) and Tesseract v5 (10.69%), as well as commercially available OCR models such as Google Document AI (2.06%). Our results suggest that LightOnOCR-2-1B outperforms the other baselines on real-world OCR tasks and maintains consistent performance across all print periods, even when documents are severely degraded.
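The CER figures quoted above are edit-distance ratios. As a minimal sketch of how such a score can be computed, the Python below takes the Levenshtein distance between a reference transcription and a model output and divides it by the reference length. The abstract does not state the paper's exact protocol, so the following are assumptions made only for illustration: text is NFC-normalized, characters are counted as Unicode code points (not grapheme clusters, which can matter for Sinhala vowel signs), and the corpus score pools edits over all pages rather than averaging per-page rates. The function names `edit_distance`, `cer`, and `corpus_cer` are hypothetical.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate for one page (assumption: NFC, code-point units)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs):
    """Pooled CER over (reference, hypothesis) pairs (assumption: micro-average)."""
    total_edits = 0
    total_chars = 0
    for reference, hypothesis in pairs:
        ref = unicodedata.normalize("NFC", reference)
        hyp = unicodedata.normalize("NFC", hypothesis)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("no reference characters to score against")
    return total_edits / total_chars
```

Under these assumptions, a CER of 1.05% corresponds to roughly one character edit per 95 reference characters. A pooled score weights long pages more heavily than a per-page average would, so the two conventions can give different numbers on the same test set.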


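1️⃣ One-Sentence Summary

This paper builds the first Sinhala OCR dataset of 1,010 real-world historical legal document images and, by fine-tuning deep learning models, finds that LightOnOCR-2-1B performs best at page-level text recognition: its character error rate is only 1.05%, significantly better than existing open-source and commercial OCR systems, and its performance stays stable on older documents from different print periods.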
