arXiv submission date: 2026-05-04
📄 Abstract - ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias

The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond an optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, where we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of Nordisk familjebok, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. On a small-scale evaluation, we reached a 93% precision on the cross-edition matching, 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.

Top-level tags: natural language processing data
Detailed tags: encyclopedia analysis entity linking cross-edition matching historical knowledge pipeline

ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias


1️⃣ One-sentence summary

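This paper presents an automated pipeline for processing digitized historical encyclopedias that extracts headwords from the OCR text, categorizes the entities, matches entries across editions, and links them to Wikidata items; applied to the four major editions of the Swedish encyclopedia Nordisk familjebok, it reaches F1 scores of 97.8% for headword extraction and 93.4% for headword classification and, in a small-scale evaluation, 93% precision for cross-edition matching and 85% precision with 16.5% recall for Wikidata linking, offering a practical route to preserving historical knowledge and tracking it across editions.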

Source: arXiv: 2605.02466