arXiv submission date: 2026-04-28
📄 Abstract - Wiki Dumps to Training Corpora: South Slavic Case

This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote, where available. This step requires careful handling of raw wiki markup, first to isolate textual articles and then to extract usable natural language text within them. The second phase addresses the challenge of suspicious or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and minimal to no original content. To mitigate their impact, an n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and to remove such articles from the corpora entirely. The resulting datasets aim to provide linguistically rich texts suitable for training language models or conducting comparative research across South Slavic languages. By combining systematic extraction with quality control, this work contributes to the creation of reliable, high-information corpora that reflect authentic language use and cultural context. While the paper focuses on the South Slavic case, the approach is largely language-agnostic and can be generalised to other languages and language families.
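The two phases described above lend themselves to a brief illustration. The following Python sketch is not the authors' implementation: the use of the mwparserfromhell library for markup stripping, word 5-grams, the 0.5 overlap threshold, and the greedy keep-first-seen policy are all illustrative assumptions; the paper's actual cleaning rules and redundancy measure may differ.

```python
# Minimal sketch of the two-phase pipeline outlined in the abstract.
# Assumptions (not from the paper): mwparserfromhell for markup stripping,
# word 5-grams, a 0.5 overlap threshold, and keeping the first article seen.
import mwparserfromhell


def extract_plain_text(wiki_markup: str) -> str:
    """Phase 1: strip templates, links and formatting from raw wiki markup."""
    return mwparserfromhell.parse(wiki_markup).strip_code().strip()


def word_ngrams(text: str, n: int = 5) -> set:
    """Word n-grams used as a fingerprint of an article's phrasing."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_redundant(articles: dict, n: int = 5, threshold: float = 0.5) -> dict:
    """Phase 2: drop articles whose n-grams heavily overlap with already
    accepted articles, a symptom of database-generated, templated content."""
    seen_grams: set = set()
    kept = {}
    for title, text in articles.items():
        grams = word_ngrams(text, n)
        if not grams:
            continue  # too short to fingerprint; skip rather than keep
        overlap = len(grams & seen_grams) / len(grams)
        if overlap < threshold:
            kept[title] = text
            seen_grams |= grams
    return kept
```

As a usage sketch, one would first map `extract_plain_text` over the articles of a dump and then pass the resulting title-to-text dictionary through `filter_redundant` to obtain the cleaned corpus.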

Top-level tags: data, natural language processing, llm
Detailed tags: corpus creation, data cleaning, wikimedia, quality filtering, slavic languages

Wiki Dumps to Training Corpora: South Slavic Case


1️⃣ One-Sentence Summary

This paper presents a method for turning raw data from Wikimedia platforms (such as Wikipedia and Wikisource) into high-quality text corpora, focusing on seven South Slavic languages: text is extracted and cleaned, then n-gram techniques are used to identify and remove repetitive, low-quality articles, yielding reliable datasets suitable for training language models or for cross-lingual research.

Source: arXiv: 2604.25384