arXiv submission date: 2026-01-19
📄 Abstract - A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms) representing a 10x scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
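Phase (1) of the abstract can be illustrated with a small sketch. The abstract does not give the linkage criterion, distance metric, or cut-off, so the average linkage, cosine distance, and `threshold=0.2` below are assumptions, and the three-dimensional `toy_vectors` are invented stand-ins for real FastText embeddings. The function names `agglomerative_clusters` and `candidate_pairs` are hypothetical, and the clustering is written in plain Python rather than with the library the authors may have used.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering over a {word: vector} dict.

    Starts with one cluster per word and repeatedly merges the closest
    pair of clusters until the closest pair is farther apart than
    `threshold` (an assumed cut-off, not a value from the paper).
    """
    clusters = [[word] for word in vectors]

    def linkage(a, b):
        # Average pairwise cosine distance between two clusters.
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


def candidate_pairs(clusters):
    """List every within-cluster word pair as a candidate for later labeling."""
    return [pair for cluster in clusters for pair in combinations(sorted(cluster), 2)]


# Invented toy vectors; real FastText embeddings have hundreds of dimensions.
toy_vectors = {
    "büyük": [0.9, 0.1, 0.0],  # "big"
    "iri": [0.8, 0.2, 0.0],    # "large"
    "küçük": [0.7, 0.3, 0.1],  # "small"
    "kedi": [0.0, 0.1, 0.9],   # "cat"
    "köpek": [0.1, 0.0, 0.9],  # "dog"
}

clusters = agglomerative_clusters(toy_vectors, threshold=0.2)
pairs = candidate_pairs(clusters)
print(clusters)
print(pairs)
```

In this toy setup the antonym "küçük" lands in the same cluster as "büyük" and "iri". Embedding proximity alone does not separate synonyms from antonyms or co-hyponyms, which is consistent with the abstract's separate classification phase (2).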

Top-level tags: natural language processing, data, llm
Detailed tags: low-resource languages, semantic relations, dataset generation, turkish nlp, word embeddings

A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus


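1️⃣ One-sentence summary

This paper proposes a low-cost, scalable hybrid method and uses it to build a large-scale semantic relations dataset for Turkish, addressing the data scarcity that low-resource languages face in natural language processing.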

Source: arXiv: 2601.13253