Abstract - SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using Self-Referential Atomicity Scoring, a method inspired by Minimum Description Length (MDL) that filters out composite forms through internal structural cues and is therefore suited to low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: this https URL
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
1️⃣ One-sentence summary
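This paper presents SampoNLP, a toolkit that automatically builds high-quality morphological lexicons for morphologically complex languages such as Finnish and Hungarian, uses those lexicons to carry out the first systematic evaluation of BPE tokenizers across vocabulary sizes, identifies the optimal vocabulary size for each language, and exposes the limitations of standard BPE on highly agglutinative languages.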
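The idea of reading an "elbow point" off a score-versus-vocabulary-size curve can be illustrated with a generic heuristic. The sketch below is not the paper's procedure and does not compute IPS (neither is specified in the abstract). It assumes a list of vocabulary sizes in increasing order with one score per size, and picks the point farthest from the straight line joining the first and last points. The function name `find_elbow` and the example scores are hypothetical and made up purely for illustration.

```python
import math


def find_elbow(vocab_sizes, scores):
    """Return the vocabulary size at the elbow of a score curve.

    Generic maximum-distance-to-chord heuristic (an assumption, not the
    paper's method): the elbow is the point farthest from the straight
    line joining the first and last points of the curve.
    Expects vocab_sizes in strictly increasing order.
    """
    if len(vocab_sizes) != len(scores) or len(scores) < 3:
        raise ValueError("need at least three (size, score) pairs")

    # Vocabulary sizes span orders of magnitude, so use a log2 axis.
    xs = [math.log2(v) for v in vocab_sizes]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("flat curve has no elbow")

    # Rescale both axes to [0, 1] so neither dominates the distance.
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]
    ny = [(s - lo) / (hi - lo) for s in scores]

    # Perpendicular distance of each point from the first-to-last chord.
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    norm = math.hypot(dx, dy)

    def distance(i):
        return abs(dy * (nx[i] - nx[0]) - dx * (ny[i] - ny[0])) / norm

    best = max(range(len(nx)), key=distance)
    return vocab_sizes[best]


# Made-up scores for illustration only; these are NOT results from the paper.
sizes = [8_000, 16_000, 32_000, 64_000, 128_000, 256_000]
example_scores = [0.40, 0.55, 0.64, 0.67, 0.68, 0.685]
elbow = find_elbow(sizes, example_scores)
print(elbow)
```

On a log-scaled axis, doubling the vocabulary is one equal step, so the heuristic marks the size after which each further doubling buys noticeably less improvement. Other elbow criteria (for example, thresholds on the marginal gain) would be equally plausible choices.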