Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
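1️⃣ One-sentence summary
This paper finds that a Transformer model's ability to handle the complex verbal morphology of Turkish and Hebrew depends heavily on whether the tokenization strategy matches the language's own morphological structure (e.g., the agglutinative nature of Turkish and the non-concatenative nature of Hebrew).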
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, whereas a monolingual model with morpheme-aware segmentation performs well. Performance improves for all models on more synthetic datasets.
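To make the tokenization granularities named in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's method or any model's actual tokenizer. The function names (`atomic`, `char_level`, `suffix_split`) and the two-entry suffix list are hypothetical and chosen only for illustration. The example word is Turkish "geliyorum" ("I am coming"), which divides into the stem "gel", the progressive suffix "iyor", and the first-person singular suffix "um".

```python
def atomic(word):
    """Atomic tokenization: the whole word form is a single token."""
    return [word]


def char_level(word):
    """Character-level tokenization: every character is its own token."""
    return list(word)


def suffix_split(word, suffixes):
    """Toy morpheme-aware segmentation (illustrative assumption only).

    Repeatedly peels a known suffix off the end of the word until no
    listed suffix matches, then returns the stem followed by the suffixes
    in their surface order.
    """
    pieces = []
    rest = word
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            # Require a non-empty remainder so the stem is never consumed.
            if rest.endswith(suf) and len(rest) > len(suf):
                pieces.insert(0, suf)
                rest = rest[: -len(suf)]
                changed = True
                break
    return [rest] + pieces


if __name__ == "__main__":
    word = "geliyorum"
    toy_suffixes = ["um", "iyor"]  # hypothetical, hand-picked suffix list
    print(atomic(word))
    print(char_level(word))
    print(suffix_split(word, toy_suffixes))
```

Suffix peeling of this kind only suits concatenative morphology, where markers attach in sequence. In Hebrew's root-and-pattern system the root consonants are interleaved with the pattern, so no contiguous split of the surface string isolates the morphemes, which is one way to read the contrast the abstract reports.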
Source: arXiv: 2602.05648