菜单

关于 🐙 GitHub
arXiv 提交日期: 2026-02-05
📄 Abstract - Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.

顶级标签: natural language processing llm model evaluation
详细标签: tokenization morphology transformer models multilingual paradigm learning 或 搜索:

动词形态范式建模:土耳其语和希伯来语分词案例研究 / Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew


1️⃣ 一句话总结

这篇论文研究发现,Transformer模型处理土耳其语和希伯来语复杂动词形态的能力,高度依赖于分词策略是否与语言本身的形态结构(如土耳其语的黏着性、希伯来语的非连接性)相匹配。

源自 arXiv: 2602.05648