An In-Vitro Study on Cross-Lingual Generalization in Language Models
1️⃣ One-sentence summary
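By constructing two artificial languages that differ only in surface form and independently controlling variables such as lexical distance and minority-language proportion, this study finds that the key to cross-lingual transfer in language models is not lexical similarity or tokenizer balance but whether tokenization preserves reusable cross-lingual substructure, and that smaller vocabularies improve transfer by keeping words decomposable into shared fragments.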
Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.
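The vocabulary-size effect described in the abstract can be illustrated with a toy sketch. This is not the paper's tokenizer or data: the greedy longest-match segmenter, the two word lists, the vocabularies, and the `shared_token_fraction` score are all hypothetical stand-ins, chosen only to show how a larger vocabulary can absorb whole words and leave no shared fragments between two languages.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            # Accept the longest piece found in the vocabulary,
            # or a single character as a last resort.
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def shared_token_fraction(words_a, words_b, vocab):
    """Fraction of language-B tokens that also occur in language A's segmentation."""
    tokens_a = {t for w in words_a for t in tokenize(w, vocab)}
    tokens_b = [t for w in words_b for t in tokenize(w, vocab)]
    return sum(t in tokens_a for t in tokens_b) / len(tokens_b)


# Hypothetical toy languages: same stems, different surface suffix.
lang_a = ["katu", "kami", "sotu"]
lang_b = ["katuz", "kamiz", "sotuz"]

# Assumed small vocabulary: only sub-word fragments.
small_vocab = {"ka", "tu", "mi", "so"}
# Assumed large vocabulary: fragments plus every whole word as its own atom.
large_vocab = small_vocab | set(lang_a) | set(lang_b)

small_score = shared_token_fraction(lang_a, lang_b, small_vocab)
large_score = shared_token_fraction(lang_a, lang_b, large_vocab)

print(f"small vocab shared fraction: {small_score:.2f}")
print(f"large vocab shared fraction: {large_score:.2f}")
```

In this toy setting the small vocabulary segments both languages into common fragments such as `ka` and `tu`, while the large vocabulary maps each word to a language-specific atom, so the shared fraction drops to zero. The score is only a rough proxy for the bridge strength the abstract mentions, whose actual definition is not given here.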
Source: arXiv: 2605.26683