arXiv submission date: 2026-06-21
📄 Abstract - ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models

Multilingual Language Models like mBERT are widely used for low-resource NLP, yet their adaptation to morphologically inconsistent languages such as Roman Urdu remains underexplored. Roman Urdu spelling variation causes severe sub-word fragmentation, averaging 1.50 sub-words per token. We propose ROMEVA (Roman Urdu Embedding-preserving Vocabulary Adaptation), which combines sub-word-average initialization and a PCA-guided anchor loss to stabilize embeddings during vocabulary expansion. Using a 36,130-comment Roman Urdu corpus, we add 500 highly fragmented tokens to mBERT and compare naive fine-tuning, sub-word-aware fine-tuning, and ROMEVA. While ROMEVA most effectively preserves the pretrained embedding space, naive fine-tuning achieves the strongest downstream sentiment classification performance. These findings reveal a disconnect between embedding stability and downstream performance, suggesting that stronger adaptation may be preferable to strict embedding preservation in morphologically inconsistent languages.
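As a rough illustration of the sub-word-average initialization mentioned in the abstract, the sketch below sets each newly added token's embedding to the mean of the embeddings of the sub-word pieces it previously split into. It is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function `subword_average_init`, the toy tokenizer `toy_tokenize`, the spelling variants, and the 2-dimensional embedding values are all hypothetical, and the PCA-guided anchor loss is not shown.

```python
import numpy as np


def subword_average_init(embeddings, tokenize, new_tokens):
    """Append one row per new token, set to the mean of its sub-word rows.

    embeddings: (vocab_size, dim) array of existing embeddings.
    tokenize:   callable mapping a word to a list of existing sub-word ids.
    new_tokens: words to be added to the vocabulary as single tokens.
    """
    new_rows = []
    for token in new_tokens:
        piece_ids = tokenize(token)
        if not piece_ids:
            raise ValueError(f"no sub-word pieces for {token!r}")
        # Average the rows of the pieces the word used to fragment into.
        new_rows.append(embeddings[piece_ids].mean(axis=0))
    # Existing rows stay untouched; new rows are appended at the end.
    return np.vstack([embeddings, np.stack(new_rows)])


# Hypothetical toy setup: 4 existing sub-words with 2-dimensional embeddings.
embeddings = np.array([
    [1.0, 0.0],  # id 0: "ac"
    [0.0, 1.0],  # id 1: "##ha"
    [2.0, 2.0],  # id 2: "##cha"
    [0.5, 0.5],  # id 3: "a"
])


def toy_tokenize(word):
    # Hypothetical fragmentation of two spelling variants of the same word.
    table = {"acha": [0, 1], "accha": [0, 2]}
    return table[word]


expanded = subword_average_init(embeddings, toy_tokenize, ["acha", "accha"])
print(expanded.shape)  # (6, 2)
```

With a real model, the same idea would be applied to the model's input embedding matrix after the tokenizer vocabulary has been extended; the exact procedure used in the paper is not given in the abstract.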

Top-level tags: natural language processing, llm
Detailed tags: multilingual models, vocabulary expansion, low-resource nlp, roman urdu, sentiment classification

ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models


1️⃣ One-sentence summary

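To address the inefficient tokenization that multilingual models suffer on Roman Urdu because of its non-standardized spelling, this paper proposes ROMEVA, a vocabulary expansion method that combines sub-word-average initialization with a PCA-guided anchor loss to stabilize word embeddings; the experiments show that although ROMEVA best preserves the pretrained embedding space, naive fine-tuning performs better on sentiment classification, which suggests that for languages without fixed spelling, strictly preserving the original embeddings may be less effective than letting the model adapt more flexibly to the new vocabulary.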
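The tokenization inefficiency described above is usually quantified as the average number of sub-word pieces per token (the abstract reports 1.50 for Roman Urdu under mBERT). The sketch below shows one way such a rate could be computed. It assumes whitespace-separated words count as tokens, and both the tokenizer `toy_tokenize` (which simply cuts words into 4-character chunks) and the example comments are hypothetical stand-ins, not the paper's tokenizer or data.

```python
def mean_subwords_per_token(sentences, tokenize):
    """Average number of sub-word pieces per whitespace-separated word."""
    total_words = 0
    total_pieces = 0
    for sentence in sentences:
        for word in sentence.split():
            total_words += 1
            total_pieces += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_pieces / total_words


def toy_tokenize(word):
    # Hypothetical tokenizer: split every word into chunks of at most 4 characters.
    return [word[i:i + 4] for i in range(0, len(word), 4)]


# Hypothetical Roman Urdu comments, for illustration only.
comments = ["yeh film bohat achi hai", "bilkul bakwas"]
rate = mean_subwords_per_token(comments, toy_tokenize)
print(round(rate, 2))  # 1.43 (10 pieces over 7 words)
```

Words that fragment into the most pieces under such a measure would be natural candidates for vocabulary expansion, although the paper's exact selection criterion for its 500 added tokens is not stated here.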

Source: arXiv: 2606.22478