TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
1️⃣ One-sentence summary
The paper proposes a method named TokAlign++ that treats the source and target vocabularies as two languages and learns a bilingual token alignment lexicon, allowing the vocabulary of a large language model to be adapted efficiently. This markedly improves text compression rates, preserves most of the model's ability, and makes knowledge distillation between different models more effective.
Tokenization is a foundational step in the text processing pipeline of Large Language Models (LLMs): text must first be tokenized into token IDs before it is fed to the model. Inefficient tokenization produces long token-ID sequences and slows down both training and inference. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce TokAlign++, a method that improves vocabulary adaptation by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and a bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged for the new vocabulary following this bilingual lexicon and then progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of the vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying the vocabularies of vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.
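To make the described pipeline more concrete, below is a minimal sketch of the two central steps: learning a token alignment lexicon from separately learned token representations, and rearranging the embedding matrix for the new vocabulary. This is an illustrative assumption-laden sketch, not the paper's implementation; the function names, toy vocabulary sizes, and greedy cosine nearest-neighbour matching are all hypothetical choices.

```python
import numpy as np

def build_alignment_lexicon(src_tok_vecs: np.ndarray,
                            tgt_tok_vecs: np.ndarray) -> np.ndarray:
    """For each target token, pick the most similar source token by cosine
    similarity between their separately learned token vectors (illustrative
    greedy matching; returns align[j] = index of the aligned source token)."""
    src = src_tok_vecs / np.linalg.norm(src_tok_vecs, axis=1, keepdims=True)
    tgt = tgt_tok_vecs / np.linalg.norm(tgt_tok_vecs, axis=1, keepdims=True)
    sim = tgt @ src.T                      # (|V_tgt|, |V_src|) cosine similarities
    return sim.argmax(axis=1)              # greedy nearest-neighbour per target token

def rearrange_embeddings(src_embedding: np.ndarray,
                         alignment: np.ndarray) -> np.ndarray:
    """Initialize the new-vocabulary embedding matrix by copying, for every
    target token, the row of its aligned source token; the rest of the model
    stays untouched before progressive fine-tuning."""
    return src_embedding[alignment].copy()

# Toy usage: random vectors stand in for learned token representations,
# and small vocabularies stand in for realistic ones (assumed sizes).
rng = np.random.default_rng(0)
src_vecs = rng.normal(size=(2000, 64))     # source-vocab token vectors
tgt_vecs = rng.normal(size=(3000, 64))     # target-vocab token vectors
src_emb = rng.normal(size=(2000, 512))     # vanilla model's input embeddings
lexicon = build_alignment_lexicon(src_vecs, tgt_vecs)
new_emb = rearrange_embeddings(src_emb, lexicon)   # shape (3000, 512)
```

In this sketch the alignment is only used to initialize the target-vocabulary embeddings; the subsequent progressive fine-tuning described in the abstract is what restores the vanilla model's performance.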
Source: arXiv: 2605.13429