Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
1️⃣ One-Sentence Summary
This paper finds that initializing a language model's new vocabulary tokens with the mean embedding blurs their semantic features, making them hard for subsequent fine-tuning to distinguish. It proposes a simple and effective "grounded initialization" method that uses language descriptions to map new tokens to meaningful positions in the semantic space before fine-tuning, significantly improving performance on generative recommendation tasks.
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
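The contrast between mean initialization and grounding can be sketched in a few lines. This is a minimal illustration, not the paper's actual GTI procedure: it assumes a hypothetical pretrained embedding table and approximates "grounding" by averaging the embeddings of tokens in each new token's paired text description. The variable names (`E`, `descriptions`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64
E = rng.normal(size=(V, d))  # hypothetical pretrained embedding table

# Mean initialization: every new token gets the identical mean vector,
# so the three new tokens start out indistinguishable.
mean_init = np.tile(E.mean(axis=0), (3, 1))

# Grounded initialization (simplified sketch): map each new token to the
# average embedding of the token IDs appearing in its paired description.
descriptions = [rng.choice(V, size=8, replace=False) for _ in range(3)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in descriptions])

# Mean-initialized tokens collapse to a single point; grounded tokens
# occupy distinct locations in the pretrained embedding space.
print(np.linalg.norm(mean_init[0] - mean_init[1]))        # 0.0
print(np.linalg.norm(grounded_init[0] - grounded_init[1]))  # > 0
```

The point of the sketch is the geometric one made in the abstract: under mean initialization the pairwise distances between new tokens are exactly zero, so any inter-token structure must be created from scratch during fine-tuning, whereas grounded embeddings begin with distinct, description-dependent positions.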
Source: arXiv: 2604.02324