Disentangling Similarity and Relatedness in Topic Models
1️⃣ One-sentence summary
By building an LLM-based evaluation tool, this paper reveals how different topic models differ in capturing the semantic similarity and thematic relatedness of topic words, and shows that these differences effectively predict model performance on downstream tasks.
The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer in a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance, with the predictive axis depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising them across model families and corpora.
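The pipeline described in the abstract could be sketched as follows: a small two-head network that maps features of a word pair to separate similarity and relatedness scores. Everything here is an illustrative assumption, not the paper's actual architecture: the embedding source, feature construction, network dimensions, and weights (shown untrained; the paper trains on LLM-annotated word pairs) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for PLM embeddings (hypothetical; the paper would use
# embeddings from an actual pre-trained language model).
EMB = {
    "cat": rng.normal(size=8),
    "dog": rng.normal(size=8),
    "leash": rng.normal(size=8),
}

def pair_features(w1: str, w2: str) -> np.ndarray:
    """Build a word-pair feature vector from both embeddings plus their
    element-wise product and absolute difference (a common pairing scheme,
    assumed here for illustration)."""
    a, b = EMB[w1], EMB[w2]
    return np.concatenate([a, b, a * b, np.abs(a - b)])  # shape (32,)

# Two-head scorer: a shared hidden layer, then one linear head per
# semantic axis. Weights are randomly initialised; in the paper's setup
# they would be fit to LLM-annotated similarity/relatedness labels.
W_hidden = rng.normal(scale=0.1, size=(32, 16))
W_sim = rng.normal(scale=0.1, size=16)
W_rel = rng.normal(scale=0.1, size=16)

def score(w1: str, w2: str) -> tuple[float, float]:
    """Return (similarity, relatedness) scores for a word pair."""
    h = np.tanh(pair_features(w1, w2) @ W_hidden)
    return float(W_sim @ h), float(W_rel @ h)

sim, rel = score("cat", "dog")
```

Averaging such pair scores over all word pairs in a topic's top words would then give the per-topic similarity and relatedness values that the paper compares across model families.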
Source: arXiv:2603.10619