arXiv submission date: 2026-02-05
📄 Abstract - Evaluating the impact of word embeddings on similarity scoring in practical information retrieval

Search behaviour is characterised by synonymy and polysemy, as users often want to search for information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high-dimensional spaces. This can be leveraged by Information Retrieval (IR) systems to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query-statement similarity that moves away from the common similarity measure of centroids of neural word embeddings. Motivated by the Word Mover's Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top-performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models, including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, results in domain-agnostic language processing solutions that are portable to diverse business use-cases.
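The abstract contrasts two ways of scoring a query against a statement: comparing the centroids of their word embeddings, and comparing individual words in the manner of WMD. The sketch below illustrates that contrast only; it is not the paper's pipeline. The embedding table `EMB` holds made-up 2-D vectors rather than GloVe vectors, and `relaxed_wmd` is the relaxed lower bound of WMD (each word moves all its weight to its nearest counterpart), swapped in for the full optimal-transport computation to keep the example dependency-free. All function names are hypothetical.

```python
import math

# Hypothetical toy embeddings (2-D, values invented for illustration).
# A real system would load pre-trained vectors such as GloVe instead.
EMB = {
    "refund": (0.9, 0.1),
    "reimbursement": (0.85, 0.2),
    "policy": (0.1, 0.9),
    "rules": (0.2, 0.85),
    "weather": (-0.8, -0.3),
    "today": (-0.6, -0.6),
}


def centroid(tokens):
    """Average the word vectors of a token list."""
    vecs = [EMB[t] for t in tokens]
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def centroid_similarity(query, statement):
    """Cosine similarity between the two centroids (higher is closer)."""
    u, v = centroid(query), centroid(statement)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def relaxed_wmd(query, statement):
    """Relaxed WMD with uniform word weights (lower is closer).

    Each word sends all of its weight to the nearest word in the other
    text; the larger of the two directional costs is returned. This is
    a lower bound on the full WMD, not the exact transport cost.
    """
    def one_way(src, dst):
        return sum(
            min(math.dist(EMB[s], EMB[d]) for d in dst) for s in src
        ) / len(src)

    return max(one_way(query, statement), one_way(statement, query))


def rank_by_relaxed_wmd(query, statements):
    """Return the candidate statements sorted from closest to farthest."""
    return sorted(statements, key=lambda s: relaxed_wmd(query, s))


if __name__ == "__main__":
    query = ["refund", "policy"]
    candidates = [["weather", "today"], ["reimbursement", "rules"]]
    for cand in rank_by_relaxed_wmd(query, candidates):
        print(cand, round(relaxed_wmd(query, cand), 3))
```

The centroid score collapses each text to a single point before comparing, whereas the word-level distance keeps every word's nearest match in play, which is the property the abstract credits for the ranking gains.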

Top-level tags: natural language processing machine learning data
Detailed tags: word embeddings information retrieval similarity scoring word movers distance semantic search

Evaluating the impact of word embeddings on similarity scoring in practical information retrieval


1️⃣ One-sentence summary

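This paper finds that, in information retrieval, combining Word Mover's Distance (WMD) with pre-trained word embeddings (such as GloVe) to measure the similarity between queries and documents is more accurate than traditional methods, captures the meaning of language better, and applies to a variety of practical scenarios.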

Source: arXiv: 2602.05734