Abstract - Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search
Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
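The split the abstract describes, with document weights prepared offline and only tokenization plus IDF weighting done at query time, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the `tokenize` function is a hypothetical stand-in that cuts words into fixed chunks of at most 3 characters (the paper uses a learned domain-specific subword tokenizer), and the document-side weights are plain token counts standing in for the neural term weights and expansions that the real system pre-computes.

```python
from __future__ import annotations

import math
from collections import defaultdict

MAX_TOKEN_CHARS = 3  # mirrors the abstract's "max 3 chars" constraint


def tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer: fixed chunks of at most 3 characters.

    The paper's tokenizer is a learned subword vocabulary; this is only a placeholder.
    """
    tokens = []
    for word in text.lower().split():
        for i in range(0, len(word), MAX_TOKEN_CHARS):
            tokens.append(word[i:i + MAX_TOKEN_CHARS])
    return tokens


def build_index(docs: dict[str, str]):
    """Offline phase: build token -> {doc_id: weight} postings and an IDF table.

    Assumption: weights are token counts. In the system described above they
    would come from a neural encoder run once at indexing time.
    """
    postings: dict[str, dict[str, float]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            postings[tok][doc_id] = postings[tok].get(doc_id, 0.0) + 1.0
    n_docs = len(docs)
    idf = {tok: math.log(1.0 + n_docs / len(p)) for tok, p in postings.items()}
    return postings, idf


def search(query: str, postings, idf, k: int = 10):
    """Online phase: tokenize, look up IDF, accumulate scores. No model inference."""
    scores: dict[str, float] = defaultdict(float)
    for tok in set(tokenize(query)):
        for doc_id, weight in postings.get(tok, {}).items():
            scores[doc_id] += idf[tok] * weight
    # Sort by descending score, then by doc_id for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    corpus = {
        "d1": "bohemian rhapsody",
        "d2": "bohemian like you",
        "d3": "rhapsody in blue",
    }
    postings, idf = build_index(corpus)
    # Misspelled query: "rhapsady" still shares the chunks "rha" and "dy" with "rhapsody".
    for doc_id, score in search("bohemian rhapsady", postings, idf):
        print(doc_id, round(score, 3))
```

Because the query side only performs a dictionary lookup per token, its cost is independent of any model size, which is the property the abstract refers to as inference-free. Fixed chunking is fragile to insertions and deletions that shift chunk boundaries; that weakness is one reason a learned tokenizer and learned term expansion would be preferable to this placeholder.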
Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search
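1️⃣ One-Sentence Summary
This paper presents a neural sparse retrieval system designed for large-scale industrial search settings such as Amazon Music. By combining neural embeddings pre-computed offline with a short-token constraint (at most 3 characters), it matches user query variants such as misspellings and transpositions efficiently and robustly under millisecond-level latency, raising recall@10 to 91.4% compared with 57.7% for the traditional trigram baseline.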