SemHash-LLM: A Multi-Granularity Semantic Hashing Framework for Document Deduplication
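1️⃣ One-sentence summary
This paper proposes SemHash-LLM, a document deduplication framework that fuses semantic hashing at the character, token, and document levels and calls on a large language model only for a small number of critical judgments, detecting duplicate documents in large corpora both efficiently and accurately at less than 1% neural verification cost.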
Large-scale document deduplication must preserve semantic equivalence while remaining efficient over massive corpora. We present SemHash-LLM, a multi-granularity framework that unifies semantic projection hashing, attention-weighted MinHash, contrastive boundary learning, and selective LLM-based adjudication. The method combines character-, token-, and document-level signals through gated fusion, then applies a cascaded filtering pipeline for efficient candidate reduction. Semantic projection hashing learns compact binary codes in a distilled LLM embedding space, while attention-weighted MinHash suppresses boilerplate and emphasizes informative content. Adaptive decision boundaries and uncertainty estimation further improve robustness to template pollution, short-text perturbation, containment, and viral fragments. Experiments show that SemHash-LLM achieves strong duplicate detection quality with less than one percent neural verification cost.
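To make the idea of a MinHash that suppresses boilerplate concrete, here is a minimal sketch of a weighted MinHash. It is not the paper's method: the abstract does not say how attention weights are computed or applied, so the per-token weights below are hand-set, hypothetical stand-ins, and the quantize-and-replicate scheme is a simple swapped-in technique. The names `weighted_minhash`, `estimate_similarity`, `num_perm`, and `levels` are all illustrative assumptions.

```python
from __future__ import annotations

import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_minhash(
    weights: dict[str, float], num_perm: int = 128, levels: int = 8
) -> list[int]:
    """Build a MinHash signature in which heavier tokens count more.

    Assumption: `weights` maps each token to an importance in [0, 1]
    (a stand-in for attention weights). Each weight is quantized to an
    integer number of copies, and plain MinHash runs on the expanded set,
    so the signature estimates a weighted Jaccard similarity.
    """
    expanded = [
        f"{token}#{copy}"
        for token, weight in weights.items()
        for copy in range(round(max(0.0, min(1.0, weight)) * levels))
    ]
    if not expanded:
        raise ValueError("every weight quantized to zero; nothing to hash")
    return [min(_hash(seed, item) for item in expanded) for seed in range(num_perm)]


def estimate_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which two documents agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical documents: low weights mark template tokens, high weights mark content.
article = {"cookie": 0.1, "policy": 0.1, "subscribe": 0.1, "hashing": 1.0, "dedup": 1.0}
same_template = {"cookie": 0.1, "policy": 0.1, "subscribe": 0.1, "football": 1.0, "score": 1.0}
same_content = {"login": 0.1, "share": 0.1, "footer": 0.1, "hashing": 1.0, "dedup": 1.0}

template_sim = estimate_similarity(weighted_minhash(article), weighted_minhash(same_template))
content_sim = estimate_similarity(weighted_minhash(article), weighted_minhash(same_content))
```

Under these assumed weights, two pages that share only a template should score far lower than two pages that share their informative tokens, whereas an unweighted Jaccard over the same token sets would rate the two pairs much closer together (3/7 versus 2/8).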
Source: arXiv: 2607.01601