arXiv submission date: 2026-04-20
📄 Abstract - Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems

Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.
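
The abstract does not spell out how the Entanglement Index is computed, so the following is only a minimal sketch under stated assumptions, not the authors' definition. It uses a hypothetical stand-in, `neighbor_overlap_index`, defined here as the mean fraction of each chunk's k nearest neighbours (by cosine similarity) that carry a different topic label, alongside a plain Top-K precision function. The toy 2-D vectors, topic labels, and function names are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbor_overlap_index(embeddings, topics, k):
    """Hypothetical entanglement proxy (an assumption, not the paper's EI).

    For each chunk, take its k nearest neighbours under cosine similarity
    and measure the fraction whose topic label differs from the chunk's own.
    Returns the mean of that fraction over all chunks, in [0, 1].
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
        top = others[:k]
        total += sum(topics[j] != topics[i] for j in top) / len(top)
    return total / n


def precision_at_k(query, embeddings, relevant, k):
    """Fraction of the k most cosine-similar chunks whose index is in `relevant`."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda j: cosine(query, embeddings[j]), reverse=True)
    top = ranked[:k]
    return sum(j in relevant for j in top) / len(top)


# Toy data (illustrative only): two topics, "A" and "B".
topics = ["A", "B", "A", "B"]
# Entangled layout: chunks of different topics sit next to each other.
entangled = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
# Separated layout: same chunks reordered so each topic has its own region.
separated_topics = ["A", "A", "B", "B"]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

query = [1.0, 0.0]  # a query about topic "A"
ei_entangled = neighbor_overlap_index(entangled, topics, k=1)
ei_separated = neighbor_overlap_index(separated, separated_topics, k=1)
p_entangled = precision_at_k(query, entangled, relevant={0, 2}, k=2)
p_separated = precision_at_k(query, separated, relevant={0, 1}, k=2)
```

In this toy setup the entangled layout scores higher on the overlap proxy and lower on Top-K precision than the separated layout, which mirrors the direction of the relationship the abstract argues for. It says nothing about the magnitudes reported in the paper.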

Top-level tags: llm, agents, systems
Detailed tags: retrieval-augmented generation, semantic entanglement, embedding space, document preprocessing, retrieval precision

Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems


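1️⃣ One-sentence summary

The paper finds that when documents interleave multiple topics, their vector representations overlap (a condition the authors call semantic entanglement), which lowers retrieval precision; in response, the authors propose a preprocessing pipeline that restructures documents according to how they are used, raising Top-K retrieval precision from approximately 32% to approximately 82% on their enterprise healthcare knowledge base.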

Source: arXiv: 2604.17677