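TopoChunker: Topology-Aware Agentic Document Chunking Framework
1️⃣ One-sentence summary
This paper proposes TopoChunker, a framework in which two agents cooperate to analyze and split documents. It preserves a document's original hierarchy and semantic links, which improves downstream retrieval accuracy while also lowering computational cost.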
Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
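To make the idea of preserving hierarchical lineage concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Node` class, the `chunk_with_lineage` function, and the sample document are all hypothetical, and the tree is only a simplified stand-in for the SIR. Each chunk carries the path of headings above it, so a chunk retrieved in isolation still records where it sat in the document.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """One section of a document tree (hypothetical stand-in for an SIR node)."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def chunk_with_lineage(node: Node, lineage: Tuple[str, ...] = ()) -> Iterator[Dict[str, str]]:
    """Yield one chunk per non-empty section, tagged with its heading path."""
    path = lineage + (node.title,)
    if node.text:
        yield {"lineage": " > ".join(path), "text": node.text}
    # Depth-first traversal keeps chunks in reading order.
    for child in node.children:
        yield from chunk_with_lineage(child, path)


# Illustrative document; the titles and text are invented for this example.
doc = Node("Report", children=[
    Node("Budget", "Overview of spending.", [
        Node("FY2020", "Costs rose."),
    ]),
    Node("Staffing", "Headcount was flat."),
])

chunks = list(chunk_with_lineage(doc))
for chunk in chunks:
    print(chunk["lineage"], "|", chunk["text"])
```

A linear splitter would emit "Costs rose." with no indication that it belongs to the FY2020 budget section; keeping the heading path alongside the text is one simple way to avoid that kind of fragmentation.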
Source: arXiv: 2603.18409