TopoChunker: Topology-Aware Agentic Document Chunking Framework

📄 Abstract - TopoChunker: Topology-Aware Agentic Document Chunking Framework

Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating "semantic fragmentation" that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.
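To make the idea of preserving hierarchical lineage concrete, here is a minimal sketch of the general technique, not the paper's implementation. It assumes Markdown-style `#` headings as the only structural signal, and the names `Node`, `parse_outline`, and `chunks_with_lineage` are hypothetical. The toy tree stands in loosely for a structured intermediate representation; the Inspector and Refiner agents are not modeled, and the sample document is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section in a toy section tree (hypothetical; not the paper's SIR)."""
    title: str
    level: int
    body: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


def parse_outline(text: str) -> Node:
    """Build a section tree from Markdown-style '#' headings (an assumption)."""
    root = Node(title="ROOT", level=0)
    stack = [root]
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            node = Node(title=stripped.lstrip("#").strip(), level=level)
            # Pop until the top of the stack is a proper ancestor of this heading.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Plain text belongs to the most recently opened section.
            stack[-1].body.append(stripped)
    return root


def chunks_with_lineage(node: Node, path: tuple[str, ...] = ()) -> list[dict]:
    """Emit one chunk per non-empty section body, tagged with its heading path."""
    here = path if node.level == 0 else path + (node.title,)
    out = []
    if node.body:
        out.append({"lineage": " > ".join(here), "text": " ".join(node.body)})
    for child in node.children:
        out.extend(chunks_with_lineage(child, here))
    return out


# Made-up toy document, used only to exercise the sketch.
DOC = """
# Budget Report
## Findings
Costs rose in 2020.
## Recommendations
### Staffing
Hire two auditors.
"""

chunks = chunks_with_lineage(parse_outline(DOC))
for c in chunks:
    print(f"[{c['lineage']}] {c['text']}")
```

A purely linear splitter would return "Hire two auditors." with no indication that it sits under Recommendations > Staffing. Carrying the heading path with each chunk is one simple way to keep that context available at retrieval time.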


1️⃣ One-Sentence Summary

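This paper proposes TopoChunker, a new framework in which two agents collaborate to analyze and segment documents. It efficiently preserves a document's original hierarchical structure and semantic relationships, which significantly improves downstream retrieval accuracy while also reducing computational cost.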
