Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

📄 Abstract - Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG

PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods for automated PDF processing, including promising Retrieval-Augmented Generation (RAG) systems. However, there is no comprehensive study of how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we present such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.


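1️⃣ One-sentence summary

By systematically evaluating different PDF parsing tools and text chunking strategies, this paper provides practical guidance for building more reliable question answering systems over financial documents.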
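
The abstract mentions chunking strategies "with varied overlap" without spelling out what that means in practice. As a rough illustration only, the sketch below shows one common form, fixed-size chunking with a sliding overlap, applied to text that has already been extracted from a PDF. The function name `chunk_with_overlap`, its parameters, and the whitespace tokenization are assumptions made for this example; they are not taken from the paper, which may define its chunking strategies differently.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into fixed-size chunks that share `overlap` tokens.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk is emitted that only repeats overlap tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumed input: plain text already produced by some PDF parser.
    text = "Revenue rose 12 percent in the fourth quarter"
    for chunk in chunk_with_overlap(text.split(), chunk_size=4, overlap=1):
        print(" ".join(chunk))
```

A larger overlap makes it less likely that a sentence or table row relevant to a question is split across two chunks, at the cost of indexing more redundant text; that trade-off is the kind of design choice the study evaluates.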
