CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval

📄 Abstract - CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval

General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at this https URL.
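The cluster-guided partial-table step described in the abstract can be illustrated with a minimal sketch. It assumes that "table instances" are rows and that each row already has an embedding vector; the abstract does not confirm either detail. The function names (`kmeans`, `build_partial_table`), the `per_cluster` parameter, and the toy 2-D embeddings are hypothetical and are not taken from the authors' code.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over lists of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of the members; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_partial_table(header, rows, embeddings, k, per_cluster=1, seed=0):
    """Sample rows from every cluster so the partial table covers diverse content.

    Assumption: embeddings[i] is the vector for rows[i].
    """
    labels = kmeans(embeddings, k, seed=seed)
    rng = random.Random(seed)
    picked = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        picked.extend(rng.sample(idx, min(per_cluster, len(idx))))
    # Keep the original row order inside the partial table.
    return [header] + [rows[i] for i in sorted(picked)]


if __name__ == "__main__":
    header = ["city", "country"]
    rows = [["Paris", "France"], ["Lyon", "France"], ["Nice", "France"],
            ["Osaka", "Japan"], ["Kyoto", "Japan"], ["Tokyo", "Japan"]]
    # Toy embeddings: two well-separated groups standing in for real row vectors.
    emb = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    for line in build_partial_table(header, rows, emb, k=2):
        print(line)
```

Sampling per cluster, rather than taking the first few rows, is what broadens semantic coverage: each partial table handed to the LLM contains rows from different regions of the embedding space.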


1️⃣ One-Sentence Summary

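This paper proposes CGPT, a training framework that uses clustering to construct semantically diverse partial tables and has a large language model generate queries for them as supervision signals. The embedding model is then fine-tuned with contrastive learning on these queries, which significantly improves large-scale table retrieval and cross-domain generalization.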
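The contrastive fine-tuning objective can be sketched for a single training example as follows. The source does not state the exact loss, so this assumes a standard InfoNCE-style objective over cosine similarities. The function name `info_nce_loss` and the temperature value of 0.05 are illustrative assumptions, not details taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive, hard_negatives, temperature=0.05):
    """Negative log-softmax of the positive table's similarity to the query.

    query: embedding of an LLM-generated synthetic query
    positive: embedding of the partial table the query was generated from
    hard_negatives: embeddings of similar-looking but non-matching tables
    """
    q = normalize(query)
    candidates = [positive] + list(hard_negatives)
    sims = [dot(q, normalize(t)) / temperature for t in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(sims)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - sims[0]


if __name__ == "__main__":
    # The loss is small when the query aligns with its source table...
    print(info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
    # ...and large when a hard negative is closer to the query than the positive.
    print(info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]]))
```

Minimizing this quantity pulls each synthetic query toward the table it was generated from and pushes it away from the hard negatives, which is the general mechanism by which generated queries can act as supervision for the embedding model.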
