arXiv submission date: 2026-05-21
📄 Abstract - Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

An extraction schema should not reduce knowledge graph fidelity. On statistical CSV, however, it can. We study country-by-year time-series matrices, a common layout on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% CIs are strictly positive on 4/6 datasets, with strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families show the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta <= 1pp), whereas direct graph access exposes gaps up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.
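The super-additivity claim above is an interaction contrast in a 2x2 factorial design: the joint effect of changing both factors, minus the sum of the two single-factor effects. The following is a minimal sketch of that arithmetic with a percentile bootstrap, not the authors' implementation. The abstract does not state the underlying metric or the exact estimator, so the function names, the cell encoding `(format_on, schema_on)`, and the toy scores are all illustrative assumptions.

```python
import random
from statistics import mean


def interaction_contrast(cells):
    """Return joint effect minus the sum of the two single-factor effects.

    `cells` maps (format_on, schema_on) -> list of per-item scores.
    This encoding is hypothetical; the abstract does not specify one.
    """
    m = {key: mean(scores) for key, scores in cells.items()}
    format_effect = m[(1, 0)] - m[(0, 0)]
    schema_effect = m[(0, 1)] - m[(0, 0)]
    joint_effect = m[(1, 1)] - m[(0, 0)]
    return joint_effect - (format_effect + schema_effect)


def bootstrap_ci(cells, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling items within each cell."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        resampled = {
            key: [rng.choice(scores) for _ in scores]
            for key, scores in cells.items()
        }
        stats.append(interaction_contrast(resampled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy scores, invented purely for illustration (not data from the paper).
toy_cells = {
    (0, 0): [0.20, 0.20],
    (1, 0): [0.30, 0.30],
    (0, 1): [0.25, 0.25],
    (1, 1): [0.90, 0.90],
}
contrast = interaction_contrast(toy_cells)
ci_low, ci_high = bootstrap_ci(toy_cells, n_resamples=200)
```

A contrast of zero would mean the two factors combine additively; a confidence interval lying strictly above zero corresponds to the "strictly positive" reading in the abstract.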

Top-level tags: knowledge graph llm data
Detailed tags: knowledge graph construction statistical tables schema constraints benchmark format-coupling

Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables


1️⃣ One-sentence summary

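This paper finds that, when knowledge graphs are built from statistical tables (such as country-by-year time-series matrices in CSV), serialization format and extraction schema constraints interact super-additively: their joint effect exceeds the sum of their independent effects by up to +1.180, and a schema applied to a mismatched format can push fact coverage below the unconstrained baseline on 4 of 6 datasets, a pattern the authors call "format-constraint coupling"; they further show that standard retrieval modes largely mask construction quality while direct graph access exposes gaps of up to 47.6 percentage points, and they release the CSVFidelity-Bench benchmark to support fidelity-aware evaluation.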

Source: arXiv: 2605.21974