Abstract - EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
Local factual edits in scientific manuscripts often create non-local revision obligations. If a dataset changes from 215 to 80 documents, claims such as 'medium-scale' or 'a few hundred items' may also become stale, even though they do not repeat the edited number. In an audit of recent arXiv cs.CL benchmark and dataset papers, we find fact-dependent qualitative claims in 37.2% of papers, suggesting that this dependency pattern is common in the target genre. We introduce EditPropBench, a benchmark for measuring whether LLM editors propagate factual edits through dependent manuscript claims. Each item contains an ML/NLP-style synthetic manuscript, a targeted edit, and a controlled fact graph with sentence-level labels for direct targets, required downstream updates, and unrelated text that should remain unchanged. We summarize cascade success with Edit-Ripple Adherence (ERA), the fraction of required downstream updates correctly revised, and validate the metric with adversarial probes and stress-test variants. On the hardest cases, where dependent claims use implicit or free-form wording rather than repeating the edited value, five LLM editing systems span ERA 0.148-0.705. Even the strongest misses roughly 30% of required cascade updates, and its advantage over the other systems persists in a mixed evaluation that includes easy cases solvable by deterministic substitution. EditPropBench shows that current LLM editors can repair many implicit consequences of factual edits, but reliable scientific revision still requires cascade-aware checking.
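As a rough illustration of the ERA metric described above (a minimal sketch; the class and field names here are hypothetical and not taken from the EditPropBench release), the score can be computed from sentence-level labels as the fraction of required downstream updates that were correctly revised:

```python
# Minimal sketch of Edit-Ripple Adherence (ERA).
# Labels and field names are hypothetical, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class Sentence:
    label: str        # "direct_target", "required_update", or "unrelated"
    revised_ok: bool  # True if the editor's output correctly handled this sentence

def edit_ripple_adherence(sentences: list[Sentence]) -> float:
    """Fraction of required downstream updates that were correctly revised."""
    required = [s for s in sentences if s.label == "required_update"]
    if not required:
        return 1.0  # assumption: an item with no required updates counts as full adherence
    return sum(s.revised_ok for s in required) / len(required)

# Example: 2 of 3 required downstream updates were correctly revised -> ERA ~= 0.667
example = [
    Sentence("direct_target", True),
    Sentence("required_update", True),
    Sentence("required_update", True),
    Sentence("required_update", False),
    Sentence("unrelated", True),
]
print(edit_ripple_adherence(example))
```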
EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts
1️⃣ One-Sentence Summary
This paper introduces an evaluation benchmark named EditPropBench for measuring whether a large language model (LLM), after a factual detail in a scientific manuscript is edited (such as a number or a scale description), can automatically and coherently update all statements in the manuscript that depend on that detail (for example, when a dataset changes from 215 to 80 documents, whether 'medium-scale' is revised to 'small-scale'); experiments find that even the most advanced LLM editors miss roughly 30% of the required cascade updates.