EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces
1️⃣ One-sentence summary
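This paper introduces EarlySciRev, a new dataset that automatically extracts a large number of authentic early-stage writing revisions by analysing the old text that authors commented out in the LaTeX source files of academic papers, providing a valuable resource for studying the writing process and for developing AI-assisted writing tools.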
Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
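To make the core idea concrete, the following is a minimal sketch of how commented-out LaTeX text might be paired with nearby final text. It is not the authors' pipeline: the function names (`split_blocks`, `candidate_pairs`), the character-level similarity measure, the one-block search window, and the thresholds `min_sim` and `max_sim` are all assumptions chosen for illustration, and the LLM-based filtering step described in the abstract is not reproduced here.

```python
import re
from difflib import SequenceMatcher

# Assumption: a line is "commented out" if its first non-space character is '%'.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into (kind, text) blocks; kind is 'comment' or 'text'."""
    blocks = []
    kind, buf = None, []
    for line in latex_source.splitlines():
        m = COMMENT_LINE.match(line)
        if not line.strip():
            line_kind = None  # a blank line closes the current block
        elif m:
            line_kind = "comment"
        else:
            line_kind = "text"
        # A change of kind ends the block collected so far.
        if line_kind != kind and buf:
            blocks.append((kind, " ".join(buf)))
            buf = []
        kind = line_kind
        if line_kind == "comment":
            content = m.group(1).strip()
            if content:  # skip bare '%' lines
                buf.append(content)
        elif line_kind == "text":
            buf.append(line.strip())
    if buf:
        blocks.append((kind, " ".join(buf)))
    return blocks


def candidate_pairs(latex_source, min_sim=0.5, max_sim=0.95, window=1):
    """Pair each commented block with its most similar neighbouring text block.

    Hypothetical heuristic: pairs below min_sim are treated as unrelated
    (e.g. TODO notes), pairs above max_sim as near-identical copies.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, old) in enumerate(blocks):
        if kind != "comment":
            continue
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        best, best_score = None, 0.0
        for j in range(lo, hi):
            neighbour_kind, new = blocks[j]
            if neighbour_kind != "text":
                continue
            score = SequenceMatcher(None, old, new).ratio()
            if score > best_score:
                best, best_score = new, score
        if best is not None and min_sim <= best_score <= max_sim:
            pairs.append((old, best, round(best_score, 2)))
    return pairs


if __name__ == "__main__":
    sample = (
        "% Our method achieves good results on the benchmark.\n"
        "Our method achieves strong results on the benchmark and generalises well.\n"
        "\n"
        "% TODO: add figure here\n"
        "Results are reported in Table 1.\n"
    )
    for old, new, score in candidate_pairs(sample):
        print(score, "|", old, "->", new)
```

In this sketch the similarity band does the job of a crude filter: an editorial note such as a TODO shares little text with its neighbours and is dropped, while a reworded sentence is kept as a candidate pair. A real pipeline would still need a stronger validation step, which the abstract attributes to LLM-based filtering.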
Source: arXiv: 2603.28515