arXiv submission date: 2026-03-30
📄 Abstract - EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces

Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
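
The extraction idea in the abstract (pair commented-out LaTeX text with nearby final text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `window` of neighbouring blocks, the `difflib` similarity measure, and the `min_ratio`/`max_ratio` thresholds are all assumptions made for illustration, and the LLM-based filtering step is omitted entirely.

```python
import re
from difflib import SequenceMatcher

# A line counts as "commented out" if its first non-space character is '%'.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group consecutive lines into ('comment' | 'text', paragraph) blocks.

    Blank lines and switches between commented and uncommented lines
    both end the current block.
    """
    blocks = []
    kind, buf = None, []
    for line in latex_source.splitlines():
        match = COMMENT_LINE.match(line)
        if not line.strip():
            line_kind = None
        elif match:
            line_kind = "comment"
        else:
            line_kind = "text"
        if line_kind != kind and buf:
            blocks.append((kind, " ".join(buf)))
            buf = []
        kind = line_kind
        if line_kind == "comment":
            content = match.group(1).strip()
            if content:
                buf.append(content)
        elif line_kind == "text":
            buf.append(line.strip())
    if buf:
        blocks.append((kind, " ".join(buf)))
    return blocks


def candidate_pairs(latex_source, window=2, min_ratio=0.3, max_ratio=0.95):
    """Pair each commented block with its most similar nearby text block.

    Hypothetical thresholds: pairs below min_ratio are treated as unrelated
    (e.g. TODO notes), pairs above max_ratio as near-duplicates.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, old) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        for j in range(lo, hi):
            other_kind, new = blocks[j]
            if other_kind != "text":
                continue
            ratio = SequenceMatcher(None, old, new).ratio()
            if min_ratio <= ratio <= max_ratio:
                if best is None or ratio > best[2]:
                    best = (old, new, ratio)
        if best:
            pairs.append(best)
    return pairs
```

Calling `candidate_pairs(source)` on the text of a `.tex` file returns `(old, new, ratio)` tuples. In a pipeline like the one described, such candidates would then go through a separate validation step to discard commented-out text that is not a genuine revision.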

Top-level tags: llm, natural language processing, data
Detailed tags: scientific writing, revision dataset, text revision, latex traces, llm evaluation

EarlySciRev: A Dataset of Early-Stage Scientific Revisions Extracted from LaTeX Writing Traces


1️⃣ One-sentence summary

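This paper introduces EarlySciRev, a new dataset that automatically extracts a large number of authentic early-stage writing revisions by analyzing the old text that authors commented out in the LaTeX source files of academic papers, providing a valuable resource for studying the writing process and developing AI-assisted writing tools.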

Source: arXiv: 2603.28515