Abstract - RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at this https URL.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
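1️⃣ One-Sentence Summary
This paper finds that most current datasets for reinforcement learning from verifiable rewards derive from a small number of shared original sources and suffer from data contamination. It therefore proposes ATLAS, a lineage-tracing framework, and DAPO++, a new high-quality dataset, which trace each sample back to its original source to assess its value and thereby select cleaner, more effective training data.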
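The abstract describes SCA only at a high level: per-atomic-source RL checkpoints are compared against a shared base model. As a rough illustration of that comparison, and not the paper's actual formulation, the sketch below assumes each model has been scored with 0/1 outcomes on the same held-out problems and reports each source's accuracy gain over the base model. The function name `source_gains`, the source names, and the toy outcomes are all hypothetical.

```python
from statistics import mean


def source_gains(base_correct, source_correct):
    """Per-source accuracy gain over a shared base model (illustrative only).

    base_correct:   list of 0/1 outcomes for the base model on held-out problems.
    source_correct: dict mapping an atomic-source name to the 0/1 outcomes of the
                    checkpoint trained on that source alone (same problems, same order).
    """
    for name, outcomes in source_correct.items():
        if len(outcomes) != len(base_correct):
            raise ValueError(f"outcome count mismatch for source {name!r}")
    base_acc = mean(base_correct)
    # Assumption: "marginal utility" is approximated here as a plain accuracy delta.
    return {name: mean(outcomes) - base_acc for name, outcomes in source_correct.items()}


# Toy data, invented for this example.
base = [1, 0, 0, 1]
checkpoints = {
    "source_a": [1, 1, 0, 1],  # solves one more problem than the base model
    "source_b": [1, 0, 0, 0],  # solves one fewer problem than the base model
}
gains = source_gains(base, checkpoints)
print(gains)
```

A positive gain suggests a source that helps on the held-out set, and a negative one a source that hurts. How the paper turns such signals into sample-level utility or into the score Q is not specified in the abstract.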