arXiv submission date: 2026-05-04
📄 Abstract - Controllable and Verifiable Process Data Synthesis for Process Reward Models

Process reward models (PRMs) rely on high-quality process supervision data, yet existing construction methods often provide limited control over error location, error type, and trajectory consistency. We propose a controllable and verifiable framework for synthesizing process supervision data for PRMs. Our framework first constructs a correct symbolic reasoning chain, injects a template-aware error into an intermediate step, recomputes subsequent steps under the corrupted state, and verifies that the injected step is not derivable from its prefix. The resulting paired trajectories are prefix-invalid at the first error while remaining trajectory-consistent after symbolic recomputation, and are translated into aligned natural-language processes for PRM training and evaluation. Experiments show that the synthesized data improve Best-of-8 reranking on logical reasoning benchmarks and transfer to mathematical reasoning. Step-level evaluation further shows that first-error localization remains substantially more challenging than overall step classification, highlighting the need for fine-grained and verifiable process supervision.
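The pipeline in the abstract (build a correct chain, inject an error at an intermediate step, recompute the suffix under the corrupted state, then verify the injected step is not derivable from its prefix) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the integer state, the `ops` list, and the `corrupt` function are all hypothetical stand-ins for the paper's symbolic reasoning steps and template-aware errors.

```python
def build_chain(x0, ops):
    """Build a correct chain: each step applies one op to the previous state."""
    states = [x0]
    for f in ops:
        states.append(f(states[-1]))
    return states

def inject_and_recompute(states, ops, k, corrupt):
    """Corrupt the state at step k, then recompute every later step from it."""
    bad = states[:k] + [corrupt(states[k])]
    for f in ops[k:]:
        bad.append(f(bad[-1]))
    return bad

# Toy instance: arithmetic ops over an integer state (hypothetical).
ops = [lambda x: x + 3, lambda x: x * 2, lambda x: x - 5, lambda x: x * 4]
good = build_chain(7, ops)                          # correct trajectory
k = 2                                               # inject the error at step 2
bad = inject_and_recompute(good, ops, k, lambda x: x + 1)

# Verification: step k is prefix-invalid (not derivable from the shared prefix)...
assert bad[k] != ops[k - 1](bad[k - 1])
# ...while every subsequent step stays consistent with the corrupted state.
assert all(bad[i + 1] == ops[i](bad[i]) for i in range(k, len(ops)))
```

The two assertions mirror the paper's two guarantees: the pair of trajectories shares a prefix up to the first error, and the suffix remains trajectory-consistent because it is recomputed rather than left stale.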

Top-level tags: llm, model training, model evaluation
Detailed tags: process reward models, data synthesis, reasoning supervision

Controllable and Verifiable Process Data Synthesis for Process Reward Models


1️⃣ One-sentence summary

This paper proposes a new method for automatically generating high-quality training data (process supervision data) that helps AI models better judge whether each step of a reasoning process is correct, thereby improving performance on logical and mathematical reasoning tasks.

Source: arXiv 2605.02395