arXiv submission date: 2026-04-13
📄 Abstract - Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce **Contrastive Reasoning Path Synthesis (CRPS)**, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20× reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
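The paper does not publish its implementation, but the core contrastive step it describes — pairing a high-reward and a low-reward search trajectory and locating where they diverge — can be sketched minimally. Everything below (`Trajectory`, `contrast_pair`, the field names) is a hypothetical illustration, not the authors' code:

```python
# Hypothetical sketch of the contrastive-pair extraction step described
# in the abstract; all names and structures are assumptions.
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list    # reasoning steps explored by the search
    reward: float  # final reward assigned to the trajectory

def first_divergence(a: Trajectory, b: Trajectory) -> int:
    """Index of the first step at which two trajectories differ."""
    for i, (x, y) in enumerate(zip(a.steps, b.steps)):
        if x != y:
            return i
    return min(len(a.steps), len(b.steps))

def contrast_pair(trajectories: list) -> dict:
    """Pair the best and worst trajectory and record where they split.

    The divergence point marks a candidate 'strategic pivot'; the losing
    branch's step at that point is a candidate 'local failure mode' that
    synthesized reasoning chains should avoid.
    """
    ranked = sorted(trajectories, key=lambda t: t.reward, reverse=True)
    best, worst = ranked[0], ranked[-1]
    k = first_divergence(best, worst)
    return {
        "shared_prefix": best.steps[:k],
        "pivot": best.steps[k] if k < len(best.steps) else None,
        "failure_mode": worst.steps[k] if k < len(worst.steps) else None,
        "reward_gap": best.reward - worst.reward,
    }

trajs = [
    Trajectory(["read problem", "set up equation", "solve"], 1.0),
    Trajectory(["read problem", "guess answer", "check"], 0.1),
]
pair = contrast_pair(trajs)
print(pair["pivot"])         # step the winning branch took at the split
print(pair["failure_mode"])  # step the losing branch took instead
```

In the full method these contrast records would feed a reflective LLM prompt that synthesizes a new reasoning chain; this sketch covers only the trajectory-comparison signal the abstract highlights.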

Top tags: llm model training agents
Detailed tags: reasoning synthesis contrastive learning monte carlo tree search data efficiency automated reasoning

Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories


1️⃣ One-sentence summary

This paper proposes a new framework called CRPS, which contrastively analyzes the differences between successful and failed paths explored during AI search and automatically synthesizes high-quality reasoning training data, so that reasoning models with stronger generalization can be trained from a small fraction of the usual data.

Source: arXiv 2604.11365