arXiv submission date: 2025-12-11
📄 Abstract - Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations, which are prohibitively costly to obtain from human experts. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and to enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the cases on which the current best OPV is most uncertain are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out \textsc{\thisbench}, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
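The iterative loop the abstract describes (annotate the verifier's most uncertain cases, retrain, repeat) is a standard uncertainty-sampling active learning scheme. A minimal sketch of that selection loop is below; all function names are hypothetical, the confidence score is a deterministic toy stand-in for a trained verifier's output, and the RFT + RLVR training step from the paper is elided entirely.

```python
def toy_confidence(sample: str) -> float:
    """Deterministic stand-in for the verifier's confidence on a sample.

    The real OPV's scores come from a trained model; this hash-style
    score exists only so the loop below runs end to end.
    """
    return (sum(ord(c) for c in sample) % 97) / 96.0

def select_most_uncertain(pool: list[str], k: int) -> list[str]:
    # Uncertainty sampling: cases whose confidence is closest to 0.5
    # are the ones the current verifier is least sure about.
    return sorted(pool, key=lambda s: abs(toy_confidence(s) - 0.5))[:k]

def active_learning_rounds(pool: list[str], rounds: int, k: int) -> list[str]:
    """Each round: pick the k most uncertain unlabeled cases, send them
    for expert annotation, retrain, repeat. Training (RFT then RLVR in
    the paper) is omitted here."""
    annotated: list[str] = []
    for _ in range(rounds):
        remaining = [s for s in pool if s not in annotated]
        batch = select_most_uncertain(remaining, k)
        annotated.extend(batch)  # expert annotation happens here
        # ... retrain the verifier on `annotated` (omitted) ...
    return annotated
```

The key design point the abstract emphasizes is that expert budget is spent only on the cases the current verifier finds hardest, so each round's annotations are maximally informative for the next round's training.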

Top tags: llm model evaluation agents
Detailed tags: verification reasoning active learning mathematical problem solving rejection fine-tuning

Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving


1️⃣ One-sentence summary

This paper proposes a new verifier called OPV, which verifies the solution steps of complex mathematical problems accurately and efficiently by checking the reasoning process behind the summarized outcomes of long reasoning chains. Using an active learning framework, it improves verification capability at low annotation cost, thereby significantly boosting the performance of large language models on Olympiad-level mathematical problems.


Source: arXiv 2512.10739