DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution
1️⃣ One-Sentence Summary
This paper proposes DARC, a two-stage self-learning framework. It first trains a Questioner to generate questions with controllable difficulty, then has a teacher model with document access guide a student Solver that lacks document access. This design addresses the instability of self-play training for large language models, yielding significant performance gains across multiple reasoning tasks without any human-annotated data.
Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at this https URL.
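The two-stage pipeline described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical scaffolding: the real system trains LLMs with reinforcement learning and distillation losses, whereas here the Questioner, teacher, and student are stand-in functions and the "student" is a plain dictionary, just to make the asymmetric information flow (teacher sees the document, student does not) concrete.

```python
# Toy sketch of DARC's two stages (all names and data are illustrative,
# not from the paper's actual implementation).

def questioner(doc: dict, difficulty: int) -> str:
    """Stage 1: synthesize a difficulty-calibrated question from a corpus document."""
    return f"[difficulty={difficulty}] What does the document say about {doc['topic']}?"

def teacher_answer(question: str, doc: dict) -> str:
    """Document-augmented teacher: produces a pseudo-label WITH access to the source."""
    return doc["fact"]  # grounded in the document, so the label is high quality

def train_student(student_memory: dict, question: str, pseudo_label: str) -> dict:
    """Stage 2: asymmetric self-distillation — the student is supervised on the
    teacher's pseudo-label but never sees the document itself."""
    student_memory[question] = pseudo_label
    return student_memory

# A one-document "corpus" and an empty student.
corpus = [{"topic": "training stability", "fact": "decoupling stabilizes self-play"}]
student: dict = {}

# Curriculum over explicit difficulty levels, as in Stage 1 of the abstract.
for level in (1, 2, 3):
    for doc in corpus:
        q = questioner(doc, level)
        y = teacher_answer(q, doc)        # teacher uses the document
        student = train_student(student, q, y)  # student learns without it

# The student can now answer each generated question from memory alone,
# mirroring how the trained Solver must answer without document access.
```

The key point the sketch captures is the decoupling: question difficulty is set explicitly rather than inferred from the Solver's shifting reward signal, and the pseudo-labels come from a better-informed teacher rather than from the student's own bootstrapped outputs.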
Source: arXiv:2601.13761