arXiv submission date: 2026-04-22
📄 Abstract - Scaling Self-Play with Self-Guidance

LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because, over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and by how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior works and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B-parameter model, after 200 rounds of self-play, to solve more problems than a 671B-parameter model at pass@4.
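The abstract describes the round structure (Conjecturer proposes, Guide filters, Solver attempts) but not its implementation. Below is a minimal, runnable sketch of one SGS-style round, assuming a Guide that returns a scalar score combining relevance and naturalness; every function name, the `keep_threshold`, and the random stand-ins for LLM calls are illustrative assumptions, not the paper's actual interface.

```python
"""Toy sketch of one Self-Guided Self-Play (SGS) round.

All role functions are hypothetical stand-ins for prompting one LLM in
three roles (Solver, Conjecturer, Guide); names, thresholds, and the
0-to-1 Guide score are illustrative assumptions, not the paper's API.
"""
import random

random.seed(0)

def conjecturer(n_problems):
    # Stand-in: the real Conjecturer would generate Lean4 statements.
    return [f"synthetic_problem_{i}" for i in range(n_problems)]

def guide_score(problem, unsolved_targets):
    # Stand-in: the Guide scores a problem by (a) relevance to unsolved
    # target problems and (b) how clean/natural it is. Both judgments
    # are faked with random numbers here; a real Guide is an LLM call.
    relevance = random.random()
    naturalness = random.random()
    return 0.5 * relevance + 0.5 * naturalness

def solver(problem):
    # Stand-in: the Solver attempts a proof; success is random here.
    return random.random() < 0.3

def sgs_round(unsolved_targets, n_problems=16, keep_threshold=0.5):
    """Generate problems, filter with the Guide, collect Solver rollouts."""
    candidates = conjecturer(n_problems)
    # Guide supervision: drop low-scoring problems so the Conjecturer
    # cannot collapse to artificially complex, useless conjectures.
    kept = [p for p in candidates
            if guide_score(p, unsolved_targets) >= keep_threshold]
    results = [(p, solver(p)) for p in kept]
    # In the full algorithm, both roles would now be updated from these
    # rollouts via RL (update step omitted in this sketch).
    return results

if __name__ == "__main__":
    outcomes = sgs_round(unsolved_targets=["target_theorem_1"])
    solved = sum(ok for _, ok in outcomes)
    print(f"kept {len(outcomes)} problems, solver succeeded on {solved}")
```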

Top-level tags: llm reinforcement learning
Detailed tags: self-play scaling theorem proving reward hacking language model

Scaling Self-Play with Self-Guidance


1️⃣ One-Sentence Summary

This paper proposes Self-Guided Self-Play (SGS), a new algorithm in which a language model plays three roles during self-play (Solver, Conjecturer, and Guide); the Guide screens the generated problems for quality and value, preventing the Conjecturer from producing meaningless hard problems, so the model keeps improving over much longer training runs and performs strongly on mathematical theorem proving.
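The abstract evaluates scaling by fitting laws to cumulative solve rate curves and comparing asymptotic solve rates across methods. A hedged sketch of such a fit follows, assuming an exponential-saturation functional form and synthetic data; both the form and the numbers are my assumptions, not the paper's.

```python
"""Sketch: fitting a saturating scaling law to a cumulative solve rate.

The exponential-saturation form and all constants below are illustrative
assumptions; the paper's actual functional form and data are not shown here.
"""
import numpy as np
from scipy.optimize import curve_fit

def saturating_law(t, asymptote, rate):
    # Cumulative solve rate rising toward an asymptote as self-play
    # rounds t increase; the asymptote is what gets compared across methods.
    return asymptote * (1.0 - np.exp(-rate * t))

rounds = np.arange(1, 201)                      # 200 rounds of self-play
fake_rate = saturating_law(rounds, 0.62, 0.02)  # synthetic curve
fake_rate += np.random.default_rng(0).normal(0, 0.01, rounds.size)

(asymptote, rate), _ = curve_fit(saturating_law, rounds, fake_rate,
                                 p0=[0.5, 0.01])
print(f"fitted asymptotic solve rate: {asymptote:.3f}")
```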

Source: arXiv 2604.20209