Peer-Predictive Self-Training for Language Model Reasoning
1️⃣ One-Sentence Summary
This paper proposes a self-training framework that requires no external supervision: multiple language models improve collaboratively by predicting each other's answers and aggregating them, which boosts accuracy on mathematical reasoning tasks and narrows the models' internal generator-verifier gap.
Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
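The PMI-scaled update described in the abstract can be sketched in a few lines. The sketch below is an illustrative assumption, not the paper's implementation: the sigmoid mapping from PMI to an update weight, the function names, and the toy probabilities are all hypothetical. It only demonstrates the stated principle that responses already aligned with the aggregate receive smaller updates, while misaligned ones receive larger updates.

```python
import math

def pmi(p_agg_given_resp: float, p_agg: float) -> float:
    """Pointwise mutual information between a peer response and the
    aggregated answer: PMI = log p(agg | resp) - log p(agg).
    High PMI means the response is highly informative about the aggregate."""
    return math.log(p_agg_given_resp) - math.log(p_agg)

def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Map PMI to a self-training update weight in (0, 1).
    Aligned responses (high PMI) get smaller updates; uninformative or
    misaligned responses (low PMI) get larger ones. The sigmoid mapping
    is an illustrative choice, not taken from the paper."""
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))

# Toy example: marginal probability of the aggregated answer, and three
# peer responses with different conditional probabilities p(agg | resp).
p_agg = 0.4
responses = {"aligned": 0.9, "neutral": 0.4, "misaligned": 0.1}
weights = {name: update_weight(pmi(p, p_agg)) for name, p in responses.items()}
```

With these toy numbers the weights order as expected: the misaligned response is updated most strongly and the aligned response least, with the PMI-zero neutral response at weight 0.5.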
Source: arXiv: 2604.13356