arXiv submission date: 2025-12-18
📄 Abstract - JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.

Top-level tags: llm model training reinforcement learning
Detailed tags: rlhf scaling laws reasoning mathematical benchmarks minimal training

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe


1️⃣ One-sentence summary

This paper proposes JustRL, a minimal reinforcement learning approach that uses only single-stage training with fixed hyperparameters, yet achieves state-of-the-art mathematical reasoning performance on two 1.5B-parameter models while halving the compute, challenging the prevailing view that complex training pipelines are needed for strong results.
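
To make the contrast concrete, below is a minimal, hypothetical sketch (not the authors' released code) of what "single-stage training with fixed hyperparameters" means operationally, compared with the scheduled, multi-stage setups the abstract argues are unnecessary. All names and numeric values (`RLConfig`, `fixed_lr`, `staged_lr`, the learning rates) are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

# Hedged illustration only: contrasts a single, fixed hyperparameter setting
# (JustRL-style) with a typical multi-stage schedule. Values are placeholders.

@dataclass(frozen=True)
class RLConfig:
    learning_rate: float = 1e-6    # assumed value; held constant for the whole run
    kl_coef: float = 0.0           # no KL-coefficient schedule
    rollouts_per_prompt: int = 8   # fixed sampling budget per prompt
    length_penalty: float = 0.0    # abstract reports explicit length penalties can hurt

def fixed_lr(step: int, cfg: RLConfig) -> float:
    """Single-stage setup: the same hyperparameters at every step."""
    return cfg.learning_rate

def staged_lr(step: int, cfg: RLConfig) -> float:
    """A typical multi-stage schedule, shown only for contrast (illustrative)."""
    if step < 1000:
        return cfg.learning_rate
    elif step < 3000:
        return cfg.learning_rate * 0.5
    return cfg.learning_rate * 0.1

if __name__ == "__main__":
    cfg = RLConfig()
    for step in (0, 1000, 3000, 4000):
        print(f"step {step}: fixed={fixed_lr(step, cfg):.1e}, staged={staged_lr(step, cfg):.1e}")
```

The point of the sketch is simply that the JustRL recipe, as described in the abstract, keeps one configuration for 4,000+ steps with no phase switching, curriculum, or hyperparameter schedule.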


Source: arXiv:2512.16649