arXiv submission date: 2025-12-15
📄 Abstract - Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
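The core idea in the abstract can be sketched as a training loop: instead of one RL stage over a blended mix of heterogeneous prompts, Cascade RL runs sequential, domain-wise stages (alignment RLHF first, then per-domain RLVR). A minimal illustration, where all names (`run_rl_stage`, the stage labels, the dict-based "policy") are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal sketch of cascaded domain-wise RL (Cascade RL).
# All identifiers here are illustrative; the actual training stack is not shown.

def run_rl_stage(policy, prompts, label):
    # Stand-in for one full RL stage (e.g., RLHF or a domain-wise RLVR run).
    # Here we only record the stage order on the "policy" for illustration.
    policy["stages"].append(label)
    return policy

def cascade_rl(policy, domain_stages):
    # Conventional pipelines blend prompts from all domains into one RL run,
    # mixing very different response lengths and verification latencies.
    # Cascade RL instead runs one homogeneous stage per domain, in sequence.
    for domain, prompts in domain_stages:
        policy = run_rl_stage(policy, prompts, domain)
    return policy

policy = {"stages": []}
# Per the abstract: alignment RLHF as a pre-step, then domain-wise RLVR stages.
stages = [
    ("rlhf_alignment", ["alignment prompts"]),
    ("rlvr_math", ["math prompts"]),
    ("rlvr_code", ["code prompts"]),
]
policy = cascade_rl(policy, stages)
print(policy["stages"])  # stages are applied in order, one domain at a time
```

The sequencing is the point: each stage sees a homogeneous prompt distribution, which is what the authors credit for the reduced engineering complexity and for later stages rarely degrading earlier-domain performance.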

Top-level tags: llm, model training, agents
Detailed tags: reinforcement learning, reasoning models, cascaded rl, alignment, benchmark evaluation

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models


1️⃣ One-sentence summary

This paper proposes a method called cascaded reinforcement learning, which trains the model domain by domain in sequential stages. This sidesteps the infrastructure complexity and training inefficiency that plague general-purpose reasoning models trained on blended domains, and the resulting model surpasses existing state-of-the-art models on a range of coding and reasoning benchmarks.


Source: arXiv:2512.13607