Reinforcement Learning Foundation Models Should Already Be A Thing
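1️⃣ One-Sentence Summary
This paper argues that, just as tabular prediction has successfully built foundation models from synthetic data, reinforcement learning can pretrain a general in-context learning model on synthetic Markov decision processes (MDPs), and it shows experimentally that such a model solves both online and offline tasks efficiently without fine-tuning.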
Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.
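The two points in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the prior (Dirichlet transitions, uniform rewards in [0, 1]), the uniformly random behavior policy, and the function names `sample_mdp`, `rollout`, and `sufficient_statistic` are all assumptions made for illustration. For a tabular MDP with deterministic rewards, one common choice of statistic is the transition counts plus per-pair reward sums, whose shapes depend only on the number of states and actions.

```python
import numpy as np

def sample_mdp(n_states, n_actions, rng, alpha=1.0):
    """Draw one synthetic tabular MDP from a hypothetical prior."""
    # P[s, a] is a probability vector over next states (Dirichlet draw).
    P = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    # R[s, a] is a deterministic reward, drawn uniformly from [0, 1).
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return P, R

def rollout(P, R, horizon, rng):
    """Collect one episode with a uniformly random policy (an assumption)."""
    n_states, n_actions = R.shape
    s = int(rng.integers(n_states))
    transitions = []
    for _ in range(horizon):
        a = int(rng.integers(n_actions))
        s_next = int(rng.choice(n_states, p=P[s, a]))
        transitions.append((s, a, float(R[s, a]), s_next))
        s = s_next
    return transitions

def sufficient_statistic(episodes, n_states, n_actions):
    """Summarize any number of episodes in fixed-size arrays."""
    counts = np.zeros((n_states, n_actions, n_states))  # N[s, a, s']
    reward_sum = np.zeros((n_states, n_actions))         # sum of r at (s, a)
    for episode in episodes:
        for s, a, r, s_next in episode:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
    return counts, reward_sum

rng = np.random.default_rng(0)
P, R = sample_mdp(n_states=4, n_actions=2, rng=rng)
few = [rollout(P, R, horizon=10, rng=rng) for _ in range(3)]
many = [rollout(P, R, horizon=10, rng=rng) for _ in range(50)]
counts_few, rsum_few = sufficient_statistic(few, 4, 2)
counts_many, rsum_many = sufficient_statistic(many, 4, 2)
print(counts_few.shape, counts_many.shape)
```

Both calls return arrays of the same shape regardless of how many episodes were observed. Under these assumptions, each state-action pair could be flattened into one row of a table, which is the kind of input that attention-based tabular models consume.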
Source: arXiv: 2606.18812