arXiv submission date: 2025-12-23
📄 Abstract - Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
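To make the architecture in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of the core idea: a higher-order "internal controller" that writes a compressed latent code into the residual stream of a frozen base autoregressive model and carries its own learned termination head. All class names, dimensions, and parameter names are illustrative assumptions.

```python
# Minimal sketch (assumed names/shapes, not the paper's implementation):
# a higher-order controller steers the residual stream of a frozen base
# autoregressive model and predicts when its behavior chunk should end.
import torch
import torch.nn as nn


class TinyAutoregressiveModel(nn.Module):
    """Stand-in for a pretrained base model with an accessible residual stream."""

    def __init__(self, vocab_size=32, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, residual_offset=None):
        h = self.embed(tokens)                 # residual stream activations
        if residual_offset is not None:
            h = h + residual_offset            # controller steers the stream
        h = self.block(h)
        return self.head(h)                    # next-token (action) logits


class InternalController(nn.Module):
    """Higher-order model: a compressed "option" code that (a) produces a
    residual-stream offset unfolding a multi-step behavior and (b) predicts
    a termination probability for that behavior."""

    def __init__(self, d_model=64, d_code=16):
        super().__init__()
        self.code = nn.Parameter(torch.randn(d_code))    # compressed controller code
        self.to_offset = nn.Linear(d_code, d_model)
        self.term_head = nn.Linear(d_model, 1)           # learned termination condition

    def forward(self, hidden_summary):
        offset = self.to_offset(self.code)               # broadcast over time steps
        p_terminate = torch.sigmoid(self.term_head(hidden_summary))
        return offset, p_terminate


# "Internal RL" at a high level: pick a controller, let it steer the frozen base
# model until its termination head fires, then reinforce the controller's
# parameters (not the base model's token-level policy) with the task reward.
base = TinyAutoregressiveModel()
controller = InternalController()
tokens = torch.randint(0, 32, (1, 5))
offset, p_term = controller(base.embed(tokens).mean(dim=1))
logits = base(tokens, residual_offset=offset)
print(logits.shape, p_term.shape)  # torch.Size([1, 5, 32]) torch.Size([1, 1])
```

In this reading, exploration happens by composing a few controller codes over long horizons rather than sampling thousands of individual tokens, which is why sparse rewards become tractable; the reward only needs to credit a handful of controller choices.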

Top-level tags: reinforcement learning, agents, model training
Detailed tags: hierarchical RL, autoregressive models, temporal abstraction, latent actions, internal reinforcement learning

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning


1️⃣ One-sentence summary

This paper introduces a method called "internal RL": rather than reinforcing single token-level actions, the model learns and executes behaviorally meaningful sequences of actions directly within its internal representations. This addresses the inefficiency of standard RL finetuning under sparse rewards and lets large pretrained models solve complex hierarchical tasks far more efficiently.

Source: arXiv: 2512.20605