Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents
1️⃣ One-sentence summary
This study finds that, when training LLM agents, the state information richness and planning complexity of the RL training environment are the key factors shaping cross-domain generalization, and that adding a small amount of goal-irrelevant distractor information to the state can effectively improve generalization robustness.
Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information for the agent to process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, the simple grid-world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can already effectively improve cross-domain robustness. We propose a randomization technique, which is low-overhead and broadly applicable: add small amounts of distractive goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid-training datamix; and (b) turning on step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
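Below is a minimal sketch of how the state-randomization idea could be applied to a text-based RL environment: a wrapper appends a few goal-irrelevant distractor sentences to each observation, enriching the state without changing the task, reward, or action space. The wrapper name, distractor pool, and injection count are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical pool of goal-irrelevant distractor sentences (illustrative only).
DISTRACTORS = [
    "A dusty lamp sits in the corner.",
    "You hear a faint humming sound outside.",
    "An unrelated note is pinned to the wall.",
]

class StateRandomizationWrapper:
    """Wraps a text-based environment and appends goal-irrelevant
    distractor features to each observation, making the state richer
    without altering the underlying task."""

    def __init__(self, env, num_distractors=2, seed=0):
        self.env = env
        self.num_distractors = num_distractors
        self.rng = random.Random(seed)

    def _augment(self, observation: str) -> str:
        k = min(self.num_distractors, len(DISTRACTORS))
        extras = self.rng.sample(DISTRACTORS, k=k)
        return observation + "\n" + "\n".join(extras)

    def reset(self):
        # Augment the initial observation returned by the wrapped environment.
        return self._augment(self.env.reset())

    def step(self, action):
        # Augment each subsequent observation; reward and termination are untouched.
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```

Because only the observation text is modified, such a wrapper is low-overhead and can, in principle, be layered on any text-based environment used for agentic RL post-training.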
Source: arXiv: 2601.18217