arXiv submission date: 2026-01-23
📄 Abstract - Endless Terminals: Scaling RL Environments for Terminal Agents

Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
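The abstract does not spell out the interaction loop, but its shape is implied: the policy emits shell commands, the environment returns raw terminal output, and an auto-generated completion test yields a single binary reward at episode end. The sketch below illustrates that structure under stated assumptions; `run_command`, `completion_test`, and the scripted policy are hypothetical stand-ins, not the paper's code, and a real rollout would execute inside the generated container rather than the local shell.

```python
# Minimal sketch of the interaction loop described in the abstract: the
# policy proposes shell commands, the environment executes them, and a
# completion test produces one binary episode-level reward for PPO.
# All names here are illustrative assumptions, not the paper's actual code.
import subprocess

MAX_TURNS = 8  # cap on agent-environment exchanges per episode


def run_command(cmd: str, timeout: int = 10) -> str:
    """Execute one shell command and return combined stdout/stderr."""
    try:
        result = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "<timeout>"


def completion_test() -> bool:
    """Hypothetical auto-generated test: check that a target file exists."""
    return run_command("test -f /tmp/report.txt && echo ok").strip() == "ok"


def rollout(policy) -> float:
    """Run one episode; return the binary episode-level reward."""
    observation = "You are in a shell. Create /tmp/report.txt."
    for _ in range(MAX_TURNS):
        command = policy(observation)       # model proposes next shell command
        observation = run_command(command)  # feedback is raw terminal output
        if completion_test():
            return 1.0  # success: reward 1 at episode end
    return 0.0          # failure: reward 0, no shaping or partial credit


if __name__ == "__main__":
    # Trivial scripted "policy" standing in for the trained LLM.
    scripted = iter(["echo hello > /tmp/report.txt"])
    print("reward:", rollout(lambda obs: next(scripted)))
```

A binary end-of-episode reward like this avoids hand-designed reward shaping entirely, which matches the paper's claim that the setup stays simple and the scale of the environments does the work.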

Top tags: agents, reinforcement learning, benchmark
Detailed tags: terminal agents, environment generation, procedural generation, ppo, scalable training

Endless Terminals: Scaling RL Environments for Terminal Agents


1️⃣ One-Sentence Summary

This paper introduces an automated system called Endless Terminals that generates diverse task environments at scale for training terminal-use agents, so that even simple reinforcement learning methods substantially improve model performance on terminal tasks.

Source: arXiv: 2601.16443