arXiv submission date: 2026-04-27
📄 Abstract - TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails. Our code is available at this https URL.
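The abstract describes the curriculum only at a high level: cap the trajectory depth the student is exposed to, then widen that cap from short to long over training. A minimal sketch of one way this could look follows. The linear schedule, the function names `max_turns_at_step` and `truncate_rollout`, and the default depth bounds are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, TypeVar

T = TypeVar("T")


def max_turns_at_step(step: int, total_steps: int,
                      min_turns: int = 2, max_turns: int = 30) -> int:
    """Hypothetical linear curriculum: allowed trajectory depth at a training step.

    The depth grows from min_turns at step 0 to max_turns at the final step.
    The linear shape and the default bounds are assumptions for illustration.
    """
    if total_steps <= 1:
        return max_turns
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return min_turns + int(frac * (max_turns - min_turns))


def truncate_rollout(turns: List[T], depth: int) -> List[T]:
    """Keep only the first `depth` turns of a student rollout.

    Only the kept turns would be scored against the teacher, so early in
    training the student is not supervised on states deep in a trajectory,
    where compounded errors are most likely.
    """
    return turns[:depth]


if __name__ == "__main__":
    rollout = [f"turn_{i}" for i in range(12)]  # stand-in for a 12-turn episode
    for step in (0, 50, 99):
        depth = max_turns_at_step(step, total_steps=100)
        print(step, depth, len(truncate_rollout(rollout, depth)))
```

In a real training loop, the truncated turns would feed the per-token KL loss between student and teacher. That loss is omitted here because the abstract does not specify its exact form.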

Top-level tags: agents, reinforcement learning, llm
Detailed tags: on-policy distillation, curriculum learning, multi-turn agents, kl divergence, benchmark

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents


1️⃣ One-sentence summary

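This paper proposes TCOD, a simple and effective method that uses a progressive, curriculum-style training strategy so that a small model imitating a large teacher model on multi-step tasks (such as operating in virtual environments and shopping online) avoids the training instability caused by accumulated errors, significantly improving its success rate and even surpassing the teacher model on some tasks.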

Source: arXiv: 2604.24005