arXiv submission date: 2026-06-19
📄 Abstract - Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Translating natural-language planning intent into verified plans is a longstanding challenge: people communicate goals in language, while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier-checked refinement loop, but the orchestrator at the centre is itself a prompted frontier LLM, paying a frontier-LLM API call at every refinement step. We present HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans, across 11 PDDL domains. HALO pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections, and operates over an expanded 21-agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions, directly usable as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini prompted baseline on success rate, sits within three percentage points of the stronger Gemini-3-Flash prompted baseline, reduces orchestration cost by more than an order of magnitude ($0.18 to $0.004 per task against GPT-5-mini, roughly 45× cheaper; roughly 15× cheaper than Gemini-3-Flash), and cuts total LLM calls per episode by 40 to 50 percent.
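The key observation in the abstract, that every verifier-accepted trajectory yields correct (state, agent) decisions, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Step` and `Trajectory` types, the field names, the agent names, and the state strings are all hypothetical, chosen only to show how accepted trajectories might be flattened into supervised training pairs while rejected ones are dropped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    state: str  # hypothetical text summary of the refinement state
    agent: str  # hypothetical name of the repair agent chosen at this step


@dataclass
class Trajectory:
    steps: List[Step]
    verified: bool  # True if an external verifier accepted the final plan


def build_supervised_pairs(trajectories: List[Trajectory]) -> List[Tuple[str, str]]:
    """Keep only verifier-accepted trajectories and flatten them into
    (state, agent) pairs usable as supervised examples."""
    pairs: List[Tuple[str, str]] = []
    for traj in trajectories:
        if not traj.verified:
            continue  # rejected trajectories contribute no supervision
        for step in traj.steps:
            pairs.append((step.state, step.agent))
    return pairs


# Illustrative data only; the agent names below are made up.
demo_trajectories = [
    Trajectory(
        steps=[
            Step("syntax error in domain file", "syntax_fixer"),
            Step("goal unreachable", "goal_repairer"),
        ],
        verified=True,
    ),
    Trajectory(
        steps=[Step("syntax error in domain file", "goal_repairer")],
        verified=False,
    ),
]

demo_pairs = build_supervised_pairs(demo_trajectories)
```

Under these assumptions, only the two steps of the accepted trajectory survive as training examples; the step from the rejected trajectory is discarded rather than used as a negative example.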

Top-level tags: llm agents natural language processing
Detailed tags: pddl planning orchestrator supervised learning cost reduction benchmark

Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents


1️⃣ One-sentence summary

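This paper proposes HALO, a method that uses verifier-certified trajectories of correct decisions as the supervision signal to train a small language model as the orchestrator, replacing an expensive frontier LLM in coordinating multiple specialized repair agents, so that planning success rate is maintained or even improved while orchestration cost drops roughly 15 to 45 times, offering a practical route to efficient and reliable end-to-end formal planning.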

Source: arXiv: 2606.21740