Abstract - DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
1️⃣ One-Sentence Summary
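This paper introduces DeepPlanning, a new benchmark that tests state-of-the-art AI agents on realistic long-horizon planning. Its multi-day travel and multi-product shopping tasks require agents to gather information proactively, satisfy fine-grained local constraints, and optimize globally; the results expose the agents' shortcomings in this kind of complex planning and point to directions for improvement.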