arXiv submission date: 2026-01-17
📄 Abstract - Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at this https URL.
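The abstract describes each task as pairing a unique environment with a human-written solution and comprehensive verification tests. As a purely illustrative sketch of that verify-by-tests pattern, not the published harness (the names `Task`, `run_agent`, and `verify` below are hypothetical), here is a minimal Python loop that lets a stubbed agent act in a sandbox directory and then grades the outcome with a shell test:

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory

@dataclass
class Task:
    """Hypothetical stand-in for a benchmark task: an instruction the
    agent must complete in a shell, plus a test command that exits 0
    iff the task was solved."""
    instruction: str
    test_command: str

def run_agent(instruction: str, workdir: Path) -> None:
    """Stub agent: hard-codes the shell commands a real agent would
    choose after reading the instruction."""
    subprocess.run(["sh", "-c", "echo hello > greeting.txt"],
                   cwd=workdir, check=True)

def verify(task: Task, workdir: Path) -> bool:
    """Run the task's tests in the same environment; pass iff exit 0."""
    result = subprocess.run(["sh", "-c", task.test_command], cwd=workdir)
    return result.returncode == 0

if __name__ == "__main__":
    task = Task(
        instruction="Create greeting.txt containing the word 'hello'.",
        test_command="grep -q hello greeting.txt",
    )
    with TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        run_agent(task.instruction, workdir)
        print("pass" if verify(task, workdir) else "fail")
```

The key design point, mirrored from the abstract, is that grading is programmatic: success is defined by tests over the final environment state rather than by judging the agent's transcript.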

Top tags: agents, benchmark, model evaluation
Detailed tags: command line interface, agent evaluation, real-world tasks, long-horizon tasks, terminal environments

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces


1️⃣ One-Sentence Summary

This paper presents Terminal-Bench 2.0, a hard benchmark of 89 command-line tasks drawn from real workflows, built to evaluate AI agents in complex, realistic scenarios; current frontier models score below 65% on it, and an error analysis identifies directions for model and agent improvement.

Source: arXiv 2601.11868