Abstract - OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries; each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, HTTP 500 responses) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance: GPT-5.2 gains 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, making simulator quality critical for the reliability of LWM-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
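The abstract's fault taxonomy (explicit errors with overt signals, implicit data degradation with none, and mixed faults) can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, response schema, and degradation strategy below are all hypothetical, chosen only to show why implicit faults demand independent detection by the agent.

```python
import random

def inject_fault(response: dict, mode: str, rng: random.Random) -> dict:
    """Hypothetical fault injector for a simulated tool response.

    "explicit": overt failure (timeout-style status) the agent can read
    directly off the response; "implicit": silent degradation (dropped
    field, truncated string) behind a healthy status 200, so the agent
    must notice the damage itself; "mixed": one of the two at random.
    """
    if mode == "explicit":
        # Overt error signal: non-2xx status plus an error message.
        return {"status": rng.choice([500, 504]), "error": "upstream failure"}
    if mode == "implicit":
        degraded = dict(response)
        if degraded:
            # Silently drop one field...
            degraded.pop(rng.choice(sorted(degraded)))
        # ...and truncate the remaining string values.
        degraded = {k: (v[: len(v) // 2] if isinstance(v, str) else v)
                    for k, v in degraded.items()}
        return {"status": 200, "data": degraded}
    # "mixed": sample one of the two fault types per call.
    return inject_fault(response, rng.choice(["explicit", "implicit"]), rng)
```

Under this sketch, an agent screening only for non-200 statuses would catch every explicit fault yet accept every implicit one, which mirrors the paper's finding that implicit faults are the harder case.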
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
1️⃣ One-Sentence Summary
This paper introduces OccuBench, a benchmark that uses Language World Models to simulate professional environments and provides the first systematic evaluation of AI agents on 100 real-world task scenarios spanning 10 industries and 65 specialized domains; it finds that different models excel in different industries and that handling implicit data faults is more challenging than handling explicit errors.