Agent-as-a-Judge
1️⃣ One-Sentence Summary
This paper systematically surveys the paradigm shift in AI evaluation from "LLM-as-a-Judge" to "Agent-as-a-Judge", arguing that the latter, through planning, tool-based verification, and multi-agent collaboration, enables more reliable and verifiable evaluation of complex tasks, and it establishes the first comprehensive developmental framework and research roadmap for this field.
LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
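As a rough illustration of the loop the abstract describes, the sketch below shows a hypothetical agentic judge that plans verification checks, runs tool-based checks, and keeps a persistent memory of observations. All names here (AgentJudge, plan_checks, run_tool, Verdict) are invented for illustration and are not an API from the paper or any surveyed system.

```python
# Minimal sketch of an agent-as-a-judge evaluation loop, following the abstract's
# characterization (planning, tool-augmented verification, persistent memory).
# Everything below is a hypothetical illustration, not the paper's method.
from dataclasses import dataclass, field


@dataclass
class Verdict:
    score: float         # aggregate judgment in [0, 1]
    evidence: list[str]  # tool observations supporting the score


@dataclass
class AgentJudge:
    memory: list[str] = field(default_factory=list)  # persists across evaluations

    def plan_checks(self, task: str, candidate: str) -> list[str]:
        # Planning step: decompose the evaluation into verifiable checks.
        return [
            f"check that the output is non-empty: {candidate!r}",
            f"check that the output addresses the task: {task!r}",
        ]

    def run_tool(self, check: str, candidate: str) -> tuple[bool, str]:
        # Tool-augmented verification stub: a real judge might execute code,
        # query a retriever, or call a domain-specific validator here.
        passed = len(candidate.strip()) > 0
        return passed, f"{check} -> {'pass' if passed else 'fail'}"

    def evaluate(self, task: str, candidate: str) -> Verdict:
        evidence: list[str] = []
        passed = 0
        checks = self.plan_checks(task, candidate)
        for check in checks:
            ok, observation = self.run_tool(check, candidate)
            evidence.append(observation)
            self.memory.append(observation)  # persistent memory of observations
            passed += int(ok)
        return Verdict(score=passed / len(checks), evidence=evidence)


if __name__ == "__main__":
    judge = AgentJudge()
    verdict = judge.evaluate(
        "Summarize the paper",
        "Agent-as-a-Judge surveys the shift toward agentic evaluation.",
    )
    print(verdict)
```

The key contrast with single-pass LLM-as-a-Judge is that the score here is grounded in explicit, replayable tool observations rather than a single free-form rating.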
Source: arXiv: 2601.05111