FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
1️⃣ One-Sentence Summary
This paper proposes a new method called FormalJudge, which combines large language models with formal verification: it translates human intent into mathematically provable constraints, providing more reliable guarantees for the behavioral safety of AI agents than traditional probabilistic evaluation.
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that decompose high-level human intent top-down into atomic, verifiable constraints, and compliance is then proved bottom-up using Dafny specifications and Z3 Satisfiability Modulo Theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
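To make the bottom-up proving step concrete, here is a minimal sketch using Z3's Python bindings. It is an illustration under assumptions, not the authors' implementation: the paper pairs Z3 with Dafny specifications, while the trace format, variable names, and the example constraint ("never transfer more than 1000 without approval") below are all hypothetical. The key idea is that a constraint is *proved* over a trace when its negation is unsatisfiable, yielding a guarantee rather than a confidence score.

```python
# A minimal sketch of SMT-based compliance proving with Z3 (pip install z3-solver).
# Everything below (trace schema, constraint, names) is an illustrative assumption.
from z3 import Solver, Int, Bool, Implies, Not, Or, unsat

# Hypothetical agent trace: one (amount, approved) pair per transfer action.
trace = [(250, False), (1500, True), (40, False)]

solver = Solver()
violation_cases = []
for i, (amount, approved) in enumerate(trace):
    amt = Int(f"amount_{i}")
    ok = Bool(f"approved_{i}")
    solver.add(amt == amount, ok == approved)  # ground facts from the trace
    # Atomic constraint compiled from intent: amount > 1000 implies approval.
    violation_cases.append(Not(Implies(amt > 1000, ok)))

# The trace complies iff no action can violate the constraint:
# unsat for the disjunction of violations is a proof of compliance.
solver.add(Or(*violation_cases))
if solver.check() == unsat:
    print("PROVED: trace complies with the constraint")
else:
    print("VIOLATION witness:", solver.model())
```

On this trace the check returns unsat, so compliance is proved; replacing `(1500, True)` with `(1500, False)` would instead yield a satisfying model, i.e. a concrete counterexample the judge can report back for refinement.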
Source: arXiv:2602.11136