arXiv submission date: 2025-12-11
📄 Abstract - The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at this https URL .
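The abstract states that the final suite score is an unweighted average of the four sub-leaderboard scores. A minimal sketch of that aggregation, assuming each sub-leaderboard yields a single normalized score (all names and numbers below are illustrative, not from the paper):

```python
# Hypothetical sketch of the FACTS suite score: the final score is the
# unweighted average of the four sub-leaderboard scores described in the
# abstract. Names and example values are illustrative only.

SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")

def facts_suite_score(scores: dict[str, float]) -> float:
    """Average the four sub-leaderboard scores into one suite score."""
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return sum(scores[name] for name in SUB_LEADERBOARDS) / len(SUB_LEADERBOARDS)

# Example with made-up per-benchmark accuracies:
example = {
    "multimodal": 0.62,
    "parametric": 0.71,
    "search": 0.58,
    "grounding_v2": 0.85,
}
print(round(facts_suite_score(example), 3))
```

An unweighted mean keeps the metric balanced across the four scenarios, so a model cannot dominate the leaderboard by excelling at only one of them.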

Top-level tags: llm benchmark model evaluation
Detailed tags: factuality evaluation multimodal assessment knowledge recall tool-augmented reasoning automated scoring

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality


1️⃣ One-Sentence Summary

This paper introduces the FACTS Leaderboard, a comprehensive online evaluation platform that integrates four independent sub-benchmarks to measure, in a multidimensional and standardized way, the ability of large language models to generate factually accurate text across diverse scenarios.


2️⃣ Key Contributions

1. Comprehensive multidimensional evaluation framework

2. FACTS Score aggregate metric

3. FACTS Grounding (v2) improvements

4. FACTS Multimodal dual-decision evaluation framework

5. FACTS Parametric benchmark design

6. Adversarial sampling mechanism


3️⃣ Main Results and Value

Result Highlights

Practical Value


4️⃣ Glossary

Source: arXiv:2512.10791