FinForge: Semi-Synthetic Financial Benchmark Generation
1️⃣ One-Sentence Summary
This paper proposes a semi-automated framework called FinForge, which combines expert knowledge with AI-based generation to create a high-quality, large-scale financial-domain test set, enabling more accurate evaluation of language models' true capabilities on financial reasoning tasks that demand both specialized knowledge and rigorous computation.
Evaluating Language Models (LMs) in specialized, high-stakes domains such as finance remains a significant challenge due to the scarcity of open, high-quality, and domain-specific datasets. Existing general-purpose benchmarks provide broad coverage but lack the depth and domain fidelity needed to assess LMs' capabilities for real-world financial reasoning, which requires both conceptual understanding and quantitative rigor. To address this gap, we introduce FinForge, a scalable, semi-synthetic pipeline for constructing finance-specific evaluation benchmarks through a hybrid of expert-guided data curation and controlled LM-based synthesis. FinForge combines manual and programmatic corpus construction from authoritative financial sources with structured question generation and validation using Gemini 2.5 Flash. To demonstrate the pipeline's efficacy, we produce FinForge-5k, a snapshot benchmark comprising over 5,000 human-validated question-answer pairs across 11 finance subdomains, derived from a curated corpus of 100,000 verified documents totaling 143M tokens. Evaluation of state-of-the-art open-source and closed-source models on FinForge-5k reveals significant differences in financial reasoning, with leading models achieving accuracy levels near 80%. These findings underscore the framework's utility for diagnosing current model limitations and guiding future improvements in financial domain competence. All code and data are available at this https URL.
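The abstract describes a three-stage pipeline: curate documents from authoritative financial sources, generate structured questions with an LM (Gemini 2.5 Flash in the paper), then validate the results before they enter the benchmark. A minimal sketch of that flow is below; the function names, the template-based question generator, and the trivial validation rule are all illustrative stand-ins for the paper's actual LM calls and human-validation step, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    subdomain: str
    validated: bool = False

def generate_qa(doc: str, subdomain: str) -> QAPair:
    # Stand-in for the LM-based synthesis stage (the paper uses
    # Gemini 2.5 Flash); a template keeps this sketch runnable offline.
    return QAPair(
        question=f"[{subdomain}] According to the source: {doc[:40]}... ?",
        answer="(answer derived from the source document)",
        subdomain=subdomain,
    )

def validate(pair: QAPair) -> bool:
    # Stand-in for the human-validation step: in FinForge this is a
    # manual review; here we only reject empty fields.
    return bool(pair.question and pair.answer)

def build_benchmark(corpus: list[tuple[str, str]]) -> list[QAPair]:
    """Corpus entries are (document_text, subdomain) tuples;
    only validated pairs are kept, mirroring the human filter."""
    benchmark = []
    for doc, subdomain in corpus:
        pair = generate_qa(doc, subdomain)
        pair.validated = validate(pair)
        if pair.validated:
            benchmark.append(pair)
    return benchmark

corpus = [
    ("The bond's yield to maturity is 4.2% given its coupon schedule.",
     "fixed_income"),
    ("A call option's delta approaches 1 deep in the money.",
     "derivatives"),
]
bench = build_benchmark(corpus)
print(len(bench))  # 2
```

In the real pipeline the validation gate is what makes the benchmark "semi-synthetic" rather than fully synthetic: LM output only enters the final set after passing review, which is how the 5,000+ pairs in FinForge-5k are described as human-validated.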
Source: arXiv 2601.06747