arXiv submission date: 2025-12-15
📄 Abstract - FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at this https URL. Supplementary resources are released in a separate repository at this https URL.
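The abstract's task-robustness filtering can be illustrated with a small sketch. The paper does not give its exact formulas here; the Spearman-based monotonicity check and the tail-window signal-to-noise ratio below are plausible stand-ins on a synthetic learning curve, not the authors' definitions.

```python
from statistics import mean, pstdev

def rank(xs):
    """1-based ranks of values (ties not averaged, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1.0
    return r

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction)."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def monotonicity(curve):
    """Rank correlation between training step and task score."""
    return spearman(list(range(len(curve))), curve)

def signal_to_noise(curve, tail=5):
    """Mean of the last `tail` scores over their std dev."""
    window = curve[-tail:]
    return mean(window) / pstdev(window)

# Synthetic learning curve: noisy but steadily improving accuracy.
curve = [0.30, 0.34, 0.33, 0.41, 0.45, 0.44, 0.52, 0.55, 0.54, 0.58]
print(round(monotonicity(curve), 3))   # close to 1.0 for a monotone curve
print(signal_to_noise(curve) > 1.0)    # True: stable late-training scores
```

A task passing such checks across several pretrained models (plus a non-random-performance test and consistent model ordering, as the abstract describes) would be retained in the benchmark.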

Top-level tags: llm benchmark model evaluation
Detailed tags: multilingual evaluation benchmarking language model assessment task robustness huggingface datasets

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models


1️⃣ One-sentence summary

This paper presents FIN-bench-v2, a comprehensive benchmark suite that consolidates multiple Finnish evaluation tasks and applies strict robustness filtering criteria, providing a unified, high-quality public platform for objective and reliable evaluation of Finnish large language models.


Source: arXiv:2512.13330