arXiv submission date: 2026-04-02
📄 Abstract - Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
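The COMET profiling in step (ii) can be run with a few lines of Python. Below is a minimal sketch assuming the open-source unbabel-comet package; the checkpoint names (Unbabel/wmt22-cometkiwi-da for reference-free scoring, Unbabel/wmt22-comet-da for reference-based) and the toy sentence pairs are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Minimal sketch of step (ii): COMET quality profiling.
# Assumes `pip install unbabel-comet`; model names and examples are illustrative.
from comet import download_model, load_from_checkpoint

# Reference-free (quality-estimation) model: scores (source, translation) pairs.
qe_path = download_model("Unbabel/wmt22-cometkiwi-da")
qe_model = load_from_checkpoint(qe_path)

samples = [
    {"src": "The cat sat on the mat.", "mt": "Die Katze saß auf der Matte."},
    {"src": "An apple a day keeps the doctor away.", "mt": "Ein Apfel am Tag."},
]
qe_out = qe_model.predict(samples, batch_size=8, gpus=0)
print(qe_out.scores)        # per-segment scores, usable to flag low-quality items
print(qe_out.system_score)  # corpus-level average

# Reference-based variant (as used for MMLU against human-edited samples):
# each item additionally carries a "ref" field with the gold translation.
ref_path = download_model("Unbabel/wmt22-comet-da")
ref_model = load_from_checkpoint(ref_path)
ref_samples = [
    {"src": "The cat sat on the mat.",
     "mt": "Die Katze saß auf der Matte.",
     "ref": "Die Katze saß auf der Matte."},
]
print(ref_model.predict(ref_samples, batch_size=8, gpus=0).system_score)
```

Per-segment scores from the reference-free model are what make triage scalable: items below a chosen threshold can be routed to the span-level LLM error analysis of step (iii) or to human review.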

Top-level tags: llm, benchmark, natural language processing
Detailed tags: translation quality, automated evaluation, dataset cleaning, COMET metric, multilingual benchmarks

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite


1️⃣ One-Sentence Summary

Using an automated quality-assurance pipeline, this paper systematically evaluates the quality of machine-translated benchmark datasets, finds that benchmarks with lower translation quality contain more errors, and releases cleaned datasets and tooling, providing a practical approach to verifying translation reliability at scale.

Source: arXiv: 2604.01957