arXiv submission date: 2026-04-20
📄 Abstract - MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at this https URL.

Top-level tags: llm benchmark multi-modal
Detailed tags: mathematical reasoning multilingual dataset retrieval benchmark olympiad problems retrieval-augmented generation

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval


1️⃣ One-sentence summary

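This paper introduces MathNet, a large-scale, high-quality, multilingual, and multimodal dataset of Olympiad-level math competition problems with an accompanying benchmark, for comprehensively evaluating AI models on mathematical problem solving, math-aware retrieval, and retrieval-augmented problem solving; the results show that even current state-of-the-art models still face major challenges on these tasks.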

Source: arXiv: 2604.18584