arXiv submission date: 2025-12-16
📄 Abstract - JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.

Top-level tags: multi-modal benchmark model evaluation
Detailed tags: multimodal understanding visual question answering japanese language image generation benchmark construction

JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction


1️⃣ One-sentence summary

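This paper presents JMMMU-Pro, a Japanese multi-discipline image-understanding benchmark, together with "Vibe Benchmark Construction", an efficient method for building such benchmarks in which an advanced image generation model automatically produces the question images and humans verify them; the aim is a more rigorous evaluation of large multimodal models' integrated image-and-text understanding in Japanese.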
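
The generate, verify, and regenerate loop described in the abstract can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' pipeline: `build_item`, `Review`, the `generate` and `review` callables, and the `max_rounds` cap are all hypothetical names introduced here, and no real image-generation API is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    """Outcome of one human check (hypothetical structure)."""
    accepted: bool
    revised_prompt: Optional[str] = None  # adjusted prompt supplied on rejection


def build_item(
    prompt: str,
    generate: Callable[[str], bytes],   # stand-in for an image generative model
    review: Callable[[bytes], Review],  # stand-in for the human verification step
    max_rounds: int = 3,                # assumed retry cap, not taken from the paper
) -> Optional[bytes]:
    """Return the first candidate image the reviewer accepts, or None."""
    for _ in range(max_rounds):
        candidate = generate(prompt)
        verdict = review(candidate)
        if verdict.accepted:
            return candidate
        # Regenerate with the reviewer's adjusted prompt when one was given.
        prompt = verdict.revised_prompt or prompt
    return None
```

Passing the model and the reviewer in as callables keeps the loop independent of any particular image generator or annotation tool; in practice `review` would wrap a human-facing interface rather than an automatic check.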


Source: arXiv: 2512.14620