📄 Paper Summary
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
1️⃣ One-Sentence Summary
This paper introduces Multi-Crit, a benchmark for evaluating how well multimodal models follow diverse, fine-grained evaluation criteria. It finds that current models still fall clearly short in flexibly following multiple criteria and maintaining consistency, laying the groundwork for more reliable multimodal AI evaluation systems.
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
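To make the criterion-level evaluation setup more concrete, here is a minimal, hypothetical sketch of how agreement between a judge's per-criterion verdicts and multi-criterion human annotations could be scored. The data schema, function names, and the strict "all criteria correct" notion of adherence are illustrative assumptions, not the paper's actual metric definitions.

```python
from typing import Dict, List

# Hypothetical record for one response pair: human preference labels per criterion
# ("A", "B", or "tie") and the judge's criterion-level verdicts in the same format.
Sample = Dict[str, Dict[str, str]]  # {"human": {...}, "judge": {...}}


def pluralistic_adherence(samples: List[Sample]) -> float:
    """Fraction of samples where the judge matches the human label on
    *every* annotated criterion (a strict, sample-level notion)."""
    hits = 0
    for s in samples:
        criteria = s["human"].keys()
        if all(s["judge"].get(c) == s["human"][c] for c in criteria):
            hits += 1
    return hits / len(samples) if samples else 0.0


def criterion_level_accuracy(samples: List[Sample]) -> float:
    """Looser per-criterion agreement rate, pooled over all samples."""
    total, hits = 0, 0
    for s in samples:
        for c, label in s["human"].items():
            total += 1
            hits += int(s["judge"].get(c) == label)
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy annotations on two illustrative criteria.
    data = [
        {"human": {"helpfulness": "A", "visual_grounding": "B"},
         "judge": {"helpfulness": "A", "visual_grounding": "B"}},
        {"human": {"helpfulness": "B", "visual_grounding": "B"},
         "judge": {"helpfulness": "A", "visual_grounding": "B"}},
    ]
    print(pluralistic_adherence(data))     # 0.5
    print(criterion_level_accuracy(data))  # 0.75
```

The gap between the strict sample-level score and the pooled per-criterion score illustrates why the paper distinguishes consistent pluralistic adherence from merely being right on individual criteria.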