MLLM-as-a-Judge Exhibits Model Preference Bias
1️⃣ One-sentence summary
This study finds that when multimodal large language models are used to automatically evaluate other models, they exhibit a pronounced "narcissism" bias: they tend to assign higher scores to models from the same lineage as, or similar to, themselves, which can distort model comparisons. A simple ensemble method proposed by the authors effectively mitigates this bias.
Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
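The abstract does not detail how Pomms combines judges, but the core idea of an ensemble that counteracts model-specific preference can be illustrated with a minimal sketch: standardize each judge's scores (removing its overall leniency or strictness) and then average across judges, so that no single judge's self-preference dominates the final ranking. The function names and the z-score normalization here are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only -- NOT the paper's Pomms implementation.
from statistics import mean, pstdev


def zscore(scores: list[float]) -> list[float]:
    """Standardize one judge's scores to remove its overall leniency/strictness."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma else 0.0 for s in scores]


def ensemble(per_judge: dict[str, list[float]]) -> list[float]:
    """Average z-scored judgments across judges, so a single judge's
    model-specific preference cannot dominate the final ranking."""
    normalized = [zscore(v) for v in per_judge.values()]
    n = len(normalized)
    return [sum(col) / n for col in zip(*normalized)]


# Two judges with opposite preferences over three candidate captions:
# after normalization and averaging, their biases cancel out.
scores = {"judge_a": [1.0, 2.0, 3.0], "judge_b": [3.0, 2.0, 1.0]}
print(ensemble(scores))
```

With the toy scores above, the two judges' opposing preferences cancel, yielding a near-zero ensemble score for every caption; in practice each judge would score many captions and the averaged ranking would down-weight any one judge's family-specific bias.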
Source: arXiv: 2604.11589