A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

📄 Abstract

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
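The abstract's observation that signed directional bias can be small even when middle positions underperform can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the abstract does not define Coverage, DPB, or MEG, so the two functions here (`signed_first_last_gap` and `middle_edge_gap`) are hypothetical stand-ins rather than the paper's metrics, and the scores are made-up numbers, not results.

```python
from statistics import mean


def slot_means(scores_by_slot):
    """Average the per-video summary scores observed at each input slot."""
    return [mean(scores) for scores in scores_by_slot]


def signed_first_last_gap(means):
    """Assumed signed measure: first-slot mean minus last-slot mean.

    Positive values favor early slots, negative values favor late slots.
    """
    return means[0] - means[-1]


def middle_edge_gap(means):
    """Assumed middle-vs-edge measure: interior-slot mean minus edge-slot mean.

    Negative values mean the middle slots score lower than the edges.
    """
    if len(means) < 3:
        raise ValueError("need at least three slots to have a middle")
    return mean(means[1:-1]) - mean([means[0], means[-1]])


# Hypothetical quality scores for a four-video input; each inner list holds
# the scores collected when a video was placed in that slot.
scores_by_slot = [[0.75, 0.75], [0.5, 0.5], [0.5, 0.5], [0.75, 0.75]]
means = slot_means(scores_by_slot)

print(signed_first_last_gap(means))  # first and last slots are balanced
print(middle_edge_gap(means))        # middle slots trail the edges
```

With these made-up scores the first and last slots are tied, so a signed first-versus-last measure reports no bias, while the middle-versus-edge measure is clearly negative. This is why a single signed number can hide a symmetric dip in the middle of the input.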


1️⃣ One-sentence summary

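This study finds that when multimodal large language models (MLLMs) summarize multiple videos, summary quality varies with the order in which the videos are supplied (positional bias). The bias differs by video type and by model, and simply adding compute does not eliminate it, which calls for more robust models that are insensitive to input order.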
