arXiv submission date: 2026-06-03
📄 Abstract - A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
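The abstract's premise, that a per-video summary's quality can shift with the video's input slot while the content stays fixed, suggests an order-rotation protocol. Below is a minimal sketch of one way such a check could be set up; it is not the paper's implementation. The helper names (`rotations`, `per_slot_means`, `edge_minus_middle`) and the `score` callable are hypothetical, and `edge_minus_middle` is only an illustrative stand-in, since the abstract does not give the formal definitions of Coverage, DPB, or MEG.

```python
from statistics import mean


def rotations(videos):
    """Return cyclic rotations so each video occupies each slot exactly once."""
    n = len(videos)
    return [videos[k:] + videos[:k] for k in range(n)]


def per_slot_means(orderings, score):
    """Average a per-video quality score by input slot.

    `score(video, slot, ordering)` is a hypothetical callable that would run
    the model on `ordering` and rate the summary produced for `video`.
    """
    n = len(orderings[0])
    by_slot = [[] for _ in range(n)]
    for ordering in orderings:
        for slot, video in enumerate(ordering):
            by_slot[slot].append(score(video, slot, ordering))
    return [mean(values) for values in by_slot]


def edge_minus_middle(slot_means):
    """Illustrative gap (an assumption, not the paper's MEG definition):
    mean of the first and last slots minus mean of the interior slots."""
    if len(slot_means) < 3:
        raise ValueError("needs at least three slots to have a middle")
    edges = [slot_means[0], slot_means[-1]]
    middle = slot_means[1:-1]
    return mean(edges) - mean(middle)
```

Because every video visits every slot once, differences between per-slot averages reflect position rather than content. With two-video inputs there is no interior slot, so a gap of this kind only applies to the four-video setting.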

Top-level tags: multi-modal llm benchmark
Detailed tags: positional bias video summarization evaluation multimodal large language models

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs


1️⃣ One-sentence summary

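This study finds that when multimodal large language models (MLLMs) summarize several videos at once, the quality of each video's summary varies with the order in which the videos are supplied (positional bias); the bias differs across video domains and models, is not uniformly removed by increasing the visual or generation budget, and motivates more robust, order-invariant multimodal systems.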

Source: arXiv: 2606.04596