arXiv submission date: 2026-06-03
📄 Abstract - VCIFBench: Evaluating Complex Instruction Following for Video Understanding

Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.
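To make "joint constraint satisfaction" concrete, here is a minimal sketch of how one output might be scored against several explicit constraints at once. The abstract does not describe the internals of the hybrid verification pipeline, so everything below is an assumption for illustration: the checker names (`is_json_object`, `mentions`, `max_words`), the `evaluate` function, and the example instruction are hypothetical and cover only simple programmatic checks, not the benchmark's actual verifier.

```python
import json
from typing import Callable, Dict

Check = Callable[[str], bool]

# Hypothetical rule-based checkers; not taken from VCIFBench itself.
def is_json_object(text: str) -> bool:
    """Format constraint: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False

def mentions(keyword: str) -> Check:
    """Content constraint: the output must mention a keyword (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def max_words(limit: int) -> Check:
    """Structure constraint: the output must not exceed a whitespace word count."""
    return lambda text: len(text.split()) <= limit

def evaluate(output: str, checks: Dict[str, Check]) -> Dict[str, object]:
    """Score one output: per-constraint results, their mean, and the joint result."""
    per_constraint = {name: check(output) for name, check in checks.items()}
    total = len(per_constraint)
    rate = sum(per_constraint.values()) / total if total else 0.0
    return {
        "per_constraint": per_constraint,
        "constraint_rate": rate,
        # Joint satisfaction requires every constraint to hold simultaneously.
        "joint": total > 0 and all(per_constraint.values()),
    }

# Illustrative output and constraints (made up for this sketch).
output = '{"summary": "A chef slices onions and fries them."}'
checks = {
    "format:json": is_json_object,
    "content:mentions_onions": mentions("onions"),
    "structure:max_5_words": max_words(5),
}
result = evaluate(output, checks)
print(result)
```

In this example the output passes the format and content checks but exceeds the word limit, so two of three constraints hold while the joint result is false. That gap between per-constraint and joint scores is the kind of difficulty the abstract points to.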

Top-level tags: multi-modal model evaluation video
Detailed tags: video understanding instruction following benchmark constraints dpo

VCIFBench: Evaluating Complex Instruction Following for Video Understanding


1️⃣ One-sentence summary

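The paper introduces VCIFBench, a benchmark designed to evaluate whether multimodal large language models can accurately follow complex instructions in video understanding tasks when those instructions combine content, format, style, and structure constraints. Its experiments show that current models still fall short in this respect, and that fine-tuning on the benchmark's data can improve model performance.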

Source: arXiv: 2606.04588