arXiv submission date: 2026-01-20
📄 Abstract - FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (this https URL) and datasets (this https URL).
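The released evaluation code is only linked above, so the following is a minimal sketch of how multiple-choice accuracy over FutureOmni-style QA pairs could be computed. The file name `futureomni_qa.json`, the item schema (including the `answer` field), and the `predict_choice` callable are hypothetical assumptions for illustration, not the paper's actual interface.

```python
# Minimal sketch (not the authors' released code): scoring multiple-choice
# accuracy on FutureOmni-style QA items. The JSON file name, item schema,
# and the `predict_choice` callable are hypothetical placeholders.
import json
from typing import Callable, Dict, List


def evaluate_accuracy(
    items: List[Dict],
    predict_choice: Callable[[Dict], str],
) -> float:
    """Return the fraction of items where the predicted option letter matches the key."""
    correct = 0
    for item in items:
        # Each hypothetical item is assumed to carry a video reference, a
        # question, lettered options, and a ground-truth letter under "answer".
        prediction = predict_choice(item)
        if prediction.strip().upper() == item["answer"].strip().upper():
            correct += 1
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    with open("futureomni_qa.json", "r", encoding="utf-8") as f:
        qa_items = json.load(f)

    # Trivial baseline: always pick option "A". A real run would instead query
    # an omni-modal model with the video, audio track, question, and options.
    accuracy = evaluate_accuracy(qa_items, lambda item: "A")
    print(f"Accuracy: {accuracy:.1%} over {len(qa_items)} questions")
```

As a usage note, reported benchmark numbers such as the 64.8% best accuracy correspond to this kind of exact-match scoring over the 1,034 multiple-choice questions, with the model's choice extracted from its free-form response.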

Top-level tags: multi-modal benchmark model evaluation
Detailed tags: future forecasting audio-visual reasoning instruction tuning multimodal llms temporal reasoning

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs


1️⃣ One-Sentence Summary

This paper introduces FutureOmni, the first benchmark for evaluating multimodal large models' ability to forecast future events from audio and visual cues, finds that existing models perform poorly on this task, and proposes an effective training strategy to improve their forecasting ability.

Source: arXiv:2601.13836