TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning
1️⃣ One-Sentence Summary
This paper introduces TwiFF-2.7M, the first large-scale dataset for dynamic visual question answering, together with the evaluation benchmark TwiFF-Bench, and develops a model that supports reasoning by generating future video frames, significantly improving AI's ability to understand dynamic video content and perform complex reasoning.
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction following, prediction, and camera-motion understanding. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset, derived from $2.7$ million video clips and explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1,078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video-generation and image-comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames interleaved with textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and textual Chain-of-Thought baselines on dynamic reasoning tasks, validating its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at this https URL.
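The abstract describes TwiFF as alternating between generating a future action frame and producing a textual reasoning step. The loop below is a minimal illustrative sketch of that interleaved process, not the paper's actual implementation; all helper names (`generate_future_frame`, `reason_over`, `twiff_reason`) are hypothetical placeholders standing in for the model's video-generation and image-comprehension components.

```python
# Hedged sketch of the interleaved VCoT loop described in the abstract.
# The two helpers are placeholder stubs, NOT the paper's real API.

def generate_future_frame(frames):
    # Placeholder: a real implementation would invoke the pre-trained
    # video-generation head conditioned on the frames seen so far.
    return f"frame_{len(frames)}"

def reason_over(question, frames):
    # Placeholder: a real implementation would run the image-comprehension
    # model over the (extended) frame sequence to emit a reasoning step.
    return f"step {len(frames)}: reasoning about {question!r}"

def twiff_reason(question, frames, max_steps=4):
    """Alternate future-frame generation with textual reasoning,
    accumulating a temporally grounded chain-of-thought trajectory."""
    trajectory = []  # list of (predicted_frame, reasoning_text) pairs
    for _ in range(max_steps):
        frames = frames + [generate_future_frame(frames)]
        trajectory.append((frames[-1], reason_over(question, frames)))
    return trajectory
```

The key design point the abstract emphasizes is that each reasoning step conditions on the newly predicted frame, so the textual chain stays grounded in the evolving visual context rather than in the initial observation alone.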
Source: arXiv:2602.10675