arXiv submission date: 2025-12-11
📄 Abstract - Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities while balancing the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework are an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at this https URL.

Top-level tags: multi-modal video agents
Detailed tags: video question answering, spatiotemporal reasoning, tool-augmented agents, large multimodal models, benchmark evaluation

STAR: A Spatiotemporal Reasoning Framework for Video Question Answering / Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task


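1️⃣ One-Sentence Summary

This paper proposes STAR, a training-free, user-friendly agentic reasoning framework. It equips large multimodal models with a comprehensive video toolkit and alternates between temporal and spatial tool calls to progressively localize the key 3D region in a video, significantly improving both accuracy and efficiency on complex video question answering tasks.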


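2️⃣ Key Contributions

1. Comprehensive Video Toolkit

2. Spatiotemporal Reasoning Framework (STAR)

3. Interleaved temporal and spatial tool strategy

4. 3D RoI localization mechanism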
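
The interleaving and 3D RoI ideas listed above can be illustrated with a small toy sketch. This is not the paper's implementation: the `RoI3D` class, the `interleave` function, and the two stand-in tools are hypothetical names introduced only for illustration, and the assumption that a 3D RoI is a frame interval plus a pixel bounding box is ours. Real temporal tools (for example, a frame retriever) and spatial tools (for example, an object detector) would replace the toy functions.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RoI3D:
    """Hypothetical 3D region of interest: a frame interval plus a bounding box."""
    t_start: int                      # first frame (inclusive)
    t_end: int                        # last frame (exclusive)
    box: Tuple[int, int, int, int]    # (x0, y0, x1, y1) in pixels

    def volume(self) -> int:
        x0, y0, x1, y1 = self.box
        return (self.t_end - self.t_start) * (x1 - x0) * (y1 - y0)


# Assumed tool interface: a tool takes the current RoI and returns a narrower one.
Tool = Callable[[RoI3D], RoI3D]


def interleave(roi: RoI3D, temporal: List[Tool], spatial: List[Tool]) -> List[RoI3D]:
    """Alternate temporal and spatial tools, recording the RoI after every call."""
    trace = [roi]
    for t_tool, s_tool in zip(temporal, spatial):
        roi = t_tool(roi)   # narrow along the time axis first
        trace.append(roi)
        roi = s_tool(roi)   # then narrow within the frame
        trace.append(roi)
    return trace


def middle_half_frames(roi: RoI3D) -> RoI3D:
    """Toy temporal tool: keep the middle half of the frame interval."""
    q = (roi.t_end - roi.t_start) // 4
    return replace(roi, t_start=roi.t_start + q, t_end=roi.t_end - q)


def center_half_box(roi: RoI3D) -> RoI3D:
    """Toy spatial tool: keep the central half of the box along each axis."""
    x0, y0, x1, y1 = roi.box
    dx, dy = (x1 - x0) // 4, (y1 - y0) // 4
    return replace(roi, box=(x0 + dx, y0 + dy, x1 - dx, y1 - dy))


start = RoI3D(0, 1600, (0, 0, 640, 480))
trace = interleave(start, [middle_half_frames] * 2, [center_half_box] * 2)
print(trace[-1])  # RoI3D(t_start=600, t_end=1000, box=(240, 180, 400, 300))
```

In this toy run, two temporal-then-spatial rounds shrink the region's volume by a factor of 64, which illustrates how alternating the two tool types narrows the search along time and space in turn rather than along one axis only.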


3️⃣ Main Results and Value

Result Highlights

Practical Value


4️⃣ Glossary

Source: arXiv: 2512.10359