arXiv submission date: 2026-02-07
📄 Abstract - Process-of-Thought Reasoning for Videos

Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
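
The three interleaved stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Segment`, `TraceStep`, `State`, `select_evidence`, `update_state`, `synthesize_answer`, `process_of_thought`) is hypothetical, captions stand in for the output of a vision-language backbone, and word overlap and substring matching stand in for the model's actual relevance and reasoning steps.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    caption: str  # assumption: stand-in for a backbone's description of the clip


@dataclass
class TraceStep:
    segment: Segment  # temporal evidence this step is grounded in
    note: str         # intermediate decision recorded for this step


@dataclass
class State:
    votes: dict = field(default_factory=dict)  # candidate answer -> support count
    trace: list = field(default_factory=list)  # list of TraceStep


def select_evidence(segments, question, k=2):
    """(i) Temporal evidence selection: keep the k segments whose captions
    share the most words with the question, returned in time order."""
    q_words = set(question.lower().split())

    def score(seg):
        return len(q_words & set(seg.caption.lower().split()))

    top_k = sorted(segments, key=score, reverse=True)[:k]
    return sorted(top_k, key=lambda s: s.start)


def update_state(state, segment, candidates):
    """(ii) Step-wise state update: add support for each candidate mentioned
    in the segment, and log the decision together with its segment."""
    caption = segment.caption.lower()
    for cand in candidates:
        if cand.lower() in caption:
            state.votes[cand] = state.votes.get(cand, 0) + 1
            state.trace.append(TraceStep(segment, f"supports '{cand}'"))
    return state


def synthesize_answer(state, candidates):
    """(iii) Constrained answer synthesis: the answer must be one of the
    given candidates; ties resolve to the earliest candidate in the list."""
    return max(candidates, key=lambda c: state.votes.get(c, 0))


def process_of_thought(segments, question, candidates, k=2):
    state = State()
    for seg in select_evidence(segments, question, k):
        state = update_state(state, seg, candidates)
    return synthesize_answer(state, candidates), state.trace


# Usage with made-up data
segments = [
    Segment(0, 5, "a man opens the fridge"),
    Segment(5, 10, "the man picks up an apple from the fridge"),
    Segment(10, 15, "a dog runs in the yard"),
]
answer, trace = process_of_thought(
    segments, "what does the man pick from the fridge", ["apple", "banana"]
)
for step in trace:
    print(f"[{step.segment.start}-{step.segment.end}s] {step.note}")
print("answer:", answer)
```

Each trace entry keeps a reference to the segment that produced it, which is one simple way to get the alignment between intermediate decisions and temporal segments that the abstract describes. The distractor segment (the dog in the yard) is filtered out at the selection stage and never reaches the state update.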

Top-level tags: multi-modal, video, model evaluation
Detailed tags: video reasoning, process-of-thought, temporal grounding, interpretability, vision-language models

Process-of-Thought Reasoning for Videos


1️⃣ One-sentence summary

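This paper proposes a video reasoning framework called "Process-of-Thought" that decomposes complex video understanding tasks into a sequence of verifiable steps, making the reasoning process clearer and more accurate and reducing errors, while remaining applicable to different existing models.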

Source: arXiv: 2602.07689