Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
1️⃣ One-Sentence Summary
This paper proposes a new framework called Think-as-You-See, which lets large vision-language models reason in real time while receiving image frames, much as a human watches a video stream, substantially improving response speed and efficiency of video processing while maintaining high accuracy.
Large Vision-Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose \textbf{Think-as-You-See (TaYS)}, a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at \href{this https URL}{this repository}.
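The abstract does not spell out how TaYS builds its streaming attention masks, but the idea it describes can be illustrated with a minimal sketch: tokens are laid out in arrival order, text (reasoning) tokens attend causally to everything that has already arrived, while visual tokens never attend to text, so visual encoding stays independent of the reasoning stream (in the spirit of the dual KV-cache). The token layout and helper name below are hypothetical, for illustration only, and are not the paper's actual implementation.

```python
# Hypothetical streaming token layout: "v" = visual token, "t" = text
# (reasoning) token. Here two frames of two visual tokens each arrive,
# with reasoning tokens emitted after each frame.
tokens = ["v", "v", "t", "t", "v", "v", "t", "t"]

def streaming_mask(tokens):
    """Return a boolean mask where mask[q][k] is True iff query token q
    may attend to key token k.

    Sketch of the decoupling described in the abstract:
      - causality: no token attends to a future token;
      - text tokens attend to all earlier tokens (visual and text);
      - visual tokens attend only to earlier visual tokens, never to
        text, so the visual stream can be encoded independently.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys up to and including q
            if tokens[q] == "v" and tokens[k] == "t":
                continue        # visual stream ignores the text stream
            mask[q][k] = True
    return mask

mask = streaming_mask(tokens)
```

Under this sketch, a reasoning token emitted after frame 2 can see both frames, but the visual tokens of frame 2 are encoded without ever reading earlier reasoning, which is what makes a decoupled (dual) KV-cache possible.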
Source: arXiv: 2603.02872