Streaming Video Instruction Tuning

📄 Abstract - Streaming Video Instruction Tuning

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

1️⃣ One-Sentence Summary

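This paper presents Streamo, a real-time streaming video large language model. Trained on a purpose-built, large-scale instruction dataset, it acts as a general-purpose assistant that handles multiple tasks on a video stream in real time, such as real-time narration, action understanding, and time-sensitive question answering, thereby bridging the gap between traditional offline video analysis and real-time intelligent interaction.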
