arXiv submission date: 2026-05-21
📄 Abstract - MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.
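The data flow described in the abstract (captions at three levels, merged into an MSTED, which is then the only input to Q&A generation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `annotate` function, the `Annotation` type, the prompt wording, and the `generate` callable are all hypothetical, and the caption level names and task formats are passed in as parameters because the abstract does not enumerate them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for any VLM/LLM call: prompt text in, generated text out.
Generate = Callable[[str], str]


@dataclass
class Annotation:
    """Illustrative container for one video's output (not from the paper)."""
    msted: str                                          # intermediate event description
    qa: Dict[str, str] = field(default_factory=dict)    # task format -> generated item


def annotate(
    video_id: str,
    event_of_focus: str,
    generate: Generate,
    caption_levels: List[str],
    task_formats: List[str],
) -> Annotation:
    # Stage 1: one caption per level, each centred on the Event of Focus.
    captions = [
        generate(f"[{level}] Describe '{event_of_focus}' in video {video_id}")
        for level in caption_levels
    ]

    # Stage 2: merge the captions into a single explicit intermediate (the MSTED).
    msted = generate("Synthesize one event description from:\n" + "\n".join(captions))

    # Stage 3: Q&A generation sees only the MSTED, never the raw captions.
    qa = {
        task: generate(f"Write a {task} item with step-by-step reasoning from:\n{msted}")
        for task in task_formats
    }
    return Annotation(msted=msted, qa=qa)
```

Keeping the MSTED as the sole input to the last stage means an error found in a generated question can be traced back either to the description or to the captions behind it, which is the kind of stage-level attribution the refinement loop in the abstract relies on.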

Top-level tags: multi-modal, agents, model training
Detailed tags: video reasoning, vision language model, chain-of-thought, domain adaptation, annotations

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks


1️⃣ One-sentence summary

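This paper presents MAVEN, an automated pipeline that, like an intelligent director, breaks raw videos down into detailed event descriptions covering when, where, why, and with what consequence, and generates high-quality training data from them, so that a small model (Cosmos-Reason2-8B) fine-tuned on traffic videos surpasses Gemini 2.5 Pro and 3.1 Flash on a private CCTV evaluation set and, with CCTV-only training, matches Gemini 2.5 Pro on AccidentBench.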

Source: arXiv: 2605.21917