arXiv submission date: 2026-05-20
📄 Abstract - OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments in long, untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that leverages the strong video-language reasoning capability of a multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from an existing localization model, OSGNet, and then employ an MLLM to select the segment that best matches the given query, thereby refining the final prediction. Our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at this https URL.
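The two-stage idea in the abstract (candidates from a localizer, then selection by an MLLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `Selector` interface, the `top_k` cutoff, and the `toy_selector` stand-in are all hypothetical names introduced here, and a real system would replace the selector with an actual MLLM call over the video clips.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate temporal segment (in seconds) with the localizer's confidence."""
    start: float
    end: float
    score: float


# Hypothetical interface: given a query and candidate segments, return the index
# of the best-matching segment. A real system would query an MLLM here.
Selector = Callable[[str, Sequence[Candidate]], int]


def rerank(query: str, candidates: List[Candidate], select: Selector,
           top_k: int = 5) -> List[Candidate]:
    """Keep the top_k candidates by localizer score, then move the selector's pick to the front."""
    if not candidates:
        return []
    shortlist = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    choice = select(query, shortlist)
    if not 0 <= choice < len(shortlist):
        # Fall back to the localizer's own ranking if the selector's answer is invalid.
        return shortlist
    return [shortlist[choice]] + shortlist[:choice] + shortlist[choice + 1:]


def toy_selector(query: str, segments: Sequence[Candidate]) -> int:
    """Stand-in for an MLLM: prefers the shortest segment. Purely illustrative."""
    return min(range(len(segments)), key=lambda i: segments[i].end - segments[i].start)


candidates = [
    Candidate(10.0, 40.0, 0.9),
    Candidate(12.0, 18.0, 0.7),
    Candidate(100.0, 130.0, 0.2),
]
reranked = rerank("where did I put the keys?", candidates, toy_selector, top_k=2)
```

Restricting the selector to a short list reflects the efficiency argument in the abstract: the expensive model only sees a few segments rather than the whole video. Keeping the remaining candidates in their original order behind the selected one is a design choice of this sketch, not something the abstract specifies.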

Top-level tags: computer vision, multi-modal
Detailed tags: egocentric video, temporal localization, video-language reasoning, reranking, Ego4D challenge

OSGNet with Multimodal Large Language Model Reranking: A Solution for the Ego4D Episodic Memory Challenge 2026 / OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026


1️⃣ One-Sentence Summary

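This paper proposes a two-stage framework that combines the conventional localization model OSGNet with a multimodal large language model (MLLM): OSGNet first quickly generates candidate video segments, and the MLLM then selects the segment that best matches the natural language query. This significantly improves event localization accuracy in long first-person videos without heavy computation, and the method took first place in both the Natural Language Queries and GoalStep tracks.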

Source: arXiv: 2605.20818