Beyond Caption-Based Queries for Video Moment Retrieval
1️⃣ One-sentence summary
This paper finds that existing video moment retrieval models trained on caption-style queries degrade significantly when handling more concise search queries or multi-moment queries; by analyzing the root causes and modifying the model architecture, it substantially improves retrieval accuracy in these practical scenarios.
In this work, we investigate the degradation of existing VMR methods, particularly DETR-based architectures, when trained on caption-based queries but evaluated on search queries. To this end, we introduce three benchmarks by modifying the textual queries in three public VMR datasets: HD-EPIC, YouCook2, and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and by up to 21.83% mAP_m on multi-moment search queries. The code, models, and data are available on the project webpage: this https URL
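To make the notion of "active decoder queries" concrete, here is a minimal, hypothetical sketch (not the paper's actual diagnostic): given per-query confidence scores from a DETR-style decoder across a set of videos, it counts how many of the learned decoder queries ever produce a confident prediction. All names, shapes, and the threshold are illustrative assumptions; a collapsed decoder would show only a small fraction of queries active.

```python
import numpy as np

def count_active_queries(scores: np.ndarray, threshold: float = 0.5) -> int:
    """Count decoder queries that ever produce a confident prediction.

    scores: hypothetical array of shape (num_videos, num_queries) holding
    each decoder query's top confidence on each video.
    A query counts as "active" if it exceeds `threshold` on at least one video.
    """
    active = (scores > threshold).any(axis=0)
    return int(active.sum())

# Toy illustration: 4 videos, 10 decoder queries, but only queries 0 and 3
# ever fire confidently -- the kind of collapse the paper describes would
# look like this (most queries never contribute a prediction).
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 0.3, size=(4, 10))  # all below threshold
scores[:, 0] = 0.9
scores[:, 3] = 0.8
print(count_active_queries(scores))  # 2 of 10 queries are active
```

Under this framing, the paper's architectural modifications can be read as interventions that raise this count, so more queries remain available to cover multiple moments per query.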
Source: arXiv:2603.02363