Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

📄 Abstract - Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on a moment proposal selection framework that uses contrastive learning and reconstruction paradigms to score pre-defined moment proposals. Although they have achieved significant progress, we argue that these frameworks overlook two important issues: 1) Coarse-grained cross-modal learning: previous methods capture only the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words needed to ground moment boundaries accurately. 2) Complex moment proposals: their performance relies heavily on the quality of the proposals, and selecting proposals is itself time-consuming and complicated. To this end, we make the first attempt to tackle this task from a game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we model each video frame and query word as game players under multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we use the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
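To make the idea of a game-theoretic interaction between a frame and a word concrete, the following is a minimal sketch that computes the pairwise Shapley interaction index on a toy cooperative game. It is an illustration only: the abstract does not state which interaction index the paper uses, so the Shapley interaction index is an assumption here, and the similarity table `SIM`, the player names, and the `coalition_score` function are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(players, value, i, j):
    """Pairwise Shapley interaction index between players i and j.

    `value` maps a frozenset of players (a coalition) to a real score.
    Used here as an assumed stand-in for the paper's interaction measure.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of every coalition S of this size that excludes i and j.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Extra score from adding i and j together, beyond adding each alone.
            gain = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * gain
    return total


# Hypothetical frame-word similarities (two frames, two query words).
SIM = {
    ("f0", "w0"): 0.9,
    ("f0", "w1"): 0.1,
    ("f1", "w0"): 0.2,
    ("f1", "w1"): 0.7,
}
PLAYERS = ["f0", "f1", "w0", "w1"]


def coalition_score(coalition):
    """Toy value function: sum of similarities of frame-word pairs inside the coalition."""
    return sum(s for (f, w), s in SIM.items() if f in coalition and w in coalition)


f0_w0 = shapley_interaction(PLAYERS, coalition_score, "f0", "w0")
f0_f1 = shapley_interaction(PLAYERS, coalition_score, "f0", "f1")
print(f"interaction(f0, w0) = {f0_w0:.3f}")
print(f"interaction(f0, f1) = {f0_f1:.3f}")
```

Because this toy value function is purely additive over frame-word pairs, the index for a frame-word pair recovers that pair's similarity and the index for two frames is zero, which serves as a sanity check. With a learned, non-additive similarity score the index would instead reflect how much a frame and a word gain from cooperating within larger coalitions. Exact enumeration is exponential in the number of players, so a practical system would need sampling or a learned approximation.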

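1️⃣ One-sentence summary

This paper proposes a new game-theoretic approach to weakly-supervised video temporal grounding: it treats video frames and query words as players in a game and uses multivariate cooperative game theory to learn their multi-level, fine-grained correspondences, which allows it to localize the target time interval more accurately without relying on complex moment proposals.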
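As a rough illustration of proposal-free localization from frame-wise scores, the sketch below thresholds per-frame scores and returns the contiguous run with the largest total score. The abstract does not describe the paper's actual decoding rule, so the thresholding scheme, the function name `localize_moment`, and the `fps` and `ratio` parameters are assumptions made for this example.

```python
def localize_moment(frame_scores, fps=1.0, ratio=0.5):
    """Return (start_sec, end_sec) of the highest-mass run of frames above a threshold.

    Assumed decoding rule, not the paper's: the threshold sits at `ratio` of the
    way between the minimum and maximum score, with 0 <= ratio <= 1.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")

    lo, hi = min(frame_scores), max(frame_scores)
    threshold = lo + ratio * (hi - lo)

    best = None   # (total score, start index, end index exclusive)
    start = None  # start index of the run currently being scanned
    # A trailing sentinel below any threshold closes a run that reaches the last frame.
    for idx, score in enumerate(list(frame_scores) + [float("-inf")]):
        if score >= threshold and start is None:
            start = idx
        elif score < threshold and start is not None:
            mass = sum(frame_scores[start:idx])
            if best is None or mass > best[0]:
                best = (mass, start, idx)
            start = None

    _, s, e = best
    return s / fps, e / fps


# Hypothetical query-guided scores for 8 frames sampled at 2 frames per second.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.6, 0.1]
print(localize_moment(scores, fps=2.0))
```

Choosing the run by total score rather than by length favors short but confident segments over long, weakly scored ones; either choice is a design decision that the abstract leaves open.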
