CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
1️⃣ One-Sentence Summary
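This paper proposes cross-modal token modulation, a method that strengthens the interaction between appearance and motion information in video and adds a token masking strategy to improve learning efficiency, achieving state-of-the-art performance on unsupervised video object segmentation.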
Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
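The abstract gives no implementation details, so the following is only a toy sketch of the general idea: tokens from the two modalities are randomly masked, concatenated, and passed through one round of self-attention so that every token can attend to tokens of both modalities. Everything here is an assumption made for illustration: the function names (`mask_tokens`, `joint_attention`, `cross_modal_mix`), the random-drop masking scheme, the `mask_ratio` parameter, and the use of single-head attention with identity projections in place of the paper's relation transformer blocks. It is not the authors' implementation.

```python
import math
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Keep a random subset of tokens (hypothetical masking scheme)."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head dot-product self-attention with identity projections.

    Stand-in for a learned transformer block: each output token is a
    softmax-weighted average of all input tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def cross_modal_mix(appearance, motion, mask_ratio=0.5, seed=0):
    """Mask each modality, then attend over the concatenated token set.

    Concatenation gives both intra-modal and inter-modal connections
    in a single attention pass (an assumption, not the paper's design).
    """
    rng = random.Random(seed)
    a = mask_tokens(appearance, mask_ratio, rng)
    m = mask_tokens(motion, mask_ratio, rng)
    mixed = joint_attention(a + m)
    return mixed[: len(a)], mixed[len(a):]


# Toy usage: four 2-D tokens per modality, half of them masked out.
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
motion = [[0.0, 2.0], [2.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]
a_out, m_out = cross_modal_mix(appearance, motion, mask_ratio=0.5)
```

Because the attention weights form a convex combination, each output coordinate stays within the range of the input values; a real model would add learned query, key, and value projections, multiple heads, and residual connections.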
Source: arXiv: 2604.14630