arXiv submission date: 2026-04-16

📄 Abstract - CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
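The abstract does not spell out how the relation transformer blocks or the token masking strategy are built, so the following is only a minimal sketch of the general idea under stated assumptions: appearance and motion tokens are concatenated into one set, a single attention step over that set lets every token attend to tokens of both modalities (intra-modal and inter-modal propagation at once), and a random subset of tokens is hidden as keys/values. The function name `joint_token_attention`, the `mask_ratio` parameter, the single head, and the absence of learned projections are all illustrative assumptions, not the authors' design.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf get weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_token_attention(app_tokens, mot_tokens, mask_ratio=0.0, rng=None):
    """Hypothetical single-head attention over appearance + motion tokens.

    app_tokens: (Na, d) appearance tokens; mot_tokens: (Nm, d) motion tokens.
    mask_ratio: fraction of tokens hidden as keys/values (assumed masking scheme).
    """
    rng = np.random.default_rng() if rng is None else rng
    tokens = np.concatenate([app_tokens, mot_tokens], axis=0)  # (Na + Nm, d)
    n, d = tokens.shape

    # Pick tokens to mask, always keeping at least one visible token.
    n_masked = min(int(mask_ratio * n), n - 1)
    keep = np.ones(n, dtype=bool)
    keep[rng.choice(n, size=n_masked, replace=False)] = False

    # Dense pairwise scores: every token is compared with every other token,
    # regardless of modality. No learned Q/K/V projections in this sketch.
    scores = tokens @ tokens.T / np.sqrt(d)
    scores[:, ~keep] = -np.inf  # masked tokens cannot be attended to
    weights = softmax(scores, axis=-1)

    out = weights @ tokens
    n_app = app_tokens.shape[0]
    return out[:n_app], out[n_app:], weights, keep
```

A short usage example with random inputs; the token counts and feature size are arbitrary:

```python
rng = np.random.default_rng(0)
appearance = rng.normal(size=(4, 8))  # 4 appearance tokens, 8-dim
motion = rng.normal(size=(6, 8))      # 6 motion tokens, 8-dim
app_out, mot_out, weights, keep = joint_token_attention(
    appearance, motion, mask_ratio=0.5, rng=rng
)
print(app_out.shape, mot_out.shape)
```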

Top-level tags: computer vision, multi-modal, model training

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

1️⃣ One-sentence summary

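This paper proposes cross-modal token modulation, a method that strengthens the interaction between appearance and motion cues in video and adds a token masking strategy to improve learning efficiency, achieving state-of-the-art performance on unsupervised video object segmentation.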

Source: arXiv: 2604.14630