arXiv submission date: 2025-12-16
📄 Abstract - S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement of parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise, and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video, leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state-of-the-art across various benchmarks.
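The abstract names two key ingredients: selecting sparse, high-quality keymasks with a motion prior, and a Temporal DropLoss used during Sparse-To-Dense Distillation. The paper page carries no code, so the following is only a minimal, hypothetical PyTorch sketch of how those two pieces could fit together; all names (`select_keymasks`, `temporal_drop_loss`, `motion_scores`, `drop_ratio`) and the shape conventions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of keymask selection and a Temporal DropLoss,
# loosely following the S2D abstract. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def select_keymasks(frame_masks, motion_scores, top_k=5):
    """Pick the top-k frames whose single-frame masks agree best with a
    deep motion prior (e.g. an optical-flow consistency score).

    frame_masks:   list/tensor of per-frame instance masks, indexed by t
    motion_scores: one scalar quality score per frame (assumed given)
    returns:       {frame_index: mask} sparse keymask pseudo-annotations
    """
    scores = torch.as_tensor(motion_scores, dtype=torch.float32)
    k = min(top_k, scores.numel())
    keyframe_ids = torch.topk(scores, k=k).indices
    return {int(t): frame_masks[int(t)] for t in keyframe_ids}

def temporal_drop_loss(pred_logits, pseudo_masks, has_keymask, drop_ratio=0.3):
    """Sparse-to-dense distillation loss: supervise only frames that carry
    a keymask pseudo-annotation, and randomly drop a fraction of those so
    the model must propagate masks implicitly across time instead of
    memorizing per-frame labels.

    pred_logits:  (T, C, H, W) per-frame mask logits
    pseudo_masks: (T, H, W) long tensor of pseudo-labels (unused where invalid)
    has_keymask:  (T,) bool, True where a keymask pseudo-annotation exists
    """
    keep = has_keymask & (torch.rand(has_keymask.shape) >= drop_ratio)
    if not keep.any():          # avoid an empty supervision set
        keep = has_keymask
    return F.cross_entropy(pred_logits[keep], pseudo_masks[keep])
```

Under this reading, randomly dropping supervised frames forces the model to predict masks for unsupervised timesteps from temporal context, which is what turns the sparse keymask set into a dense label set for the final training stage.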

Top-level tags: computer vision, video, model training
Detailed tags: video instance segmentation, unsupervised learning, mask propagation, temporal coherence, distillation

S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation


1️⃣ One-Sentence Summary

This paper proposes a new method trained exclusively on real video data: it identifies high-quality keyframe segmentation masks and uses them to guide model learning, achieving better video instance segmentation than existing methods without any manual annotation.


Source: arXiv:2512.14440