Scene-Centric Unsupervised Video Panoptic Segmentation
1️⃣ One-sentence summary
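This paper proposes VideoCUPS, the first unsupervised video panoptic segmentation method that requires no human annotation. It automatically generates pseudo-labels from depth, motion, and visual cues in videos and trains a model with a novel loss function, significantly outperforming existing methods on multiple benchmarks and opening a new direction for unsupervised video understanding.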
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding work has mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
Source: arXiv: 2606.04925