2nd Place in the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
1️⃣ One-sentence summary
We propose a resource-efficient audio-guided video object segmentation method: audio is converted into textual descriptions and fed to an existing text-based segmentation model, while an audio no-target detection module filters out irrelevant instructions, achieving strong performance with reduced compute and data requirements.
Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
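The pipeline described above (ASR transcription, no-target filtering, then text-based segmentation) can be sketched as control flow. This is a minimal illustration, not the authors' implementation: all three components are hypothetical stubs standing in for the real ASR model, the fine-tuned audio MLLM detector, and SaSaSa2VA.

```python
# Hypothetical sketch of the ASR-SaSaSa2VA pipeline control flow.
# Each component below is a stub; the real system would call an ASR model,
# a fine-tuned audio-based MLLM, and SaSaSa2VA respectively.

def transcribe_audio(audio_clip: str) -> str:
    """Stub ASR: convert an audio referring expression into text."""
    # Placeholder transcriptions for illustration only.
    return {
        "clip_dog": "the dog running to the left",
        "clip_noise": "background chatter",
    }.get(audio_clip, "")

def is_no_target(audio_clip: str) -> bool:
    """Stub no-target detector: flag clips that refer to no object."""
    return audio_clip == "clip_noise"

def segment_by_text(video: str, expression: str) -> list[str]:
    """Stub text-based referring segmentation (SaSaSa2VA in the paper)."""
    # Pretend the video has two frames; return one mask label per frame.
    return [f"mask({video}, frame={t}, expr={expression!r})" for t in range(2)]

def asr_sasasa2va(video: str, audio_clip: str) -> list[str]:
    # Step 1: filter out audio that does not refer to any target object.
    if is_no_target(audio_clip):
        return []  # empty prediction for no-target expressions
    # Step 2: convert the audio cue into a textual motion description.
    expression = transcribe_audio(audio_clip)
    # Step 3: reuse a pre-trained text-based segmentation model.
    return segment_by_text(video, expression)

print(asr_sasasa2va("video1", "clip_dog"))    # two per-frame masks
print(asr_sasasa2va("video1", "clip_noise"))  # [] (filtered out)
```

The design point this sketch captures is that no audio-visual fusion is trained end to end: the audio branch only produces text (or a no-target verdict), so the expensive segmentation backbone can be reused unchanged.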
From arXiv: 2604.23935