2nd Place in the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
1️⃣ One-sentence summary
We propose a resource-efficient audio-guided video object segmentation method: audio is converted into textual descriptions and fed to an existing text-based segmentation model, while an audio no-target detection module filters out irrelevant instructions, achieving strong performance with reduced compute and data requirements.
Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.
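The pipeline described above (ASR transcription, no-target filtering, then text-based segmentation) can be sketched as control flow. This is a minimal illustration, not the authors' implementation: all three components are hypothetical stubs standing in for the real ASR model, the fine-tuned audio MLLM detector, and SaSaSa2VA.

```python
# Hypothetical sketch of the ASR-SaSaSa2VA pipeline control flow.
# Each component below is a stub; the real system would call an ASR model,
# a fine-tuned audio-based MLLM, and SaSaSa2VA respectively.

def transcribe_audio(audio_clip: str) -> str:
    """Stub ASR: convert an audio referring expression into text."""
    # Placeholder transcriptions for illustration only.
    return {
        "clip_dog": "the dog running to the left",
        "clip_noise": "background chatter",
    }.get(audio_clip, "")

def is_no_target(audio_clip: str) -> bool:
    """Stub no-target detector: flag clips that refer to no object."""
    return audio_clip == "clip_noise"

def segment_by_text(video: str, expression: str) -> list[str]:
    """Stub text-based referring segmentation (SaSaSa2VA in the paper)."""
    # Pretend the video has two frames; return one mask label per frame.
    return [f"mask({video}, frame={t}, expr={expression!r})" for t in range(2)]

def asr_sasasa2va(video: str, audio_clip: str) -> list[str]:
    # Step 1: filter out audio that does not refer to any target object.
    if is_no_target(audio_clip):
        return []  # empty prediction for no-target expressions
    # Step 2: convert the audio cue into a textual motion description.
    expression = transcribe_audio(audio_clip)
    # Step 3: reuse a pre-trained text-based segmentation model.
    return segment_by_text(video, expression)

print(asr_sasasa2va("video1", "clip_dog"))    # two per-frame masks
print(asr_sasasa2va("video1", "clip_noise"))  # [] (filtered out)
```

The design point this sketch captures is that no audio-visual fusion is trained end to end: the audio branch only produces text (or a no-target verdict), so the expensive segmentation backbone can be reused unchanged.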
From arXiv: 2604.23935