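Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
1️⃣ One-Sentence Summary
This paper proposes a new method that synthesizes lip-synced video from static face images and real audio, addressing the difficulty of building audiovisual speech recognition systems for low-resource languages that lack annotated video data, and validates the approach on Catalan.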
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions, but it remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and much less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
Source: arXiv: 2603.08249