VIBEVOICE-ASR技术报告 / VIBEVOICE-ASR Technical Report
1️⃣ One-Sentence Summary
This report presents a general-purpose speech understanding framework called VibeVoice-ASR, which processes up to 60 minutes of audio in a single pass, unifying speech recognition, speaker diarization, and timestamp generation into one task. It supports many languages and code-switching scenarios, and it can use user-provided prompts to improve recognition accuracy for domain-specific terminology and ambiguous words.
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
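To make the "single end-to-end generation task" concrete, the sketch below shows one possible way transcription, speaker labels, and timestamps could be serialized into a single decoder target. The segment schema, speaker tags, and token format are illustrative assumptions, not the report's actual output specification.

```python
# Illustrative sketch only: the segment schema and serialization format below are
# assumptions made for clarity; the report does not publish its exact output format.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "SPK_1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcript text (language detected automatically)

def serialize(segments: list[Segment]) -> str:
    """Render ASR + diarization + timestamps as one generation target,
    so a single decoder can emit all three jointly instead of relying on
    a pipeline of separate models over chunked audio."""
    return "".join(
        f"<|{s.speaker}|><|{s.start:.2f}|>{s.text}<|{s.end:.2f}|>" for s in segments
    )

# Toy example with code-switching across speakers in one long recording.
segments = [
    Segment("SPK_1", 0.00, 4.20, "Welcome everyone to the quarterly review."),
    Segment("SPK_2", 4.20, 7.85, "谢谢, let's start with the ASR benchmarks."),
]
print(serialize(segments))
```

Under this framing, user-supplied context (e.g., attendee names or domain terms) would be prepended as a prompt to bias decoding toward the correct spellings and readings.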
Source: arXiv: 2601.18184