VIBEVOICE-ASR技术报告 / VIBEVOICE-ASR Technical Report
1️⃣ One-Sentence Summary
This report presents a general-purpose speech understanding framework called VibeVoice-ASR, which processes up to 60 minutes of audio in a single pass, unifying speech recognition, speaker diarization, and timestamp generation into one task. It supports many languages and code-switching scenarios, and it can use user-provided prompts to improve recognition accuracy for domain-specific terminology and ambiguous words.
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
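To make the "single end-to-end generation task" concrete, the sketch below shows one possible way transcription, speaker labels, and timestamps could be serialized into a single decoder target. The segment schema, speaker tags, and token format are illustrative assumptions, not the report's actual output specification.

```python
# Illustrative sketch only: the segment schema and serialization format below are
# assumptions made for clarity; the report does not publish its exact output format.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "SPK_1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcript text (language detected automatically)

def serialize(segments: list[Segment]) -> str:
    """Render ASR + diarization + timestamps as one generation target,
    so a single decoder can emit all three jointly instead of relying on
    a pipeline of separate models over chunked audio."""
    return "".join(
        f"<|{s.speaker}|><|{s.start:.2f}|>{s.text}<|{s.end:.2f}|>" for s in segments
    )

# Toy example with code-switching across speakers in one long recording.
segments = [
    Segment("SPK_1", 0.00, 4.20, "Welcome everyone to the quarterly review."),
    Segment("SPK_2", 4.20, 7.85, "谢谢, let's start with the ASR benchmarks."),
]
print(serialize(segments))
```

Under this framing, user-supplied context (e.g., attendee names or domain terms) would be prepended as a prompt to bias decoding toward the correct spellings and readings.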
Source: arXiv: 2601.18184