arXiv submission date: 2026-01-27
📄 Abstract - SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most active. This enrollment segment is used as fixed conditioning via cross-attention at each encoder layer. We further refine DiCoW with improved data segmentation, model initialization, and augmentation. Together, these advances yield substantial gains: SE-DiCoW reduces macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark.
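The self-enrollment step described in the abstract — scanning the diarization output for the window where the target speaker is most active — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `select_enrollment` helper, the fixed-window scan, and the scoring by total in-window target speech are all assumptions about how such a selector might look.

```python
def select_enrollment(segments, target, win=10.0, step=1.0):
    """Pick the window where the target speaker is most active.

    segments: list of (speaker, start, end) diarization intervals, in seconds.
    Returns the (start, end) of the best window.

    Hypothetical sketch of SE-DiCoW's enrollment selection; the paper's
    exact criterion and window handling may differ.
    """
    end_time = max(e for _, _, e in segments)
    best_start, best_activity = 0.0, -1.0
    t = 0.0
    while t + win <= end_time + 1e-9:
        # Total target-speaker speech overlapping the window [t, t + win).
        activity = sum(
            max(0.0, min(e, t + win) - max(s, t))
            for spk, s, e in segments
            if spk == target
        )
        if activity > best_activity:
            best_start, best_activity = t, activity
        t += step
    return best_start, best_start + win
```

For example, with diarization output `[("A", 0, 2), ("B", 1, 5), ("A", 8, 12)]` and target speaker `"A"`, a 4-second window lands on the later interval, where speaker A talks continuously.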

Top-level tags: audio, natural language processing, systems
Detailed tags: speaker diarization, speech recognition, multi-speaker ASR, cross-attention, model conditioning

SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper


1️⃣ One-Sentence Summary

This paper presents an improved speech recognition method that automatically selects the segment of the conversation where the target speaker is most active and uses it as a fixed enrollment reference. This resolves the speaker-identity ambiguity that arises when multiple speakers fully overlap, and substantially improves transcription accuracy on multilingual, multi-domain benchmarks.
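The conditioning mechanism — encoder frames attending to a fixed enrollment segment via cross-attention — can be illustrated with a toy single-head block. This is a simplified NumPy sketch under assumed shapes, not Whisper's actual (multi-head, learned-projection) layers: `cross_attend` and its residual form are illustrative assumptions.

```python
import numpy as np

def cross_attend(frames, enroll):
    """Single-head cross-attention: encoder frames (queries) attend to
    fixed enrollment embeddings (keys/values), with a residual add.

    frames: (T, d) encoder frame embeddings.
    enroll: (E, d) embeddings of the self-enrolled segment.

    Toy sketch of the per-encoder-layer conditioning in SE-DiCoW; the real
    model uses learned Q/K/V projections and multiple heads.
    """
    d = frames.shape[-1]
    scores = frames @ enroll.T / np.sqrt(d)            # (T, E) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over enrollment
    return frames + weights @ enroll                   # residual conditioning
```

Because the enrollment segment is fixed for the whole utterance, every frame receives the same reference for the target speaker's voice, which is what disambiguates fully overlapping speakers with near-identical STNO masks.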

Source: arXiv: 2601.19194