arXiv submission date: 2026-01-27
📄 Abstract - SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most active. This enrollment segment is used as fixed conditioning via cross-attention at each encoder layer. We further refine DiCoW with improved data segmentation, model initialization, and augmentation. Together, these advances yield substantial gains: SE-DiCoW reduces macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark.
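The self-enrollment step described in the abstract — scanning the diarization output for the window where the target speaker is most active — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `select_enrollment` helper, the fixed-window scan, and the scoring by total in-window target speech are all assumptions about how such a selector might look.

```python
def select_enrollment(segments, target, win=10.0, step=1.0):
    """Pick the window where the target speaker is most active.

    segments: list of (speaker, start, end) diarization intervals, in seconds.
    Returns the (start, end) of the best window.

    Hypothetical sketch of SE-DiCoW's enrollment selection; the paper's
    exact criterion and window handling may differ.
    """
    end_time = max(e for _, _, e in segments)
    best_start, best_activity = 0.0, -1.0
    t = 0.0
    while t + win <= end_time + 1e-9:
        # Total target-speaker speech overlapping the window [t, t + win).
        activity = sum(
            max(0.0, min(e, t + win) - max(s, t))
            for spk, s, e in segments
            if spk == target
        )
        if activity > best_activity:
            best_start, best_activity = t, activity
        t += step
    return best_start, best_start + win
```

For example, with diarization output `[("A", 0, 2), ("B", 1, 5), ("A", 8, 12)]` and target speaker `"A"`, a 4-second window lands on the later interval, where speaker A talks continuously.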

Top-level tags: audio, natural language processing, systems
Detailed tags: speaker diarization, speech recognition, multi-speaker ASR, cross-attention, model conditioning

SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper


1️⃣ One-Sentence Summary

This paper presents an improved speech recognition method that automatically selects the segment of the conversation where the target speaker is most active and uses it as a fixed enrollment reference. This resolves the speaker-identity ambiguity that arises when multiple speakers fully overlap, and substantially improves transcription accuracy on multilingual, multi-domain benchmarks.
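The conditioning mechanism — encoder frames attending to a fixed enrollment segment via cross-attention — can be illustrated with a toy single-head block. This is a simplified NumPy sketch under assumed shapes, not Whisper's actual (multi-head, learned-projection) layers: `cross_attend` and its residual form are illustrative assumptions.

```python
import numpy as np

def cross_attend(frames, enroll):
    """Single-head cross-attention: encoder frames (queries) attend to
    fixed enrollment embeddings (keys/values), with a residual add.

    frames: (T, d) encoder frame embeddings.
    enroll: (E, d) embeddings of the self-enrolled segment.

    Toy sketch of the per-encoder-layer conditioning in SE-DiCoW; the real
    model uses learned Q/K/V projections and multiple heads.
    """
    d = frames.shape[-1]
    scores = frames @ enroll.T / np.sqrt(d)            # (T, E) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over enrollment
    return frames + weights @ enroll                   # residual conditioning
```

Because the enrollment segment is fixed for the whole utterance, every frame receives the same reference for the target speaker's voice, which is what disambiguates fully overlapping speakers with near-identical STNO masks.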

Source: arXiv: 2601.19194