Abstract - An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi bengaliai-asr whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a diarization error rate (DER) of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post-processing yielded a word error rate (WER) of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: this https URL
Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low-resource languages, pyannote, voice activity detection
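For readers unfamiliar with the transcription metric quoted above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the competition's scoring code, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    a real scorer may apply extra normalization (punctuation, Unicode).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of 0.38 corresponds to roughly 38 word-level errors (substitutions, deletions, and insertions combined) per 100 reference words.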
An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization
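1️⃣ One-Sentence Summary
Targeting Bengali, a low-resource language, this study uses a multistage approach that combines a fine-tuned Whisper model for speech transcription with an integrated pyannote model for speaker diarization, addressing the problem of "who spoke when, and what" in recordings up to an hour long and significantly improving performance on both tasks.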