End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

📄 Abstract - End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions

Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and demonstrating competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
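To make components (i) and (iv) more concrete, the following is a minimal sketch of how a state machine can constrain decoding so that every segment follows the pattern speaker tag, start timestamp, one or more words, end timestamp. It is not the authors' implementation: the token strings (`<child>`, `<adult>`, `<t=...>`, `<eos>`), the state names, and the `constrained_decode` function are all hypothetical, and the per-step score dictionaries stand in for real decoder logits.

```python
from enum import Enum, auto

class State(Enum):
    EXPECT_SPEAKER = auto()  # a segment must open with a speaker tag (or the output ends)
    EXPECT_START = auto()    # speaker tag seen, start timestamp required
    EXPECT_WORD = auto()     # start timestamp seen, at least one word required
    IN_WORDS = auto()        # inside the transcript: more words or the end timestamp
    DONE = auto()

# Hypothetical token vocabulary, for illustration only.
SPEAKER_TAGS = {"<child>", "<adult>"}
EOS = "<eos>"

def token_kind(tok):
    """Classify a token as 'speaker', 'eos', 'time', or 'word'."""
    if tok in SPEAKER_TAGS:
        return "speaker"
    if tok == EOS:
        return "eos"
    if tok.startswith("<t=") and tok.endswith(">"):
        return "time"
    return "word"

# For each state: which token kinds are allowed, and the state each one leads to.
TRANSITIONS = {
    State.EXPECT_SPEAKER: {"speaker": State.EXPECT_START, "eos": State.DONE},
    State.EXPECT_START: {"time": State.EXPECT_WORD},
    State.EXPECT_WORD: {"word": State.IN_WORDS},
    State.IN_WORDS: {"word": State.IN_WORDS, "time": State.EXPECT_SPEAKER},
}

def constrained_decode(step_scores):
    """Greedy decoding that only ever picks tokens the state machine allows.

    step_scores: list of {token: score} dicts, one per decoding step.
    Returns the chosen tokens and the final state.
    """
    state = State.EXPECT_SPEAKER
    output = []
    for scores in step_scores:
        if state is State.DONE:
            break
        allowed = TRANSITIONS[state]
        # Mask out every token whose kind is invalid in the current state.
        candidates = {t: s for t, s in scores.items() if token_kind(t) in allowed}
        if not candidates:
            raise ValueError(f"no valid token available in state {state.name}")
        best = max(candidates, key=candidates.get)
        output.append(best)
        state = allowed[token_kind(best)]
    return output, state

# Toy scores: the unconstrained argmax would often be structurally invalid.
steps = [
    {"hello": 0.9, "<child>": 0.5, "<adult>": 0.1},
    {"hello": 0.8, "<t=0.00>": 0.3},
    {"<t=0.50>": 0.9, "hello": 0.2},
    {"<t=0.50>": 0.9, "there": 0.1},
    {"<eos>": 0.7, "<adult>": 0.2, "hi": 0.9},
]
tokens, final_state = constrained_decode(steps)
print(tokens)
```

In this toy run the highest-scoring token is invalid at four of the five steps, and the mask forces a well-formed segment instead. The sketch only enforces token order; it does not check that timestamps increase, which a fuller version would likely also need.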


1️⃣ One-Sentence Summary

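This paper proposes a unified end-to-end model that performs speech recognition and distinguishes child from adult speakers at the same time. Compared with traditional cascaded approaches, it reduces error propagation and produces speaker-labeled conversation transcripts more efficiently and accurately.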
