End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions
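1️⃣ One-sentence summary
This paper proposes a unified end-to-end model that performs speech recognition and distinguishes child from adult speakers at the same time. Compared with conventional cascaded pipelines, it reduces error propagation and produces speaker-attributed dialogue transcripts more efficiently and accurately.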
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and demonstrating competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
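To make item (iv) more concrete, the following is a minimal sketch of how a state machine could restrict decoding so that every segment follows the pattern speaker tag, start timestamp, text, end timestamp. It is not the paper's implementation: the token strings (`<child>`, `<adult>`, `<eos>`, `<|0.00|>`-style timestamps), the state names, the requirement of at least one text token per segment, and the per-step score dictionaries standing in for decoder logits are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    EXPECT_SPEAKER = auto()  # a new segment (or end of sequence) may begin
    EXPECT_START = auto()    # the start timestamp must come next
    EXPECT_TEXT = auto()     # at least one text token is required (assumption)
    IN_TEXT = auto()         # more text, or the end timestamp
    DONE = auto()


# Token kinds permitted in each state (hypothetical grammar).
ALLOWED = {
    State.EXPECT_SPEAKER: {"speaker", "eos"},
    State.EXPECT_START: {"time"},
    State.EXPECT_TEXT: {"text"},
    State.IN_TEXT: {"text", "time"},
    State.DONE: set(),
}


def kind_of(token: str) -> str:
    """Classify a token string; the token formats are illustrative only."""
    if token in ("<child>", "<adult>"):
        return "speaker"
    if token == "<eos>":
        return "eos"
    if token.startswith("<|") and token.endswith("|>"):
        return "time"
    return "text"


def step(state: State, token: str) -> State:
    """Advance the state machine by one token, rejecting invalid tokens."""
    kind = kind_of(token)
    if kind not in ALLOWED[state]:
        raise ValueError(f"{token!r} is not allowed in state {state.name}")
    if state is State.EXPECT_SPEAKER:
        return State.DONE if kind == "eos" else State.EXPECT_START
    if state is State.EXPECT_START:
        return State.EXPECT_TEXT
    if state is State.EXPECT_TEXT:
        return State.IN_TEXT
    # state is IN_TEXT: stay on text, close the segment on a timestamp
    return State.IN_TEXT if kind == "text" else State.EXPECT_SPEAKER


def forced_greedy_decode(step_scores: list[dict[str, float]]) -> list[str]:
    """Greedy decoding where invalid token kinds are masked at every step."""
    state = State.EXPECT_SPEAKER
    output: list[str] = []
    for scores in step_scores:
        if state is State.DONE:
            break
        # Keep only the candidates the current state permits.
        candidates = {t: s for t, s in scores.items()
                      if kind_of(t) in ALLOWED[state]}
        if not candidates:
            raise ValueError(f"no valid token available in state {state.name}")
        token = max(candidates, key=candidates.get)
        output.append(token)
        state = step(state, token)
    return output


# Toy scores: the unconstrained argmax would start with plain text,
# which the state machine rules out.
toy_scores = [
    {"<child>": 0.1, "hello": 0.9},
    {"<|0.00|>": 0.2, "hello": 0.8},
    {"hello": 0.6, "<|0.50|>": 0.4},
    {"<|0.50|>": 0.7, "there": 0.3},
    {"<eos>": 0.5, "hi": 0.5},
]
decoded = forced_greedy_decode(toy_scores)
```

In a real decoder the same masking would be applied to the logits before sampling or beam search; the dictionary lookup here only keeps the sketch dependency-free.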
Source: arXiv: 2601.17640