arXiv submission date: 2026-04-29
📄 Abstract - Text-Utilization for Encoder-dominated Speech Recognition Models

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
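The abstract mentions dynamic downsampling inside the encoder to reach text-level (token-rate) representations. A minimal sketch of the general idea, assuming segment boundaries are supplied externally (e.g. from non-blank CTC predictions); the function name, boundary source, and pooling choice are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def downsample_to_text_rate(frames, boundary_mask):
    """Average-pool runs of encoder frames so the output sequence length
    approaches the text (token) rate. `boundary_mask[t]` marks frame t as
    the start of a new segment. Illustrative sketch only."""
    segments, current = [], []
    for t, frame in enumerate(frames):
        if boundary_mask[t] and current:
            segments.append(np.mean(current, axis=0))
            current = []
        current.append(frame)
    if current:
        segments.append(np.mean(current, axis=0))
    return np.stack(segments)

# 10 encoder frames of 4-dim features; segment boundaries at frames 0, 3, 7
frames = np.random.randn(10, 4)
mask = [True, False, False, True, False, False, False, True, False, False]
out = downsample_to_text_rate(frames, mask)
print(out.shape)  # (3, 4) -- three pooled, token-rate representations
```

The point of the operation is that once the encoder output is at roughly one vector per token, a much smaller decoder suffices, which matches the paper's finding that a larger encoder with a smaller decoder can equal larger-decoder architectures.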

Top-level tags: audio machine learning model training
Detailed tags: speech recognition text-only data integration encoder-dominated models modality matching dynamic downsampling

Text-Utilization for Encoder-dominated Speech Recognition Models


1️⃣ One-sentence summary

This paper studies how to efficiently exploit text-only data in encoder-dominated speech recognition models. Using techniques such as modality matching and dynamic downsampling, simple configurations (e.g. random duration models) achieve better recognition than more complex alternatives, and the authors show that enlarging the encoder while shrinking the decoder can match or even surpass traditional large-decoder architectures.
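The "random duration model" named above refers to injecting text-only data by expanding token embeddings into pseudo-speech-rate sequences with randomly drawn per-token durations. A minimal sketch under that assumption; the duration range and function name are hypothetical choices for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_duration_upsample(token_embs, min_dur=2, max_dur=8):
    """Repeat each text-token embedding a random number of times so the
    text sequence mimics the frame rate of speech features, allowing
    text-only data to pass through a speech-trained encoder/decoder.
    Duration bounds here are illustrative assumptions."""
    reps = rng.integers(min_dur, max_dur + 1, size=len(token_embs))
    return np.repeat(token_embs, reps, axis=0)

tokens = np.random.randn(5, 16)           # 5 tokens, 16-dim embeddings
pseudo_speech = random_duration_upsample(tokens)
print(pseudo_speech.shape)                # (N, 16), with 10 <= N <= 40
```

The appeal of the random-duration approach, per the summary, is that it avoids training a separate duration predictor while still closing enough of the modality gap for the text-only data to help.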

From arXiv: 2604.26514